Take the 2-minute tour ×
Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

I wrote a script that is automating the archiving of specific files based on given parameters. The script is crashing on files that are over 500 MB when searching by "special id" or "date", values that should be in the "special" element. There should only be one "special" element per xml file, but in some cases the xml file was created using bad data and there are multiple "special" elements. In this case, it is alright to just check the first "special" element, check for the given values, and then break if those values are not found.

import argparse, sys, os, shutil

def get_parsed_args():
    parser = argparse.ArgumentParser(description='Archive old data')

    parser.add_argument('-d', action='store', dest='directory', help='main directory for files (optional, assumes current directory if blank and always assumes subdirectories of "current" and "archived")')

    parser.add_argument('-t', action='store', dest='type', help='input type, and enum of one of the following: files, date, specialid')

    parser.add_argument('-l', action='append', dest='dlist', default=[], help='list of the data type given above, although if the type is "date" this will only use the first parameter')

    return parser.parse_args()

def get_directory(results):
    tempdir = ""
    if results.directory:
        tempdir = results.directory
        if len(tempdir) > 0 and directory[-1] != "/":
            tempdir = tempdir + "/"
    return tempdir

def archive_file(fname, cdir, adir):
    print "Archiving file: " + fname
    shutil.move(cdir+fname, adir+fname)

results = get_parsed_args()

valid_types = ['files','date','specialid']

directory = get_directory(results)

curdir = directory + "current/"
arcdir = directory + "archived/"
extension = ".xml"

#check that the directory has current/archived folders in it
if not(os.path.exists(curdir)) or not(os.path.exists(arcdir)):
    print "Error: Missing current or archived folder in directory path"
    sys.exit(0)

if not(results.type) or not(results.type in valid_types):
    print "Error: Invalid or empty type parameter, quitting program"
    sys.exit(0)

dtype = results.type

if len(results.dlist) <= 0:
    print "Error: No list of values given, quitting program"
    sys.exit(0) 

#date should just take the first value, all others will take the list of objects
if dtype == "date":
    dlist = results.dlist[0]
else:
    dlist = results.dlist

if dtype == "files":
    for f in dlist:
        print f
        if os.path.exists(curdir + f):
            archive_file(f, curdir, arcdir)
else:
    from lxml import etree
    xmlparser = etree.XMLParser()
    files = [file for file in os.listdir(curdir) if file.lower().endswith(extension)]
    for f in files:
        print "Checking file: " + f 
        data = etree.parse(open(curdir+f),xmlparser)
        root = data.getroot()
        for element in root.iter("special"):
            if dtype == "specialid":
                if element.get("id") in dlist:
                    archive_file(f, curdir, arcdir)
                break
            elif dtype == "date":
                for e in element.iter("date"):
                    if e.text < dlist:
                        archive_file(f, curdir, arcdir)
                    break
                break

Example XML to input into this:

<?xml version="1.0" encoding="UTF-8"?>
<our_object xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schemaurl" schemaVersion="2.0">
 <source id="0">
  <name>Source Name</name>
  <our_id>05</our_id>
  <datetime>2011-06-29T14:53:52</datetime>
 </source>
 <special id="12345">
  <date>2011-11-08</date>
  <special_type>GenericType</special_type>
  <state_id>05</state_id>
 </special>
 <state id="05">
  <name>StateName</name>
 </state>
 <locality id="001">
  <name>Town1</name>
  <state_id>05</state_id>
  <type>Town</type>
 </locality>
 <locality id="002">
  <name>Town2</name>
  <state_id>05</state_id>
  <type>Town</type>
 </locality>
 <locality id="003">
  <name>Town3</name>
  <state_id>05</state_id>
  <type>Town</type>
 </locality>
</our_object>

Edited as per suggestions below

share|improve this question
add comment

1 Answer

up vote 1 down vote accepted

I'm not real experienced with XML in python so I'll just stick to coding style and such.
I would do things like putting code in methods:

def get_parsed_args():
    parser = argparse.ArgumentParser(description='Archive old election data')

    parser.add_argument('-d', action='store', dest='directory', help='main directory for files (optional, assumes current directory if blank and always assumes subdirectories of "current" and "archived")')

    parser.add_argument('-t', action='store', dest='type', help='input type, and enum of one of the following: files, date, specialid')

    parser.add_argument('-l', action='append', dest='dlist', default=[], help='list of the data type given above, although if the type is "date" this will only use the first parameter')
    return parser.parse_args()

parsed_args = get_parsed_args()

Just on a minimum it would help with the whole global variables that you seem to need so many.

Name things in a more explicit form for example the

for f in files:

isn't as easy to understand as say

for file in files:

or

for current_file in files:

and if they were in their own function you wouldn't need worry about the other variables in different scopes.

also:

print "Archiving file: " + f
shutil.move(curdir+f, arcdir+f)

is used in three place. I'm a fan of the rule of three. (if you use it three or more times then refactor it.)

Just as a note if you had a sample xml it would be easier to understand the presented code. EDIT:-
so after looking at this again I realised you could

  1. extension is used like a constant so it should be EXTENSION (probably same with valid_types ) This is a coding style issue
  2. I would do a with open(curdir + f) as xml_doc: data = etree.parse AFAIK thats how you should open files. always.
  3. I would name f as current_file (or curfile if you wish). Coding style issue.

EDIT2:- found iterparse which is supposed to be quicker.

with open(curdir + current_file) as xml_doc:                               
            context = etree.iterparse(xml_doc,tag="special")                       

the following has all the changes I made. (command: ./mycode.py -t=specialid -l 12345 123456 123 )

#!/usr/bin/python                                                                  
import argparse                                                                    
import sys                                                                         
import os                                                                          
import shutil                                                                      
from lxml import etree                                                             

def get_parsed_args():                                                             
    parser = argparse.ArgumentParser(description='Archive old data')               

    parser.add_argument('-d', required=False, action='store', dest='directory', help='main director
y for files (optional, assumes current directory if blank and always assumes subdir
ectories of "current" and "archived")')                                            

    parser.add_argument('-t',  required=True, action='store', dest='type', help='input type, and en
um of one of the following: files, date, specialid')                               

    parser.add_argument('-l', required=True, nargs='+', dest='dlist', default=[], help='list
 of the data type given above, although if the type is "date" this will only use th
e first parameter')                                                                

    return parser.parse_args()                                                     

def get_directory(results):                                                        
    tempdir = ""                                                                   
    if results.directory:                                                          
        tempdir = results.directory                                                
        if len(tempdir) > 0 and directory[-1] != "/":                              
            tempdir = tempdir + "/"                                                
    return tempdir + "current/", tempdir + "archived/"                             

def archive_file(fname, cdir, adir):                                               
    print "Archiving file: " + fname                                               
    shutil.move(cdir+fname, adir+fname)                                            

results = get_parsed_args()                                                        
valid_types = ['files','date','specialid']                                         

curdir, arcdir = get_directory(results)                                            

EXTENSION = ".xml"                                                                 

#check that the directory has current/archived folders in it                       
if not(os.path.exists(curdir)) or not(os.path.exists(arcdir)):                     
    print "Error: Missing current or archived folder in directory path"            
    sys.exit(0)                                                                    

if not(results.type) or not(results.type in valid_types):                          
    print "Error: Invalid or empty type parameter, quitting program"               
    sys.exit(0)                                                                    

dtype = results.type                                                               

if len(results.dlist) <= 0:                                                        
    print "Error: No list of values given, quitting program"                       
    sys.exit(0)                                                                    

#date should just take the first value, all others will take the list of objects   
if dtype == "date":                                                                
    dlist = results.dlist[0]                                                       
else:                                                                              
    dlist = results.dlist                                                          

if dtype == "files":                                                               
    for f in dlist:                                                                
        print f                                                                    
        if os.path.exists(curdir + f):                                             
            archive_file(f, curdir, arcdir)                                        
else:                                                                              
    files = [file for file in os.listdir(curdir) if file.lower().endswith(EXTENSION
)]                                                                                 
    for current_file in files:                                                     
        print "Checking file: " + current_file                                     
        context = None                                                                

        with open(curdir + current_file) as xml_doc:                               
            context= etree.iterparse(xml_doc, tag="special")                                  

            if context == None:                                                           
                next                      

            for action, element in context:                                       
                if dtype == "specialid":                                               
                    if element.get("id") in dlist:                                     
                        archive_file(current_file, curdir, arcdir)                     
                elif dtype == "date":                                                  
                    for date_element in element.iter("date"):                          
                        if date_element.text > dlist:                                  
                            archive_file(current_file, curdir, arcdir)                 
                        break # first Date only                                        
                break# first special only
share|improve this answer
    
I incorporated many of your comments and provided a sample xml file, what other suggestions would you make as far as making sections of code their own function to clean this up? –  Mike J Aug 24 '11 at 17:35
    
@mike Thanks i'll take a look at the xml. Acoupl of coding style things: the PEP8 states that imports should be at the top of the file and each a seperate import eg import argparse; import os; etcetc; from from lxml import etree –  James Khoury Aug 24 '11 at 23:44
    
python.org/dev/peps/pep-0008 for the PEP8 –  James Khoury Aug 25 '11 at 0:04
    
@Mike I've updated my answer. Not sure if it will help your 500mb files though. –  James Khoury Aug 26 '11 at 6:10
1  
thanks for your response, iterparse cut down substantial on the time to find specific elements. Using that I did a google search and also found this link ibm.com/developerworks/xml/library/x-hiperfparse which led me to an even more optimal solution. Thanks for your help in solving this issue. –  Mike J Aug 26 '11 at 14:46
show 3 more comments

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.