Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I wrote a Python program that performs an integration of data using Simpson's rule. The program takes the areas to be integrated from a foo.ref file, with the following syntax:

# peak          m/z             charge  window
LAG_H4N4F1S1    4244.829467     1+      0.06
LAG_H4N4F1S1    4245.832508     1+      0.06
LAG_H4N4F1S1    4246.835549     1+      0.06
LAG_H4N4F1S1    4247.838590     1+      0.06

The data that the program parses is taken from all the .xy files in the current directory, with the following syntax:

3497.7552 0
3497.7619 410646
3497.7685 397637
3497.7752 363672
3497.7819 321723
3497.7886 286960

The program's final output is a tab-separated file that can be opened directly in Excel. The performance is a lot worse than I was expecting, however, and I was wondering if anyone could offer some tricks to improve it (and also have a general look at the code, to check whether it is correct Python).

#! /usr/bin/env python
import glob
import os
import fileinput
print "<SimpInt.py> a script to quickly calculate the area for a set"
print "of analytes. Written by Bas Jansen, Leiden University Medical"
print "Center, March 20th 2013."
print ""

# Simpson's rule
def integrate(y_vals,h):
  i=1
  total=y_vals[0]+y_vals[-1]
  for y in y_vals[1:-1]:
    if i%2 == 0:
      total+=2*y
    else:
      total+=4*y
    i+=1
  return total*(h/3.0)

# Reads the reference file
ref=[]
for refs in glob.glob("*.ref"):
  fr=open(refs)
  for line in fr.readlines():
    parts=line.rstrip('\n').split('\t')
    ref.append(parts)
  fr.close()

# Initialize the output file
f=open('output.xls','w')
for i in range(1,len(ref)):
  f.write(ref[i][0]+"\n")
f.close()

# Find XY files in current directory
xy_files=[]
for files in glob.glob("*.xy"):
  xy_files.append(files)
  x=[]
  area=[]
  print "Parsing: "+files
  fh=open(files)
  for line in fh.readlines():
    y=[float(i) for i in line.split()]
    x.append(y)
  fh.close()

  # Pick the XY window
  print "Progress indicator:"
  print "0%"
  for i in range(1,len(ref)):
    obs=[]
    vals=0
    low=float(ref[i][1])-float(ref[i][3])
    high=float(ref[i][1])+float(ref[i][3])
    # Progress indicator
    if i %(len(ref)/10) == 0:
      print str(i/(len(ref)/10))+"0%"
    for j in range(0,len(x)):
      if x[j][0] > low and x[j][0] < high:
        coords=[float(l) for l in x[j]]
        obs.append(coords)
        vals+=1
    if obs:
      y_vals=[]
      h=(obs[len(obs)-1][0]-obs[0][0])/len(obs)
      for k in range(0,len(obs)):
        y_vals.append(obs[k][1])
      area.append(integrate(y_vals,h))

  # Neat trick to append the area from other input files
  l = 0
  for line in fileinput.input('output.xls',inplace=1):
    print "%s\t%s\n" % (line.rstrip(),area[l]),
    l+=1

# Display the legend (& filenames) at bottom of output file
f=open('output.xls','a')
legend = "Peak"
for m in range(0,len(xy_files)):
  legend+="\t"+str(xy_files[m])
legend+="\n"
f.write(legend)
f.close()

# Being polite
raw_input("Finished, press any key to exit")

I have performed various tests today, disabling chunks of the code to see what causes the biggest increase in processing time, and it appears to be the searching of the lists.
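Since the .xy data is sorted by m/z, that linear scan over x for every reference row could be narrowed with a binary search. A sketch using the standard bisect module (this is not part of the original script; the data values are taken from the sample above, and the window bounds are illustrative):

```python
import bisect

# x holds [m/z, intensity] pairs sorted by m/z, as in the .xy files above
x = [[3497.7552, 0], [3497.7619, 410646], [3497.7685, 397637],
     [3497.7752, 363672], [3497.7819, 321723], [3497.7886, 286960]]
mz = [pair[0] for pair in x]             # the sorted m/z column

low, high = 3497.76, 3497.78             # illustrative window bounds
start = bisect.bisect_right(mz, low)     # first index with m/z > low
stop = bisect.bisect_left(mz, high)      # first index with m/z >= high
obs = x[start:stop]                      # the points strictly inside the window
```

This finds the window in O(log n) per reference row instead of scanning every point.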

Use NumPy and SciPy for numerical tasks. scipy.integrate.simps –  Janne Karila Mar 21 '13 at 8:14
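The SciPy routine the comment points at can replace the hand-written loop. A minimal sketch, assuming NumPy and SciPy are installed (in current SciPy the function is spelled simpson; simps was the older alias):

```python
import numpy as np
from scipy.integrate import simpson  # named `simps` in older SciPy releases

x = np.linspace(0.0, np.pi, 101)     # 101 evenly spaced sample points
y = np.sin(x)                        # sampled integrand
area = simpson(y, x=x)               # composite Simpson's rule over the samples
```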

1 Answer

#! /usr/bin/env python
import glob
import os
import fileinput
print "<SimpInt.py> a script to quickly calculate the area for a set"
print "of analytes. Written by Bas Jansen, Leiden University Medical"
print "Center, March 20th 2013."
print ""

# Simpson's rule
def integrate(y_vals,h):

The recommendation for Python is four spaces for indentation; you appear to be using two.

  i=1
  total=y_vals[0]+y_vals[-1]

It's also recommended to have spaces around binary operators

  for y in y_vals[1:-1]:

Use for i, y in enumerate(y_vals[1:-1], 1): rather than counting your own i value

    if i%2 == 0:
      total+=2*y
    else:
      total+=4*y
    i+=1
  return total*(h/3.0)
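Putting the points above together (four-space indents, spaced operators, enumerate), the function could look like this sketch:

```python
def integrate(y_vals, h):
    """Composite Simpson's rule over evenly spaced samples y_vals."""
    total = y_vals[0] + y_vals[-1]
    for i, y in enumerate(y_vals[1:-1], 1):
        total += 2 * y if i % 2 == 0 else 4 * y
    return total * (h / 3.0)
```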

# Reads the reference file
ref=[]

I'd recommend not using an abbreviation here

for refs in glob.glob("*.ref"):
  fr=open(refs)

It's recommended to use a with block:

  with open(refs) as fr:
      do stuff with file
  file will be automatically closed here

That ensures your file will be closed even if an exception occurs
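A sketch of the reference-file reading with a with block (the function name and demo file name are illustrative, not from the original script):

```python
def read_reference(path):
    """Read tab-separated reference rows, skipping '#' header lines."""
    rows = []
    with open(path) as handle:          # closed automatically, even on error
        for line in handle:
            if line.startswith("#"):    # don't store the header row
                continue
            rows.append(line.rstrip("\n").split("\t"))
    return rows

# tiny demonstration with an illustrative file name
with open("demo.ref", "w") as demo:
    demo.write("# peak\tm/z\tcharge\twindow\n")
    demo.write("LAG_H4N4F1S1\t4244.829467\t1+\t0.06\n")
rows = read_reference("demo.ref")
```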

  for line in fr.readlines():
    parts=line.rstrip('\n').split('\t')
    ref.append(parts)

You shouldn't really store the header line in ref here.

  fr.close()



# Initialize the output file
f=open('output.xls','w')
for i in range(1,len(ref)):
  f.write(ref[i][0]+"\n")
f.close()



# Find XY files in current directory
xy_files=[]
for files in glob.glob("*.xy"):
  xy_files.append(files)
  x=[]
  area=[]
  print "Parsing: "+files
  fh=open(files)
  for line in fh.readlines():

Actually this is the same as for line in fh:

    y=[float(i) for i in line.split()]
    x.append(y)
  fh.close()

  # Pick the XY window
  print "Progress indicator:"
  print "0%"
  for i in range(1,len(ref)):

Whenever possible, don't iterate over indexes. Instead, iterate over the lists themselves, so something like for line in ref[1:]:. If you really need the indexes, use enumerate.
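For example (the reference list here mirrors the sample row from the question, header included):

```python
reference = [["# peak", "m/z", "charge", "window"],
             ["LAG_H4N4F1S1", "4244.829467", "1+", "0.06"]]

windows = []
for row in reference[1:]:                # skip the header, no index needed
    low = float(row[1]) - float(row[3])
    high = float(row[1]) + float(row[3])
    windows.append((low, high))
```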

    obs=[]
    vals=0
    low=float(ref[i][1])-float(ref[i][3])
    high=float(ref[i][1])+float(ref[i][3])
    # Progress indicator
    if i %(len(ref)/10) == 0:
      print str(i/(len(ref)/10))+"0%"

Why divide by 10 just to add the 0 back?
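A sketch of computing the percentage directly instead (250 is an illustrative row count):

```python
total = 250                               # illustrative number of reference rows
percentages = []
for i in range(1, total):
    if i % (total // 10) == 0:            # every tenth of the way through
        percentages.append(100 * i // total)
```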

    for j in range(0,len(x)):

Use for row in x: and then avoid the indexing

      if x[j][0] > low and x[j][0] < high:
        coords=[float(l) for l in x[j]]
        obs.append(coords)
        vals+=1

What's this for?

    if obs:
      y_vals=[]
      h=(obs[len(obs)-1][0]-obs[0][0])/len(obs)
      for k in range(0,len(obs)):
        y_vals.append(obs[k][1])
      area.append(integrate(y_vals,h))

  # Neat trick to append the area from other input files
  l = 0
  for line in fileinput.input('output.xls',inplace=1):

If you need to iterate over multiple sources at once, use a zip.

    print "%s\t%s\n" % (line.rstrip(),area[l]),
    l+=1

Your neat trick is what's killing you. fileinput is designed for processing the output of some other program; it's not good for continually appending to the end of lines you have already written. What you should do is store everything in a big list and then output it at the end.
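A sketch of that restructuring: collect one list of areas per input file, then write the whole table (plus the legend line) in a single pass. All names and the sample data here are illustrative:

```python
def write_table(path, peaks, areas_per_file, filenames):
    """Write one row per peak, one area column per input file."""
    with open(path, "w") as out:
        for row in zip(peaks, *areas_per_file):     # peak name + its areas
            out.write("\t".join(str(cell) for cell in row) + "\n")
        out.write("Peak\t" + "\t".join(filenames) + "\n")   # legend line

write_table("demo_output.txt", ["p1", "p2"],
            [[1.0, 2.0], [3.0, 4.0]], ["a.xy", "b.xy"])
lines = open("demo_output.txt").read().splitlines()
```

This also uses zip as suggested above, pairing each peak with its area from every file.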

# Display the legend (& filenames) at bottom of output file
f=open('output.xls','a')
legend = "Peak"
for m in range(0,len(xy_files)):
  legend+="\t"+str(xy_files[m])
legend+="\n"
f.write(legend)
f.close()

# Being polite
raw_input("Finished, press any key to exit")
That's a lot of comments, thank you :) The reason why I use my 'trick' is that I don't know how many data files a user wants to parse; typical numbers would be between 100 and 10,000 files, and as such I assumed that keeping all of that in memory is not really an option, right? –  Bas Jansen Mar 20 '13 at 17:39
@BasJansen, actually I think you'd be fine storing that much in memory. If not, it still makes sense to rearrange the program to output one line at a time rather than what you've done. –  Winston Ewert Mar 20 '13 at 17:54
I divide by 10 and then add the '0%' back to ensure that it actually lists 10%, 20% and so on; removing the division would just display 00%, 00% and so on. I will use your other comments about indexing, with open (for files) and more, however, so thank you –  Bas Jansen Mar 21 '13 at 8:58
