I wrote a Python program that integrates data using Simpson's rule. The program reads the areas to be integrated from a foo.ref file, with the following syntax:
```
# peak m/z charge window
LAG_H4N4F1S1 4244.829467 1+ 0.06
LAG_H4N4F1S1 4245.832508 1+ 0.06
LAG_H4N4F1S1 4246.835549 1+ 0.06
LAG_H4N4F1S1 4247.838590 1+ 0.06
```
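For clarity, one line of this format can be unpacked like this (a minimal sketch; `parse_ref_line` is a hypothetical helper, and the field names follow my reading of the header comment):

```python
# Hypothetical helper: split one tab-separated .ref line into
# (peak, m/z, charge, window), converting the numeric fields.
def parse_ref_line(line):
    peak, mz, charge, window = line.rstrip('\n').split('\t')
    return peak, float(mz), charge, float(window)

fields = parse_ref_line("LAG_H4N4F1S1\t4244.829467\t1+\t0.06")
# fields == ('LAG_H4N4F1S1', 4244.829467, '1+', 0.06)
```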
The data that the program parses is taken from all the .xy files in the current directory, with the following syntax:
```
3497.7552 0
3497.7619 410646
3497.7685 397637
3497.7752 363672
3497.7819 321723
3497.7886 286960
```
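The per-file parsing amounts to something like the sketch below (`load_xy` is a hypothetical helper, not the script's exact code); each line becomes an `[x, y]` pair of floats:

```python
import os
import tempfile

# Minimal sketch: load one .xy file into a list of [x, y] float pairs,
# the same structure the script builds in its `x` list.
def load_xy(path):
    points = []
    with open(path) as fh:
        for line in fh:
            if line.strip():  # skip blank lines
                points.append([float(v) for v in line.split()])
    return points

# Tiny self-contained demo using a temporary file.
tmp = tempfile.NamedTemporaryFile('w', suffix='.xy', delete=False)
tmp.write("3497.7552 0\n3497.7619 410646\n")
tmp.close()
points = load_xy(tmp.name)
os.unlink(tmp.name)
# points == [[3497.7552, 0.0], [3497.7619, 410646.0]]
```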
The program's final output is a tab-separated file that can be opened directly in Excel. The performance, however, is a lot worse than I was expecting, and I was wondering if anyone could offer some tricks to improve it (and also have a general look at the code, to check that it is correct Python).
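One way to find out where the time actually goes is the standard library's cProfile module; a minimal sketch (here `work()` is just a stand-in for the script's main loop):

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for the script's main loop.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Collect the ten most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
report = stream.getvalue()
```

Reading `report` (or simply running `python -m cProfile SimpInt.py`) shows which functions dominate the runtime before any optimization is attempted.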
```python
#! /usr/bin/env python
import glob
import os
import fileinput

print "<SimpInt.py> a script to quickly calculate the area for a set"
print "of analytes. Written by Bas Jansen, Leiden University Medical"
print "Center, March 20th 2013."
print ""

# Simpson's rule
def integrate(y_vals,h):
    i=1
    total=y_vals[0]+y_vals[-1]
    for y in y_vals[1:-1]:
        if i%2 == 0:
            total+=2*y
        else:
            total+=4*y
        i+=1
    return total*(h/3.0)

# Reads the reference file
ref=[]
for refs in glob.glob("*.ref"):
    fr=open(refs)
    for line in fr.readlines():
        parts=line.rstrip('\n').split('\t')
        ref.append(parts)
    fr.close()

# Initialize the output file
f=open('output.xls','w')
for i in range(1,len(ref)):
    f.write(ref[i][0]+"\n")
f.close()

# Find XY files in current directory
xy_files=[]
for files in glob.glob("*.xy"):
    xy_files.append(files)
    x=[]
    area=[]
    print "Parsing: "+files
    fh=open(files)
    for line in fh.readlines():
        y=[float(i) for i in line.split()]
        x.append(y)
    fh.close()
    # Pick the XY window
    print "Progress indicator:"
    print "0%"
    for i in range(1,len(ref)):
        obs=[]
        vals=0
        low=float(ref[i][1])-float(ref[i][3])
        high=float(ref[i][1])+float(ref[i][3])
        # Progress indicator
        if i %(len(ref)/10) == 0:
            print str(i/(len(ref)/10))+"0%"
        for j in range(0,len(x)):
            if x[j][0] > low and x[j][0] < high:
                coords=[float(l) for l in x[j]]
                obs.append(coords)
                vals+=1
        if obs:
            y_vals=[]
            h=(obs[len(obs)-1][0]-obs[0][0])/len(obs)
            for k in range(0,len(obs)):
                y_vals.append(obs[k][1])
            area.append(integrate(y_vals,h))
    # Neat trick to append the area from other input files
    l = 0
    for line in fileinput.input('output.xls',inplace=1):
        print "%s\t%s\n" % (line.rstrip(),area[l]),
        l+=1

# Display the legend (& filenames) at bottom of output file
f=open('output.xls','a')
legend = "Peak"
for m in range(0,len(xy_files)):
    legend+="\t"+str(xy_files[m])
legend+="\n"
f.write(legend)
f.close()

# Being polite
raw_input("Finished, press any key to exit")
```
I performed various tests today, disabling chunks of the code to see what causes the biggest increase in processing time, and it appears to be the searching of the lists.
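If the .xy points are sorted by their x value (which the sample data suggests), that linear scan over `x` for every reference peak could be replaced by a binary search with the standard bisect module. A rough sketch of the idea, under that sortedness assumption (`window` is a hypothetical helper):

```python
import bisect

# Sketch, assuming the (x, y) points are sorted by x: locate the slice
# of points inside the open interval (low, high) by binary search
# instead of scanning the whole list for every reference peak.
def window(points, xs, low, high):
    lo = bisect.bisect_right(xs, low)   # first index with x > low
    hi = bisect.bisect_left(xs, high)   # first index with x >= high
    return points[lo:hi]

points = [(1.0, 10), (2.0, 20), (3.0, 30), (4.0, 40)]
xs = [p[0] for p in points]             # build once per file
selected = window(points, xs, 1.5, 3.5)
# selected == [(2.0, 20), (3.0, 30)]
```

This keeps the same strict `low < x < high` condition as the inner loop above, but does O(log n) work per peak instead of O(n), which should matter given how many peaks are checked against each file.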