Python: Optimisation of Autocorrelation function

Question

New to coding and have written a script from scratch to do an autocorrelation function
it's too slow, I have to run it on a few thousand files over the next few days
I'm sure it's terribly written (and you can see my brain ticking in basic ways through it), if anyone can speed it's performance I'd be incredibly appreciative.

I know that python has an inbuilt autocorrelation function, but I think it's restricted to the sense that they use the word in physics, this is for checking residency times of a molecule in a biomolecule simulation and I couldn't see how to apply the default.

import sys

import time
start = time.time()

fileName = sys.argv[1]

time = []
water = []

#make two 1d arrays for time and water

with open(fileName, "r") as input:
    for line in input:
        time.append(line.split()[0])
        water.append(line.split()[1])

# define a totaltimescale in picoseconds - the minus 1 is 
# because of the zero index
# windowsize is 2 ns

totaltimescale = 50000 - 1
windowsize = 2000

# endtau should take into consideration the multiplier
# below so 400 with a multiplier of 5 gives us 

endtau = 400
alltau = range(1, endtau)

for tau in alltau:

# there is a tau multiplier (i.e. use every 5th tau
# or every 5 picoseconds to speed up calculation)

    tau = 5*tau
    print tau

# these three values just start counters

    counter = 0
    noncounter = 0
    output = 0

# to set up a value for total number of windows we're interested in,
# leave off 2 times the windowsize at the end arbitrarily so it doesn't 
# jump too far, may get errors eventually for this so may have to be wary

    numberofwindows = totaltimescale - 2*windowsize
    listofwindowstartingpoints = range(0, numberofwindows)

# outer iterator which ensures the window (of size start - start + windowsize)
# is moved down one unit each time, i.e. starts at the next picosecond

    for startingpoint in listofwindowstartingpoints:
        window = range(startingpoint, startingpoint + windowsize)
        for a in window:
            if a + tau < totaltimescale:
                if water[a] == water[a + tau]:
                counter = float(counter + 1)
                else:
                    noncounter = noncounter + 1
                totaljumpsanalysed = float(counter + noncounter)
                output = counter/totaljumpsanalysed
            else:
                pass

# counters should have kept values

    print counter
    print totaljumpsanalysed
    print output

    with open(fileName + '.autocorrelate', "a") as outfile:
        outfile.writelines('{0}     {1}{2}'.format(tau, output, '\n'))

print 'It took', time.time()-start, 'seconds.'

It's running on files that look like this (but the time column, the first one, goes up to 50000, the second column is the index of the molecule that goes into the spot)

1.0       24561     0.1255
2.0       24561     0.1267
3.0          -1     0.2265
4.0       24561     0.1095
5.0       24561     0.1263
6.0          -1     0.1598
7.0          -1     0.2024
8.0          -1     0.1798
9.0          -1     0.1823
10.0      24409     0.1291
11.0      24409     0.1288
12.0         -1     0.1537
13.0         -1     0.1853
14.0         -1     0.1625
15.0      24561     0.1300
16.0      24561     0.1342
17.0      24221     0.1294
18.0      24561     0.1165
19.0      24561     0.0611
20.0      24561     0.1259
21.0       6345     0.1284
22.0         -1     0.1421
23.0         -1     0.1435
24.0         -1     0.2599
25.0         -1     0.2659
26.0         -1     0.1961
27.0      24213     0.1398
28.0      24213     0.1325
29.0         -1     0.1809
30.0         -1     0.2044

and the output typically looks a bit something like this where the first column is tau and the second 'output' or the probability that it's the molecule with the same index at that point tau

5       0.674404921846
10      0.59412323094
15      0.543056729494
20      0.503867421031
25      0.470009065414
30      0.442317789517
35      0.417826474489
40      0.398214015522
45      0.379222841801
50      0.360768038436
55      0.344779093024
60      0.330382117003
65      0.316864844888
70      0.306959596948
75      0.29386854062

Thanks guys!

For a beginner this is incredibly well-written, clean code. I see very few points that can be trivially improved, besides the suggestions made in the answers below. — Konrad Rudolph, Aug 8 '12 at 14:19

Kevin · Accepted Answer · 2012-08-08 12:45:48Z

My first suggestion is to look into parallelization. I believe your problem is one that can be made embarassingly parallel. Each iteration of your loop is independent of each other iteration, so you could run all of them simultaneously. With sufficiently beefy hardware you could get an increase in efficiency proportional to len(alltau) * len(window).

Also, you're doing some unnecessary work in your big loop.

You're calculating output fifty thousand times, when you could just calculate it exactly once after the loop ends.
You can keep track of totaljumpsanalysed by incrementing it once per loop. No noncounter needed.
You don't necessarily need to convert counter to a float. Remember, python supports integers of arbitrary size, so you don't need to worry about overflow.
else: pass does nothing, so you can remove it.

totaljumpsanalysed = 0
for startingpoint in listofwindowstartingpoints:
    window = range(startingpoint, startingpoint + windowsize)
    for a in window:
        if a + tau < totaltimescale:
            if water[a] == water[a + tau]:
                counter += 1
            totaljumpsanalysed += 1

output = float(counter)/float(totaljumpsanalysed)

Thanks very much for the help! Now I'm getting an overflow error though codeTraceback (most recent call last): File "autocorrelationbatchtesting_08_08_2012_codereview.py", line 54, in <module> if water[a] == water[a + tau]: IndexError: list index out of range`code` It happens at 40000 now, I wrote a temp file to check, I can't think why that would be given there is the if clause in there? — cc211, Aug 8 '12 at 15:16
How many lines are in your input file? If len(water) is less than totaltimescale, then your if guard won't prevent you from going beyond the end of the list. — Kevin, Aug 8 '12 at 15:34

Hooked · Answer 2 · 2012-08-08 13:40:19Z

If you are dead-set on writting your own AC function, there is one suggestion that is better than any of those below or those by Kevin in the other post. USE NUMPY. The code is begging for the speedups by the numpy indexing properties. If pure Python speed is a concern there a couple of simple things to note:

Since it looks like Python < 3.0, change range -> xrange. The former creates the list, the latter creates an iterator.
a + tau is computed multiple times every cycle of the inner loop.

asked	1 year ago
viewed	563 times
active	1 year ago

current community

your communities

more stack exchange communities

Python: Optimisation of Autocorrelation function

migrated from stackoverflow.com Aug 8 '12 at 12:14

2 Answers

Your Answer

Not the answer you're looking for? Browse other questions tagged python optimization or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Python: Optimisation of Autocorrelation function

migrated from stackoverflow.com Aug 8 '12 at 12:14

2 Answers

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python optimization or ask your own question.

Related

Hot Network Questions