I need to run over a log file with plenty of entries (25+ GB) in Python to extract some log times. The snippet below works. On my test file I end up with roughly 9:30 min (CPython) and 4:30 min (PyPy) on 28 million records (a small log).
import datetime
import time
import json
import re

format_path = re.compile(
    r"^\S+\s-\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]"
)

# downtime threshold: one minute, expressed in seconds
threshold = time.strptime('00:01:00,000'.split(',')[0], '%H:%M:%S')
tick = datetime.timedelta(hours=threshold.tm_hour,
                          minutes=threshold.tm_min,
                          seconds=threshold.tm_sec).total_seconds()
zero_time = datetime.timedelta(hours=0, minutes=0, seconds=0)
zero_tick = zero_time.total_seconds()
format_date = '%d/%b/%Y:%H:%M:%S'

obj = {}
test = open('very/big/log', 'r')
for line in test:
    try:
        # cut at the timezone offset ('+'), drop everything before the
        # first '-' (the host), then split the remainder on spaces
        chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')
        user = chunk[1]
        if user[0] == "C":
            this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
            try:
                machine = obj[user]
            except KeyError:
                machine = obj.setdefault(user, {"init": this_time,
                                                "last": this_time,
                                                "downtime": 0})
            last = machine["last"]
            diff = (this_time - last).total_seconds()
            if diff > tick:
                machine["downtime"] += diff - tick
            machine["last"] = this_time
    except Exception:
        pass
While I'm okay with the runtime, I'm wondering whether there are any obvious performance pitfalls I'm stepping into (I'm still in my first weeks of Python).
Question:
Can I do anything to make this run faster?
EDIT: Sample log entry:
['2001:470:1f14:169:15f3:824f:8a61:7b59 - SOFTINST [14/Nov/2012:09:32:31 +0100] "POST /setComputerPartition HTTP/1.1" 200 4 "-" "-" 102356']
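To make the split chain in the loop concrete, here is what it produces for the sample entry above (note this particular user starts with "S", so the real loop's `user[0] == "C"` check would skip it; the extraction steps are the same):

```python
import datetime

line = ('2001:470:1f14:169:15f3:824f:8a61:7b59 - SOFTINST '
        '[14/Nov/2012:09:32:31 +0100] "POST /setComputerPartition HTTP/1.1" '
        '200 4 "-" "-" 102356')

# Cut at the first '+' (the timezone offset), drop everything before the
# first '-' (the host), then split the remainder on spaces.
chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')
print(chunk)  # ['', 'SOFTINST', '[14/Nov/2012:09:32:31', '']

user = chunk[1]                 # 'SOFTINST'
stamp = chunk[2].split('[')[1]  # '14/Nov/2012:09:32:31'
this_time = datetime.datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S')
```

The leading empty string in `chunk` comes from the space right after the '-' separator.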
EDIT2:
One thing I just did was to check whether a log entry has the same log time as the previous one; if so, I reuse this_time from the previous iteration instead of calling this:
this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
I checked some of the logs: there are plenty of entries with the exact same time, so on the last run on PyPy I'm down to 2.43 min.
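The reuse described in EDIT2 amounts to a one-entry cache around strptime. A minimal sketch (the function and variable names here are illustrative, not from the original script):

```python
import datetime

format_date = '%d/%b/%Y:%H:%M:%S'

# One-entry cache: consecutive log lines often carry the same
# second-resolution timestamp, so reparse only when the string changes.
_prev_stamp = None
_prev_time = None

def parse_time(stamp):
    global _prev_stamp, _prev_time
    if stamp != _prev_stamp:
        _prev_stamp = stamp
        _prev_time = datetime.datetime.strptime(stamp, format_date)
    return _prev_time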