I need to run over a log file with plenty of entries (25+ GB) in Python to extract some log times. The snippet below works. On my test file I end up with roughly 9:30 min (CPython) and 4:30 min (PyPy) on 28 million records (a small log).
import datetime
import time
import json
import re

format_path = re.compile(
    r"^\S+\s-\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]"
)

# downtime threshold: one minute, expressed in seconds
threshold = time.strptime('00:01:00,000'.split(',')[0], '%H:%M:%S')
tick = datetime.timedelta(hours=threshold.tm_hour,
                          minutes=threshold.tm_min,
                          seconds=threshold.tm_sec).total_seconds()
zero_time = datetime.timedelta(hours=0, minutes=0, seconds=0)
zero_tick = zero_time.total_seconds()
format_date = '%d/%b/%Y:%H:%M:%S'

obj = {}
test = open('very/big/log', 'r')
for line in test:
    try:
        # cut at the timezone offset ('+'), drop everything before the
        # first '-' (the host), then split the remainder on spaces
        chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')
        user = chunk[1]
        if user[0] == "C":
            this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
            try:
                machine = obj[user]
            except KeyError:
                machine = obj.setdefault(user, {"init": this_time,
                                                "last": this_time,
                                                "downtime": 0})
            last = machine["last"]
            diff = (this_time - last).total_seconds()
            if diff > tick:
                machine["downtime"] += diff - tick
            machine["last"] = this_time
    except Exception:
        pass
While I'm okay with the runtime, I'm wondering whether there are any obvious performance pitfalls I'm stepping into (I'm still in my first weeks of Python).
Question:
Can I do anything to make this run faster?
EDIT: Sample log entry:
['2001:470:1f14:169:15f3:824f:8a61:7b59 - SOFTINST [14/Nov/2012:09:32:31 +0100] "POST /setComputerPartition HTTP/1.1" 200 4 "-" "-" 102356']
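To make the split chain in the loop concrete, here is what it produces for the sample entry above (note this particular user starts with "S", so the real loop's `user[0] == "C"` check would skip it; the extraction steps are the same):

```python
import datetime

line = ('2001:470:1f14:169:15f3:824f:8a61:7b59 - SOFTINST '
        '[14/Nov/2012:09:32:31 +0100] "POST /setComputerPartition HTTP/1.1" '
        '200 4 "-" "-" 102356')

# Cut at the first '+' (the timezone offset), drop everything before the
# first '-' (the host), then split the remainder on spaces.
chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')
print(chunk)  # ['', 'SOFTINST', '[14/Nov/2012:09:32:31', '']

user = chunk[1]                 # 'SOFTINST'
stamp = chunk[2].split('[')[1]  # '14/Nov/2012:09:32:31'
this_time = datetime.datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S')
```

The leading empty string in `chunk` comes from the space right after the '-' separator.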
EDIT2:
One thing I just did was to check whether a log entry has the same log time as the previous one; if so, I reuse this_time from the previous iteration instead of calling this:
this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
I checked some of the logs: there are plenty of entries with the exact same time, so on the last run on PyPy I'm down to 2.43 min.
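The reuse described in EDIT2 amounts to a one-entry cache around strptime. A minimal sketch (the function and variable names here are illustrative, not from the original script):

```python
import datetime

format_date = '%d/%b/%Y:%H:%M:%S'

# One-entry cache: consecutive log lines often carry the same
# second-resolution timestamp, so reparse only when the string changes.
_prev_stamp = None
_prev_time = None

def parse_time(stamp):
    global _prev_stamp, _prev_time
    if stamp != _prev_stamp:
        _prev_stamp = stamp
        _prev_time = datetime.datetime.strptime(stamp, format_date)
    return _prev_time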