My first useful projects as a programmer have been Python scripts that parse relevant information out of log files and do some analysis. I've bumped around and found my way to some working solutions, but I have a sneaking suspicion there are more efficient approaches.
I will outline my current process in four basic steps:
- Clean up the source data: In the more involved scenarios I have a text file that is generally some sort of CSV variant. "Generally" because it might require a first pass to clean up outlier lines before I can effectively use the `csv` module.
- Write clean data to temporary text file: After cleaning up each line, I write the line to a fresh text file.
- Read the cleaned temp file back in using the `csv` module: I've assumed that reading the data in with the standard `csv` module would be a reasonably efficient method, and ideal because I can then easily extract values from specific columns in each line.
- Extract relevant values: Now I can easily traverse the whole file, grabbing relevant data. I append the data to lists, which I then use for the actual analysis.
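The four steps above can be sketched roughly like this (the `clean_line` fix-ups and the column indices are hypothetical placeholders; the real outlier handling depends on the log format):

```python
import csv
import os
import tempfile

def clean_line(line):
    # Placeholder cleanup; substitute the format-specific outlier fixes here.
    return line.strip()

def extract_columns(path, col_indices):
    """Steps 1-4: clean, write a temp file, re-read it with csv, extract."""
    # Steps 1 + 2: clean each line and write it to a temporary file.
    with open(path) as src, tempfile.NamedTemporaryFile(
            mode='w', suffix='.csv', delete=False) as tmp:
        for line in src:
            tmp.write(clean_line(line) + '\n')
        tmp_path = tmp.name

    # Steps 3 + 4: read the cleaned file back and collect the wanted columns.
    columns = [[] for _ in col_indices]
    with open(tmp_path, newline='') as f:
        for row in csv.reader(f):
            for out, i in zip(columns, col_indices):
                out.append(row[i])
    os.remove(tmp_path)
    return columns
```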
The big red flag for me is that I'm traversing all of my data so many times. Maybe I should spend more time finding patterns in the data so I can extract the important values on the first pass? Also, with larger logs (20,000+ lines) one of my scripts takes 15-30 seconds, which seems rather slow.
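One way to collapse the multiple passes into one: `csv.reader` accepts any iterable of strings, so a generator of cleaned lines can feed it directly, and the data is cleaned, parsed, and extracted in a single traversal with no temporary file. A minimal sketch (again, `clean` is a placeholder for the real fix-ups):

```python
import csv

def cleaned_lines(path):
    """Lazily yield cleaned lines; nothing is written back to disk."""
    with open(path) as src:
        for line in src:
            # Placeholder cleanup; substitute the real outlier fixes here.
            yield line.strip()

def analyze(path, col_indices):
    """Single pass: csv.reader consumes the generator directly."""
    columns = [[] for _ in col_indices]
    for row in csv.reader(cleaned_lines(path)):
        for out, i in zip(columns, col_indices):
            out.append(row[i])
    return columns
```

Because everything is lazy, memory use stays flat even on large logs, and each line is touched exactly once.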
What are the areas for optimization? Be it a modification of the current design or a completely different approach.
`cleanup file.txt | analyze` to avoid the temporary file. – U2EF1 Feb 9 '14 at 1:00
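The commenter's pipeline works if the analysis script reads from standard input instead of a named file. A minimal sketch of the `analyze` side (the script names come from the comment; the row-counting "analysis" here is purely illustrative):

```python
import csv
import sys

def analyze_stream(stream):
    """Consume already-cleaned CSV from a stream, e.g. piped-in stdin."""
    totals = {}
    for row in csv.reader(stream):
        # Illustrative analysis: count rows per value of the first column.
        totals[row[0]] = totals.get(row[0], 0) + 1
    return totals

if __name__ == '__main__':
    # Invoked as:  cleanup file.txt | analyze
    for key, count in analyze_stream(sys.stdin).items():
        print(key, count)
```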