
My first useful projects as a programmer have been Python scripts that parse relevant information out of log files and do some analysis. I've bumped around and found my way to some working solutions, but I have a sneaking suspicion there are more efficient approaches.

I will outline my current process in four basic steps:

  1. Clean up the source data: In the more involved scenarios I have a text file that is generally some sort of CSV variant. "Generally" because it might require a first pass to clean up outlier situations before I can effectively use the CSV module.
  2. Write clean data to temporary text file: After cleaning up each line, I write the line to a fresh text file.
  3. Read in the clean, formatted temp text file using the CSV module: I've assumed that reading the data in with the standard CSV module would be a reasonably efficient method, and ideal because I can then easily extract values from specific columns in each line.
  4. Extract relevant values: Now I can easily traverse the whole file grabbing relevant data. I append the data to lists which I use to do the actual analysis.
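The four steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the asker's actual code: the file names, the semicolon-to-comma cleaning rule, and the sample data are all assumptions.

```python
import csv

# Hypothetical sample log standing in for the real source data.
with open("raw.log", "w") as f:
    f.write("2024-01-01 10:00; GET; 120\nbad line ; POST; 340\n")

def clean_line(line):
    # Step 1: normalize an outlier line into a plain CSV row
    # (the real cleaning rule would be log-format specific).
    return ",".join(field.strip() for field in line.split(";"))

# Step 2: write each cleaned line to a temporary text file.
with open("raw.log") as src, open("clean.csv", "w") as tmp:
    for line in src:
        tmp.write(clean_line(line) + "\n")

# Steps 3-4: re-read with the csv module and collect a relevant column.
durations = []
with open("clean.csv", newline="") as f:
    for row in csv.reader(f):
        durations.append(int(row[2]))
```

Note that the data is read from disk twice (steps 2 and 3), which is the repeated traversal the question worries about.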

The big red flag for me is that I'm traversing all of my data so many times. Maybe I should spend more time trying to find patterns in the data so I can extract the important values on the first pass? Also, with larger logs (20,000+ lines), one of my scripts takes 15-30 seconds. That seems rather slow.
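A single-pass variant is possible because `csv.reader` accepts any iterable of strings, not just a file object. The sketch below (file name, cleaning rule, and sample data are illustrative assumptions) cleans each line lazily with a generator and feeds it straight into the CSV reader, so the file is read once and no temporary file is written:

```python
import csv

# Hypothetical sample data for illustration.
with open("raw.log", "w") as f:
    f.write("2024-01-01; GET; 120\n2024-01-02; POST; 340\n")

def cleaned(lines):
    # Yield each raw line as a normalized CSV row, one at a time.
    for line in lines:
        yield ",".join(field.strip() for field in line.split(";"))

durations = []
with open("raw.log") as src:
    # csv.reader consumes the generator: clean and parse in one pass.
    for row in csv.reader(cleaned(src)):
        durations.append(int(row[2]))
```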

Where are the opportunities for optimization, whether through a modification of the current design or a completely different approach?


closed as too broad by gnat, MichaelT, GlenH7, Bart van Ingen Schenau, Michael Kohne Feb 12 at 13:39

There are either too many possible answers, or good answers would be too long for this format. Please add details to narrow the answer set or to isolate an issue that can be answered in a few paragraphs. If this question can be reworded to fit the rules in the help center, please edit the question.

    
I would usually do something like: cleanup file.txt | analyze to avoid the temporary file. –  U2EF1 Feb 9 at 1:00
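The pipe idea in this comment can be implemented by writing the cleanup step as a filter over stdin/stdout. A small sketch (the `clean` rule and the composition `python cleanup.py < file.txt | python analyze.py` are hypothetical):

```python
import io
import sys

def clean(line):
    # Normalize one raw line into a CSV row (illustrative rule).
    return ",".join(field.strip() for field in line.split(";"))

def run(src=sys.stdin, dst=sys.stdout):
    # As a filter, this reads from the pipe and writes cleaned rows out.
    for line in src:
        dst.write(clean(line) + "\n")

# Because run() accepts any file-like objects, it can be exercised
# without a real pipe:
out = io.StringIO()
run(io.StringIO("a; b; c\n"), out)
```

Keeping the filter as a plain function over file objects makes the same code usable both in a shell pipeline and inside a single script.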

1 Answer

A slightly different approach I would take is to load the cleaned data into a database rather than back into another flat file. From there you can traverse and query the data much more easily, instead of taking multiple passes through files.
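A minimal sketch of this suggestion using Python's built-in `sqlite3` module (the table name, columns, and sample rows are made up for illustration): load the cleaned rows once, then let SQL do the traversal.

```python
import sqlite3

# Hypothetical cleaned rows as produced by the parsing step.
rows = [("2024-01-01", "GET", 120), ("2024-01-02", "POST", 340)]

conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("CREATE TABLE log (ts TEXT, method TEXT, duration INTEGER)")
conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)

# One pass loaded the data; each further analysis is just a query,
# not another traversal of the source file.
slow = conn.execute(
    "SELECT COUNT(*) FROM log WHERE duration > 200"
).fetchone()[0]
```

An on-disk SQLite file also bundles cleanly with a PyInstaller executable, since `sqlite3` is part of the standard library and needs no server.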

    
One restriction is that I have been using PyInstaller to create a portable executable. It seems using a database would complicate this? Also, I don't have database experience yet. Could you point me in the direction of something basic that plays nice with Python? –  Anticipation Feb 7 at 21:28
    
SQLite is a good option here -- it is pretty self-contained and can be distributed in a portable manner. –  Wyatt Barnett Feb 7 at 23:08
