Someone thinks poorly of my server log parser

Question

I have just been informed that the following code written by me is extremely poor. I have absolutely no idea why. It is memory efficient, and looks clean to me. But still the feedback is very poor. I have no clue on why. If someone can put some comments, I will be highly grateful. I have to pass file name from command line to make it work - actually that is what they asked.

import re
import numpy as np
from collections import Counter
import sys

def parse_server_log(path_to_file):

    #First check whether it's a legal file name
    try:
        f = open(path_to_file)
    except:
        print "\nInvalid file name and/or path. Please correct!\n"
        return
    # end of sanity check on file

    # The following regexes would extract the concerned values from each line
    # of the file.
    method_regx = re.compile("(method=)+[A-Z]+[\s]+")
    path_regx = re.compile("(path=)+[/\w.\"]+[\s]+")
    dyno_regx = re.compile("(dyno=)+[\w.]+[\s]+")
    connect_regx = re.compile("(connect=)+[\w.]+[\s]+")
    service_regx = re.compile("(service=)+[\w.]+[\s]+")

    # Target values for each urls
    url1 = [0, [], []]
    url2 = [0, [], []]
    url3 = [0, [], []]
    url4 = [0, [], []]
    url5 = [0, [], []]
    url6 = [0, [], []]

    # url matcher regex for each url type
    url1_regex = re.compile("(#)+(/api/users/)+[\d]+(/count_pending_messages)+(#)+")
    url2_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_messages)+(#)+")
    url3_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_friends_progress)+(#)+")
    url4_regex = re.compile("(#)+(/api/users/)+[\d]+(/get_friends_score)+(#)+")
    url5_6_regex = re.compile("(#)+(/api/users/)+[\d]+(#)+")


    with open(path_to_file) as lines:
        for my_data in lines:

            # Now lets separate out the method, path, dyno, connect and service times
            k = method_regx.search(my_data)
            try:            
                line_method = k.group(0).split("=")[1].strip()
            except:
                line_method = ""

            k = path_regx.search(my_data)
            try:
                # The hashes are added at the start and end to make sure the path
                # is not a part of a string rather the entire string!
                line_path = "#" + k.group(0).split("=")[1].strip()  + "#"            
            except:
                line_path = ""

            k = dyno_regx.search(my_data)
            try:            
                line_dyno = k.group(0).split("=")[1].strip()          
            except:
                line_dyno = ""

            k = connect_regx.search(my_data)
            try:            
                line_connect_time = float(k.group(0).split("=")[1].split("ms")[0])            
            except:
                line_connect_time = 0

            k = service_regx.search(my_data)
            try:            
                line_service_time = float(k.group(0).split("=")[1].split("ms")[0])             
            except:
                line_service_time = 0

            # End of getting all the data

            # Now match up the URL and do this under sanity checker
            if(len(line_method) > 0 and len(line_path) > 0):
                url_denoter = 0
                if url1_regex.match(line_path) is not None:
                    url_denoter = 1
                elif url2_regex.match(line_path) is not None:
                    url_denoter = 2
                elif url3_regex.match(line_path) is not None:
                    url_denoter = 3
                elif url4_regex.match(line_path) is not None:
                    url_denoter = 4
                elif url5_6_regex.match(line_path) is not None:
                    url_denoter = 5            



                # OK so now we have the url to which the current url matched
                if(url_denoter==1 and line_method=="GET"):
                    """
                    for GET /api/users/{user_id}/count_pending_messages
                   """
                    url1[0] += 1
                    url1[1].append(line_dyno)
                    url1[2].append(line_connect_time + line_service_time)

                elif(url_denoter==2 and line_method=="GET"):
                    """
                    for GET /api/users/{user_id}/get_messages
                    """
                    url2[0] += 1
                    url2[1].append(line_dyno)
                    url2[2].append(line_connect_time + line_service_time)


    """
    Now print the results!

    """

    # for GET /api/users/{user_id}/count_pending_messages
    print "\n------GET /api/users/{user_id}/count_pending_messages----\n"
    print "Number of times the url is called: ", url1[0]
    if(url1[0]>0):
        my_num_list = url1[2]
        print "Average response time: ", round(np.average(my_num_list), 2), " in ms."
        print "Median response time: ", round(np.median(my_num_list), 2), " in ms."
        print "Mode of response time: ", round(Counter(my_num_list).most_common(1)[0][0], 2), " in ms."
        counts = [(x, url1[1].count(x)) for x in set(url1[1])]
        swap = 0
        tar_dyno = ""
        for count in counts:
            if(count[1]> swap):
                swap = count[1]
                tar_dyno = count[0]

        print "Dyno that responded the most: ", tar_dyno   
    else:
        print "Sorry no parameters can be calculated as the url has not been accessed!"


    print "\n------GET /api/users/{user_id}/get_messages----\n"
    print "Number of times the url is called: ", url2[0]
    if(url2[0]>0):
        my_num_list = url2[2]
        print "Average response time: ", round(np.average(my_num_list), 2), " in ms."
        print "Median response time: ", round(np.median(my_num_list), 2), " in ms."
        print "Mode of response time: ", round(Counter(my_num_list).most_common(1)[0][0], 2), " in ms."
        counts = [(x, url2[1].count(x)) for x in set(url2[1])]
        swap = 0
        tar_dyno = ""
        for count in counts:
            if(count[1]> swap):
                swap = count[1]
                tar_dyno = count[0]

        print "Dyno that responded the most: ", tar_dyno   
    else:
        print "Sorry no parameters can be calculated as the url has not beenaccessed!"


    print "\n------GET /api/users/{user_id}/get_friends_progress----\n"
    print "Number of times the url is called: ", url3[0]


if(__name__=="__main__"):
    parse_server_log(sys.argv[1])

Although you already have an excellent answer, I would encourage you to discuss the matter with the person who reviewed your code. In particular, did they give you specific feedback, does it line up with the feedback you've gotten here, and how can you best apply the feedback to this and future projects? — GalacticCowboy, 5 hours ago
Your code cuts up logs into something useful. Why didn't you call it LumberMill? #missedopportunity — corsiKa, 2 hours ago
your regex are weird. For example (#)+(/api/users/)+[\d]+(#)+ indicates that each group can be present several times? I don't think + does what you think it does. Also, later you are not using the fact that you have groups. — njzk2, 1 hour ago

jonrsharpe · Accepted Answer · 2014-11-04 11:48:08Z

Firstly, Python has a style guide and (unless you're given a different guide, in which case please provide a link to it) you should follow it. Your code generally follows it, but note that the imports are in the wrong order, it should be:

from collections import Counter
import re
import sys

import numpy as np

Note alphabetical order, and a split between standard library and 3rd-party modules.

Repetition is a big issue when writing good code, and this was an immediate red flag:

url1 = [0, [], []]
url2 = [0, [], []]
url3 = [0, [], []]
url4 = [0, [], []]
url5 = [0, [], []]
url6 = [0, [], []]

Why not make a list, urls, containing these structures? You can cut this to one line, so if you change the structure later you only do it once:

urls = [[0, [], []] for _ in range(6)]

There are numerous other elements of repetition in your code, which can be reduced in a similar way.

parse_server_log is much too long. You should split it up into smaller, self-contained functions with sensible parameters and return values, which will make your code much easier to read, understand and maintain. Each should have a docstring explaining exactly what it does. This will also help in reducing repetition.

Bare except is a bad idea - at the very least, you should use except Exception, but much better is to figure out what errors could occur, and to handle them explicitly. See e.g. "The evils of except".

You should use string formatting, e.g.

print "Average response time: ", round(np.average(my_num_list), 2), " in ms."

should be

print "Average response time: {0:.2f} in ms.".format(np.average(my_num_list))

Use tuple packing and unpacking, e.g.

for count in counts:
    if(count[1]> swap):
        swap = count[1]
        tar_dyno = count[0]

could be:

for dyno_, swap_ in counts:
    if swap_ > swap:
        swap, tar_dyno = swap_, dyno_

Your file check

try:
    f = open(path_to_file)
except:
    print "\nInvalid file name and/or path. Please correct!\n"
    return

never closes the file, which stays open through the whole function. You can check if a file exists in Python using the os module; do that instead.

A small point re your last point - if the file can't be opened there is nothing to close anyway. And it's usually better to try to open and handle the exception, rather than doing an existence check beforehand. Even an existence check can be misleading because the file might exist at one moment and be deleted the next. — Greg Hewgill, 3 hours ago
@GregHewgill you're right, but if the file isn't there the function ends anyway, and the current arrangement doesn't protect against file deletion either. — jonrsharpe, 3 hours ago

njzk2 · Answer 2 · 2014-11-04 22:07:30Z

+ in regex means Once or more. I think in your case you are mistaking it for a concatenation operator.

I think you could factorize a lot with the help of better regex. For example:

method_regx = re.compile("(method=)+[A-Z]+[\s]+")
# ...
k = method_regx.search(my_data)
try:
    line_method = k.group(0).split("=")[1].strip()
except:

Could be rewritten

method_regx = re.compile("method=([A-Z]+)[\s]+")
# ...
k = method_regx.search(my_data)
line_method = k.group(1) if k else ''

Since k will be None if the search did not work.

Likewise, you can probably find a regex that will do all the work and allow you to remove the split from

line_connect_time = float(k.group(0).split("=")[1].split("ms")[0])

asked	today
viewed	1164 times
active	today

current community

your communities

more stack exchange communities

Someone thinks poorly of my server log parser

2 Answers 2

Your Answer

Not the answer you're looking for? Browse other questions tagged python parsing regex python-2.7 logging or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Someone thinks poorly of my server log parser

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python parsing regex python-2.7 logging or ask your own question.

Related

Hot Network Questions