Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

My client needs their Google AdWords destination URL queries parsed and the values spell-checked to eliminate any typos ("use" instead of "us", etc.).

I'm pulling the data using the AdWords API and putting it into a dataframe for manipulation. Everything works, but each pull returns over 100,000 records, and the code sometimes takes hours to run. Is there a way to optimize the following code blocks?

from urllib.parse import urlparse, parse_qs  # on Python 2: from urlparse import urlparse, parse_qs

def parse_url(df):
    for index, row in df.iterrows():
        # Extract the query string and split it into a {key: [values]} dict.
        parsed = parse_qs(urlparse(str(row['Destination URL'])).query)
        for k, v in parsed.items():
            df.loc[index, k.strip()] = v[0].strip().lower()
    return df

import Levenshtein as lev  # python-Levenshtein package

def typo_correct(urlparams, df, dictionary):
    for index, row in df.iterrows():
        for w in urlparams:
            value = df.loc[index, w]
            if value is None or len(value) < 2 or w == 'account':
                continue
            high = 0.0
            word = None
            for item in dictionary:
                prob = lev.ratio(str(value), item)
                if prob == 1.0:
                    # Exact match: no correction needed.
                    high = prob
                    word = str(value)
                    break
                elif prob > high:
                    high = prob
                    word = item + "*"
            if high != 1.0:
                # No exact match found: flag the row with the best guess.
                df.loc[index, w] = word
                df.loc[index, 'Fix'] = "X"
    return df

Basically, the script parses out the query parameters and puts them into a dictionary. It takes the keys and creates headers in the dataframe, then the first function above iterates through the rows and puts the values in the correct locations.
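To illustrate the parsing step, here is a minimal sketch of what `urlparse` plus `parse_qs` produce for a single destination URL (the URL and its parameters are made up for the example):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical destination URL for illustration only.
url = "http://example.com/landing?campaign=spring%20sale&source=google"
params = parse_qs(urlparse(url).query)
print(params)  # {'campaign': ['spring sale'], 'source': ['google']}
```

Each value comes back as a list, which is why the function above takes `v[0]`.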

The second function then goes through each value, checks whether it appears in a dictionary text file, and uses the Levenshtein ratio to find the closest word in the case of a typo.
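The best-match search works like this sketch, shown here with the standard library's `difflib.SequenceMatcher` as a stand-in for `Levenshtein.ratio` (the dictionary words are invented for the example):

```python
from difflib import SequenceMatcher  # stdlib stand-in for Levenshtein.ratio

def best_match(value, dictionary):
    """Return the dictionary word most similar to `value` and its score."""
    best_word, best_score = None, 0.0
    for item in dictionary:
        score = SequenceMatcher(None, value, item).ratio()
        if score > best_score:
            best_word, best_score = item, score
    return best_word, best_score

word, score = best_match("use", ["us", "them"])
print(word, score)  # us 0.8
```

Both ratios are similarity scores in [0, 1], with 1.0 meaning an exact match, so the loop's `prob == 1.0` early exit is the same idea.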

I'm not sure whether this is something that can be done using map or apply, as I haven't been working with pandas for long. Does anyone have any suggestions?

