Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

My client needs their Google AdWords destination URL queries parsed and the values spell-checked to eliminate any typos ("use" instead of "us", etc.).

I'm pulling the data using the AdWords API and putting it into a dataframe for manipulation. Everything works, but each pull returns over 100,000 records, and the code sometimes takes hours to run. Is there a way to optimize the following code blocks?

from urllib.parse import urlparse, parse_qs  # on Python 2: from urlparse import urlparse, parse_qs

def parse_url(df):
    for index, row in df.iterrows():
        # Extract the query string and split it into a {key: [values]} dict.
        parsed = parse_qs(urlparse(str(row['Destination URL'])).query)
        for k, v in parsed.items():
            df.loc[index, k.strip()] = v[0].strip().lower()
    return df

import Levenshtein as lev  # python-Levenshtein package

def typo_correct(urlparams, df, dictionary):
    for index, row in df.iterrows():
        for w in urlparams:
            value = df.loc[index, w]
            if value is None or len(value) < 2 or w == 'account':
                continue
            high = 0.0
            word = None
            for item in dictionary:
                prob = lev.ratio(str(value), item)
                if prob == 1.0:
                    # Exact match: no correction needed.
                    high = prob
                    word = str(value)
                    break
                elif prob > high:
                    high = prob
                    word = item + "*"
            if high != 1.0:
                # No exact match found: flag the row with the best guess.
                df.loc[index, w] = word
                df.loc[index, 'Fix'] = "X"
    return df

Basically, the script parses out the query parameters and puts them into a dictionary. It takes the keys and creates headers in the dataframe, then the first function above iterates through the rows and puts the values in the correct locations.
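To illustrate the parsing step, here is a minimal sketch of what `urlparse` plus `parse_qs` produce for a single destination URL (the URL and its parameters are made up for the example):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical destination URL for illustration only.
url = "http://example.com/landing?campaign=spring%20sale&source=google"
params = parse_qs(urlparse(url).query)
print(params)  # {'campaign': ['spring sale'], 'source': ['google']}
```

Each value comes back as a list, which is why the function above takes `v[0]`.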

The second function then goes through each value, checks whether it appears in a dictionary text file, and uses the Levenshtein ratio to find the closest word in the case of a typo.
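The best-match search works like this sketch, shown here with the standard library's `difflib.SequenceMatcher` as a stand-in for `Levenshtein.ratio` (the dictionary words are invented for the example):

```python
from difflib import SequenceMatcher  # stdlib stand-in for Levenshtein.ratio

def best_match(value, dictionary):
    """Return the dictionary word most similar to `value` and its score."""
    best_word, best_score = None, 0.0
    for item in dictionary:
        score = SequenceMatcher(None, value, item).ratio()
        if score > best_score:
            best_word, best_score = item, score
    return best_word, best_score

word, score = best_match("use", ["us", "them"])
print(word, score)  # us 0.8
```

Both ratios are similarity scores in [0, 1], with 1.0 meaning an exact match, so the loop's `prob == 1.0` early exit is the same idea.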

I'm not sure whether this is something that can be done using map or apply, as I haven't been working with pandas for long. Does anyone have any suggestions?

