I did this exercise yesterday mostly as practice, but it has some day-to-day utility as well. I was basically attempting to take a string that looked like the following:
src=bing_adid=8312488564_kw=swiftcapital com_kwid=35030235383_mt=e_qs=swiftcapital.com_device=c
And create a series of columns based on the parameters within the string. Here you can see src, kw, kwid, mt, etc.
I was given a set of rules for the string and used those in my assumptions, as follows:
Exercise 1: Parsing out variables
- Any of the ten parameters might appear in a string (src, adid, kw, kwid, mt, dist, qs, adpos, device, placement).
- The value for a parameter always follows the equals sign, i.e. in src=bing, bing is the value being parsed.
- The value length for any given parameter is inconsistent.
- Not every parameter will appear in every string.
- Not every parameter will appear in the same position in every string.
- The parameters will always be delimited with an underscore.
Exercise 2: Absence of parameters (catch-all)
- If none of the ten parameters appears, the entire string should be captured in a catch-all column (i.e., Site Link).
In addition, I used the assumption that both an _ and an = character could occur within a value, which is why I elected to use regular expressions (this was as much for practical utility as it was to teach myself some more regex).
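To illustrate why a plain split on underscores isn't enough under that assumption, here's a quick sketch; the entry string is hypothetical, but the lookahead is the same one used in the solution below:

```python
import re

# hypothetical entry whose kw value contains both an underscore and a space
entry = 'src=bing_kw=swift_capital inc_mt=e'

# splitting on underscores breaks the value apart
print(entry.split('_'))  # ['src=bing', 'kw=swift', 'capital inc', 'mt=e']

# the lookahead boundary _[^_=\n]+= (underscore, parameter name, equals sign)
# only stops the match at a real "next parameter", so the full value survives
print(re.search(r'kw=(.*?)(?=_[^_=\n]+=|$)', entry).group(1))  # swift_capital inc
```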
Here's my solution using pandas and re; please let me know if you can think of an easier/smarter way in Python. (I have a feeling I could have hard-coded this with things like FIND() in Excel, but I want to avoid Excel whenever possible.)
# we use pandas and regular expression libraries
import pandas as pd
import re
# drop all but the 'input' column, and drop any rows where it is NaN
df = pd.read_csv('Parsing.csv')
df = df[df['input'].notnull()]
df = df[['input']]
# defining params to search for in our keywords
parameters = ['src', 'adid', 'kw', 'kwid', 'mt', 'dist', 'qs', 'adpos', 'device', 'placement']
new_params = [params+'=' for params in parameters]
# parsed values are collected in this dict, then converted back into a DataFrame
mock_df = {key: [] for key in ['comb_str='] + new_params + ['catchall']}
# nested loop that looks for the conditions using regular expressions and appends to our mock_df
for entry in df['input']:
    mock_df['comb_str='].append(entry)
    if any(params in entry for params in new_params):
        mock_df['catchall'].append('')
        for params in new_params:
            if params in entry:
                mock_df[params].append(re.search(params + r'(.*?)(?=_[^_=\n]+=|$)', entry).group(1))
            else:
                mock_df[params].append('')
    else:
        for params in new_params:
            mock_df[params].append('')
        mock_df['catchall'].append(entry)
# to get an idea what the new df looks like
df_upd = pd.DataFrame(data=mock_df)
df_upd
# port new DF to a csv for excel use.
df_upd.to_csv(path_or_buf='parsed.csv')
I'm hoping to get help removing complexity from my code and, if possible, making it more efficient. I'd imagine an input might be around 1 million of the above strings, with 1 million rows by (number of defined params) columns as output, so it can get messy in a hurry.
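One possible simplification (a sketch of my own, not part of the solution above): pandas can apply the same lookahead pattern column-wise with Series.str.extract, which avoids the Python-level row loop entirely. The sample rows here are made up for illustration; in practice df would come from read_csv as above:

```python
import pandas as pd

parameters = ['src', 'adid', 'kw', 'kwid', 'mt', 'dist', 'qs', 'adpos', 'device', 'placement']

# stand-in for the real input; second row has no parameters at all
df = pd.DataFrame({'input': [
    'src=bing_adid=8312488564_kw=swiftcapital com_kwid=35030235383_mt=e_qs=swiftcapital.com_device=c',
    'some plain site link',
]})

out = pd.DataFrame({'comb_str': df['input']})
for p in parameters:
    # same lookahead boundary as the row-by-row version, applied to the whole column
    out[p] = df['input'].str.extract(p + r'=(.*?)(?=_[^_=\n]+=|$)', expand=False).fillna('')

# rows where no parameter matched go to the catch-all column
matched = (out[parameters] != '').any(axis=1)
out['catchall'] = df['input'].where(~matched, '')
```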
comb_str is just re-importing the string itself and adding it back to the DataFrame in case it needs to be used as an index or something later.
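Alternatively, the nested loops could collapse into a single compiled pattern that captures every name=value pair per string, filtered to the known parameter names; this is again just a sketch under the same delimiter assumptions, and each resulting dict can feed pd.DataFrame directly:

```python
import re

PARAMS = {'src', 'adid', 'kw', 'kwid', 'mt', 'dist', 'qs', 'adpos', 'device', 'placement'}

# one name=value pair: a name (no '_' or '='), '=', then a value that runs
# until the next "_name=" boundary or the end of the string
PAIR_RE = re.compile(r'([^_=\n]+)=(.*?)(?=_[^_=\n]+=|$)')

def parse_entry(entry):
    """Return {param: value} for known params; an empty dict means catch-all."""
    return {name: value for name, value in PAIR_RE.findall(entry) if name in PARAMS}

print(parse_entry('src=bing_mt=e'))   # {'src': 'bing', 'mt': 'e'}
print(parse_entry('plain site link')) # {}
```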