I did this exercise yesterday mostly as practice, but it has some day-to-day utility as well. I was basically attempting to take a string that looked like the following:
src=bing_adid=8312488564_kw=swiftcapital com_kwid=35030235383_mt=e_qs=swiftcapital.com_device=c
And create a series of columns based on the parameters within the string. Here you can see src, kw, kwid, mt, etc.
I was given a set of rules for the string and used those in my assumptions, as follows:
Exercise 1: Parsing out variables
- Any of the ten parameters might appear in a string (src, adid, kw, kwid, mt, dist, qs, adpos, device, placement).
- The value for a parameter always follows the equals sign, i.e. in src=bing, bing is the value being parsed.
- The value length for any given parameter is inconsistent.
- Not every parameter will appear in every string.
- Not every parameter will appear in the same position in every string.
- The parameters will always be delimited with an underscore.
Exercise 2: Absence of parameters (catch-all)
- If none of the ten parameters appears, the entire string should be captured in a catch-all column (i.e., Site Link).
In addition, I used the assumption that both an _ and an = character could occur within a value, which is why I elected to use regular expressions (this was as much for practical utility as it was to teach myself some more regex).
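To illustrate why a plain split on underscores isn't enough under that assumption, here's a quick sketch; the entry string is hypothetical, but the lookahead is the same one used in the solution below:

```python
import re

# hypothetical entry whose kw value contains both an underscore and a space
entry = 'src=bing_kw=swift_capital inc_mt=e'

# splitting on underscores breaks the value apart
print(entry.split('_'))  # ['src=bing', 'kw=swift', 'capital inc', 'mt=e']

# the lookahead boundary _[^_=\n]+= (underscore, parameter name, equals sign)
# only stops the match at a real "next parameter", so the full value survives
print(re.search(r'kw=(.*?)(?=_[^_=\n]+=|$)', entry).group(1))  # swift_capital inc
```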
Here's my solution using pandas and re; please let me know if you can think of an easier/smarter way in Python. (I have a feeling I could have hard-coded this with things like FIND() in Excel, but I want to avoid Excel whenever possible.)
# we use pandas and regular expression libraries
import pandas as pd
import re
# drop all but the 'input' column, and drop any rows where it is NaN
df = pd.read_csv('Parsing.csv')
df = df[df['input'].notnull()]
df = df[['input']]
# defining params to search for in our keywords
parameters = ['src', 'adid', 'kw', 'kwid', 'mt', 'dist', 'qs', 'adpos', 'device', 'placement']
new_params = [params+'=' for params in parameters]
# parsed values are collected in this dict, then converted back into a DataFrame
mock_df = {key: [] for key in ['comb_str='] + new_params + ['catchall']}
# nested loop that looks for the conditions using regular expressions and appends to our mock_df
for entry in df['input']:
    mock_df['comb_str='].append(entry)
    if any(params in entry for params in new_params):
        mock_df['catchall'].append('')
        for params in new_params:
            if params in entry:
                mock_df[params].append(re.search(params + r'(.*?)(?=_[^_=\n]+=|$)', entry).group(1))
            else:
                mock_df[params].append('')
    else:
        for params in new_params:
            mock_df[params].append('')
        mock_df['catchall'].append(entry)
# to get an idea what the new df looks like
df_upd = pd.DataFrame(data=mock_df)
df_upd
# port new DF to a csv for excel use.
df_upd.to_csv(path_or_buf='parsed.csv')
I'm hoping to get help removing complexity from my code and, if possible, making it more efficient. I'd imagine an input might be around 1 million of the above strings, with 1 million rows by (number of defined params) columns as output, so it can get messy in a hurry.
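One possible simplification (a sketch of my own, not part of the solution above): pandas can apply the same lookahead pattern column-wise with Series.str.extract, which avoids the Python-level row loop entirely. The sample rows here are made up for illustration; in practice df would come from read_csv as above:

```python
import pandas as pd

parameters = ['src', 'adid', 'kw', 'kwid', 'mt', 'dist', 'qs', 'adpos', 'device', 'placement']

# stand-in for the real input; second row has no parameters at all
df = pd.DataFrame({'input': [
    'src=bing_adid=8312488564_kw=swiftcapital com_kwid=35030235383_mt=e_qs=swiftcapital.com_device=c',
    'some plain site link',
]})

out = pd.DataFrame({'comb_str': df['input']})
for p in parameters:
    # same lookahead boundary as the row-by-row version, applied to the whole column
    out[p] = df['input'].str.extract(p + r'=(.*?)(?=_[^_=\n]+=|$)', expand=False).fillna('')

# rows where no parameter matched go to the catch-all column
matched = (out[parameters] != '').any(axis=1)
out['catchall'] = df['input'].where(~matched, '')
```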
comb_str is just re-importing the string itself and adding it back to the DataFrame in case it needs to be used as an index or something later.
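Alternatively, the nested loops could collapse into a single compiled pattern that captures every name=value pair per string, filtered to the known parameter names; this is again just a sketch under the same delimiter assumptions, and each resulting dict can feed pd.DataFrame directly:

```python
import re

PARAMS = {'src', 'adid', 'kw', 'kwid', 'mt', 'dist', 'qs', 'adpos', 'device', 'placement'}

# one name=value pair: a name (no '_' or '='), '=', then a value that runs
# until the next "_name=" boundary or the end of the string
PAIR_RE = re.compile(r'([^_=\n]+)=(.*?)(?=_[^_=\n]+=|$)')

def parse_entry(entry):
    """Return {param: value} for known params; an empty dict means catch-all."""
    return {name: value for name, value in PAIR_RE.findall(entry) if name in PARAMS}

print(parse_entry('src=bing_mt=e'))   # {'src': 'bing', 'mt': 'e'}
print(parse_entry('plain site link')) # {}
```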