I am learning how to do frequency analysis of Server Fault question tags to see whether there is any useful data I can glean from them. I'm storing the raw data in Bitbucket for global access, so this code uses the same dataset I am working with; it's about 30 MB.

import pandas as pd

debiandf = pd.read_csv("https://goo.gl/zLDPw6")
debiandf["CreationDate"] = pd.to_datetime(debiandf["CreationDate"], format="%Y-%m-%d %H:%M:%S")
debiandf = debiandf.set_index(["CreationDate"])

tag_df = pd.DataFrame(index=debiandf.index, data=debiandf["Tags"])
tag_df = tag_df.reset_index().drop_duplicates(subset='CreationDate', keep='last').set_index('CreationDate')
x = tag_df["Tags"].str.extractall(r'\<(.*?)\>').unstack()
x.columns = x.columns.droplevel(0)
# column names signify the index location of the tag when extracted.
# i.e. with <ubuntu><networking><tag3> you would have [ubuntu,networking,tag3]
x.rename(columns={0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, inplace=True)

x1 = x.groupby(x.index.year)[1].apply(lambda grp: grp.value_counts().head(5))
x2 = x.groupby(x.index.year)[2].apply(lambda grp: grp.value_counts().head(5))
x3 = x.groupby(x.index.year)[3].apply(lambda grp: grp.value_counts().head(5))
x4 = x.groupby(x.index.year)[4].apply(lambda grp: grp.value_counts().head(5))
x5 = x.groupby(x.index.year)[5].apply(lambda grp: grp.value_counts().head(5))

x6 = pd.concat([x1,x2,x3], axis=1)
x6 = x6.reset_index()
x6.rename(columns={"level_0": "Year", "level_1": "Tag"}, inplace=True)
print(x6)

I'm learning how to do data analysis on my own so I can produce original content for Reddit. How can I simplify my x.groupby lines to get the top 5 value counts from every column in my x DataFrame?

I'm new to pandas, so if you could explain why it would be done that way, that would be really helpful. Thanks!
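For reference, the repeated x1…x5 lines can be collapsed into a single dictionary comprehension over the columns. A minimal sketch on toy data (the small frame here just stands in for the real x built above):

```python
import pandas as pd

# Toy stand-in for x: a DatetimeIndex with one column per tag position.
idx = pd.to_datetime(["2015-01-01", "2015-06-01", "2016-03-01"])
x = pd.DataFrame({1: ["ubuntu", "ubuntu", "linux"],
                  2: ["nginx", "mysql", "nginx"]}, index=idx)

# One value_counts per (year, column), concatenated side by side --
# this replaces the hand-written x1..x5 lines.
top5 = pd.concat(
    {c: x.groupby(x.index.year)[c].apply(lambda g: g.value_counts().head(5))
     for c in x.columns},
    axis=1)
print(top5)
```

The result is indexed by (year, tag) with one count column per tag position, so it concatenates the same way the original x1…x5 frames did.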


There aren't many pandas followers on this board. There's a lot more interest on Stack Overflow – but SO wants a stackoverflow.com/help/mcve. – hpaulj Dec 30 '16 at 22:29

Yeah, that's what I was thinking. Thanks! – mynameismevin Dec 31 '16 at 3:26

The easiest way to approach this problem is to separate the tags, unstack them, then stack them back in a way that doesn't lose the CreationDate values (done with concat), group the data, and finally use pivot_table to rearrange it so that the periods are the columns and the tags are the index.

To group the data, take one of two approaches:

The first approach keeps the creation date as a string and extracts the year with str.split, then groups the data by year and tag:

import pandas as pd

# top n tags for each year will be displayed
n = 10
debiandf = pd.read_csv("filename")
debiandf = debiandf.reset_index().set_index(['index', "CreationDate"])
df = debiandf["Tags"].str.extractall(r'\<(.*?)\>').unstack()
df.columns = df.columns.droplevel(0)

# take every column of df, convert it to a two-column frame
# (Period, tag) and stack the frames on top of each other;
# the resulting x is a two-column dataframe
x = pd.concat(
    [pd.DataFrame(df[col], columns=['tag']).reset_index(
    ).rename(columns={'CreationDate': 'Period'})
     for col in df.columns.tolist()]).drop('index', axis=1)

# reduce the period from "%Y-%m-%d %H:%M:%S" to "%Y" using split
x['Period'] = x['Period'].apply(lambda d: d.split('-')[0])

# count the occurrences of each (Period, tag) pair, then use
# pivot_table to reorganize the counts: tags as the index,
# years as the columns
x6 = x.groupby(['Period', 'tag']).size().reset_index(
    name='count').pivot_table(index='tag', columns='Period',
                              values='count')

topn = pd.concat([pd.DataFrame(x6[col].nlargest(n)).astype(
    int).reset_index().rename(columns={
        col: 'count'}) for col in x6.columns],
    keys=x6.columns, axis=1)
print(topn[topn.columns[-6:]])

The second approach uses a time-based grouper, pd.Grouper (the successor to pd.tseries.resample.TimeGrouper), to resample the creation date yearly ('A') while grouping the data by year and tag:

import pandas as pd

n = 10
debiandf = pd.read_csv("filename")
debiandf = debiandf.reset_index()
debiandf["CreationDate"] = pd.to_datetime(
    debiandf["CreationDate"], format="%Y-%m-%d %H:%M:%S")
debiandf.set_index(['index', 'CreationDate'], inplace=True)
df = debiandf["Tags"].str.extractall(r'\<(.*?)\>').unstack()
df.columns = df.columns.droplevel(0)

# take every column of df, convert it to a two-column frame
# (Period, tag) and stack the frames on top of each other;
# the resulting x is a two-column dataframe indexed by Period
x = pd.concat(
    [pd.DataFrame(df[col], columns=['tag']).reset_index(
    ).rename(columns={'CreationDate': 'Period'})
     for col in df.columns.tolist()]).drop(
    'index', axis=1).set_index('Period')
# group the data by tag and by the creation date resampled to yearly,
# then pivot so the tags are the index and the years the columns
x6 = x.groupby([pd.Grouper(freq='A'), 'tag']).size().rename(
    'count').reset_index().pivot_table(index='tag', columns='Period',
                                       values='count')
x6.columns = x6.columns.strftime('%Y')
topn = pd.concat([pd.DataFrame(x6[col].nlargest(n)).astype(
    int).reset_index().rename(columns={
        col: 'count'}) for col in x6.columns],
    keys=x6.columns, axis=1)
print(topn[topn.columns[-6:]])

Note that the second approach is more flexible: if you decide to change the resample rate, e.g. to look at quarterly rather than yearly data, all you have to change is the resample frequency and the format of the period columns.
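For instance, a quarterly breakdown only needs the grouper frequency and the period label changed. A minimal sketch on made-up data, using pd.Grouper (the modern replacement for TimeGrouper):

```python
import pandas as pd

# Made-up tag data indexed by creation date.
x = pd.DataFrame(
    {"tag": ["ubuntu", "linux", "ubuntu", "nginx"]},
    index=pd.to_datetime(
        ["2016-01-15", "2016-02-20", "2016-05-01", "2016-11-30"]))
x.index.name = "Period"

# Quarterly instead of yearly: only the frequency and the label change.
x6 = (x.groupby([pd.Grouper(freq="Q"), "tag"])
        .size()
        .rename("count")
        .reset_index()
        .pivot_table(index="tag", columns="Period", values="count"))
x6.columns = x6.columns.to_period("Q").astype(str)  # e.g. '2016Q1'
print(x6)
```

Everything downstream (the topn concat, for example) works unchanged, since only the column labels differ.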

A sample output for both snippets is shown below; it contains the top 10 tags for each year:

Period                2014                2015                2016      
                       tag count           tag count           tag count
0                   ubuntu   957        ubuntu   854        ubuntu  1010
1                    linux   428         linux   419         linux   409
2                   debian   318        debian   343        debian   339
3               apache-2.2   195    apache-2.2   120  ubuntu-14.04    91
4             ubuntu-12.04    70  ubuntu-14.04    50    apache-2.2    76
5                    nginx    59         nginx    41         nginx    67
6                    mysql    38           ssh    38    apache-2.4    46
7               networking    36    networking    37    networking    44
8                      ssh    33         mysql    29         mysql    31
9       domain-name-system    16    apache-2.4    28           ssh    27
The OP's code is way more readable... – Graipher Jan 19 at 13:14
