I am learning how to do frequency analysis of Server Fault question tags to see whether there is any useful data I can glean from them. I'm storing the raw data in Bitbucket for global access, so this code uses the same dataset I am working with; it's about 30 MB.

import pandas as pd

debiandf = pd.read_csv("https://goo.gl/zLDPw6")
debiandf["CreationDate"] = pd.to_datetime(debiandf["CreationDate"], format="%Y-%m-%d %H:%M:%S")
debiandf = debiandf.set_index(["CreationDate"])

tag_df = pd.DataFrame(index=debiandf.index, data=debiandf["Tags"])
tag_df = tag_df.reset_index().drop_duplicates(subset='CreationDate', keep='last').set_index('CreationDate')
x = tag_df["Tags"].str.extractall(r'\<(.*?)\>').unstack()
x.columns = x.columns.droplevel(0)
# column names signify the index location of the tag when extracted.
# i.e. with <ubuntu><networking><tag3> you would have [ubuntu,networking,tag3]
x.rename(columns={0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, inplace=True)

x1 = x.groupby(x.index.year)[1].apply(lambda grp: grp.value_counts().head(5))
x2 = x.groupby(x.index.year)[2].apply(lambda grp: grp.value_counts().head(5))
x3 = x.groupby(x.index.year)[3].apply(lambda grp: grp.value_counts().head(5))
x4 = x.groupby(x.index.year)[4].apply(lambda grp: grp.value_counts().head(5))
x5 = x.groupby(x.index.year)[5].apply(lambda grp: grp.value_counts().head(5))

x6 = pd.concat([x1,x2,x3], axis=1)
x6 = x6.reset_index()
x6.rename(columns={"level_0": "Year", "level_1": "Tag"}, inplace=True)
print(x6)

I'm learning how to do data analysis on my own so I can produce original content for Reddit. How can I simplify my x.groupby lines to get the top 5 value counts from every column in my x DataFrame?

I'm new to pandas, so if you could explain why it would be done that way, that would be really helpful. Thanks!
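For reference, the repeated x1…x5 lines can be collapsed into a single dictionary comprehension over the columns. A minimal sketch on toy data (the small frame here just stands in for the real x built above):

```python
import pandas as pd

# Toy stand-in for x: a DatetimeIndex with one column per tag position.
idx = pd.to_datetime(["2015-01-01", "2015-06-01", "2016-03-01"])
x = pd.DataFrame({1: ["ubuntu", "ubuntu", "linux"],
                  2: ["nginx", "mysql", "nginx"]}, index=idx)

# One value_counts per (year, column), concatenated side by side --
# this replaces the hand-written x1..x5 lines.
top5 = pd.concat(
    {c: x.groupby(x.index.year)[c].apply(lambda g: g.value_counts().head(5))
     for c in x.columns},
    axis=1)
print(top5)
```

The result is indexed by (year, tag) with one count column per tag position, so it concatenates the same way the original x1…x5 frames did.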


There aren't many pandas followers on this board. There's a lot more interest on Stack Overflow – but SO wants a stackoverflow.com/help/mcve. – hpaulj Dec 30 '16 at 22:29

Yeah, that's what I was thinking. Thanks! – mynameismevin Dec 31 '16 at 3:26

The easiest way to approach this problem is to separate the tags, unstack them, then stack them back in a way that doesn't lose the CreationDate values (done with concat), group the data, and finally use pivot_table to rearrange it so that the periods are the columns and the tags are the index.

To group the data, take one of two approaches:

The first approach keeps the creation date as a string and extracts the year with str.split, then groups the data by year and tag:

import pandas as pd

# top n tags for each year will be displayed
n = 10
debiandf = pd.read_csv("filename")
debiandf = debiandf.reset_index().set_index(['index', "CreationDate"])
df = debiandf["Tags"].str.extractall(r'\<(.*?)\>').unstack()
df.columns = df.columns.droplevel(0)

# take every column of df, convert it to a two-column frame
# (Period, tag) and stack the frames on top of each other;
# the resulting x is a two-column dataframe
x = pd.concat(
    [pd.DataFrame(df[col], columns=['tag']).reset_index(
    ).rename(columns={'CreationDate': 'Period'})
     for col in df.columns.tolist()]).drop('index', axis=1)

# reduce the period from "%Y-%m-%d %H:%M:%S" to "%Y" using split
x['Period'] = x['Period'].apply(lambda d: d.split('-')[0])

# count the occurrences of each (Period, tag) pair, then use
# pivot_table to reorganize the counts: tags as the index,
# years as the columns
x6 = x.groupby(['Period', 'tag']).size().reset_index(
    name='count').pivot_table(index='tag', columns='Period',
                              values='count')

topn = pd.concat([pd.DataFrame(x6[col].nlargest(n)).astype(
    int).reset_index().rename(columns={
        col: 'count'}) for col in x6.columns],
    keys=x6.columns, axis=1)
print(topn[topn.columns[-6:]])

The second approach uses a time-based grouper, pd.Grouper (the successor to pd.tseries.resample.TimeGrouper), to resample the creation date yearly ('A') while grouping the data by year and tag:

import pandas as pd

n = 10
debiandf = pd.read_csv("filename")
debiandf = debiandf.reset_index()
debiandf["CreationDate"] = pd.to_datetime(
    debiandf["CreationDate"], format="%Y-%m-%d %H:%M:%S")
debiandf.set_index(['index', 'CreationDate'], inplace=True)
df = debiandf["Tags"].str.extractall(r'\<(.*?)\>').unstack()
df.columns = df.columns.droplevel(0)

# take every column of df, convert it to a two-column frame
# (Period, tag) and stack the frames on top of each other;
# the resulting x is a two-column dataframe indexed by Period
x = pd.concat(
    [pd.DataFrame(df[col], columns=['tag']).reset_index(
    ).rename(columns={'CreationDate': 'Period'})
     for col in df.columns.tolist()]).drop(
    'index', axis=1).set_index('Period')
# group the data by tag and by the creation date resampled to yearly,
# then pivot so the tags are the index and the years the columns
x6 = x.groupby([pd.Grouper(freq='A'), 'tag']).size().rename(
    'count').reset_index().pivot_table(index='tag', columns='Period',
                                       values='count')
x6.columns = x6.columns.strftime('%Y')
topn = pd.concat([pd.DataFrame(x6[col].nlargest(n)).astype(
    int).reset_index().rename(columns={
        col: 'count'}) for col in x6.columns],
    keys=x6.columns, axis=1)
print(topn[topn.columns[-6:]])

Note that the second approach is more flexible: if you decide to change the resample rate, e.g. to look at quarterly rather than yearly data, all you have to change is the resample frequency and the format of the period columns.
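For instance, a quarterly breakdown only needs the grouper frequency and the period label changed. A minimal sketch on made-up data, using pd.Grouper (the modern replacement for TimeGrouper):

```python
import pandas as pd

# Made-up tag data indexed by creation date.
x = pd.DataFrame(
    {"tag": ["ubuntu", "linux", "ubuntu", "nginx"]},
    index=pd.to_datetime(
        ["2016-01-15", "2016-02-20", "2016-05-01", "2016-11-30"]))
x.index.name = "Period"

# Quarterly instead of yearly: only the frequency and the label change.
x6 = (x.groupby([pd.Grouper(freq="Q"), "tag"])
        .size()
        .rename("count")
        .reset_index()
        .pivot_table(index="tag", columns="Period", values="count"))
x6.columns = x6.columns.to_period("Q").astype(str)  # e.g. '2016Q1'
print(x6)
```

Everything downstream (the topn concat, for example) works unchanged, since only the column labels differ.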

A sample output for both snippets is shown below; it contains the top 10 tags for each year:

Period                2014                2015                2016      
                       tag count           tag count           tag count
0                   ubuntu   957        ubuntu   854        ubuntu  1010
1                    linux   428         linux   419         linux   409
2                   debian   318        debian   343        debian   339
3               apache-2.2   195    apache-2.2   120  ubuntu-14.04    91
4             ubuntu-12.04    70  ubuntu-14.04    50    apache-2.2    76
5                    nginx    59         nginx    41         nginx    67
6                    mysql    38           ssh    38    apache-2.4    46
7               networking    36    networking    37    networking    44
8                      ssh    33         mysql    29         mysql    31
9       domain-name-system    16    apache-2.4    28           ssh    27
The OP's code is way more readable... – Graipher Jan 19 at 13:14
