The easiest way to go about this problem is to separate the tags, unstack them then stack them back in a way that doesn't lose creationDate values, this is done through concat
, group the data then use pivote_table
to rearrange the data where the periods are the columns and the tags are the index
To group the data take one of two approaches;
The first approach is to keep the creation date as a string and extract the year through str.split
, then group the data by year and tag
import pandas as pd
# top n tags for each year will be displayed
n = 10
debiandf = pd.read_csv("filename")
debiandf = debiandf.reset_index().set_index(['index', "CreationDate"])
df = debiandf["Tags"].str.extractall(r'\<(.*?)\>').unstack()
df.columns = df.columns.droplevel(0)
# take all columns in df, convert each column to a df with
# columns Year(CreationDate) and tag and stack them on top of eachother
# the resulting x will be a 2 column dataframe
x = pd.concat(
[pd.DataFrame(df[x], columns=['tag']).reset_index(
).rename(columns={'CreationDate': 'Period'}
) for x in df.columns.tolist()]).drop('index', axis=1)
# change the value of year from "%Y-%m-%d %H:%M:%S" to "%Y" using split
x['Period'] = x['Period'].apply(lambda x: x.split('-')[0])
# group values of x according to year and tag that will produce a 3 column
# ['Period','tag','count']
x6 = x.groupby(['Period', 'tag'])['tag'].agg({'count': len}).reset_index(
# use pivote_table to reorganize the data
).pivot_table(index=['tag'], columns='Period').xs(
# the resulting df will have the tags as index and the years as columns
'count', axis=1, drop_level=True)
topn = pd.concat([pd.DataFrame(x6[col].nlargest(n)).astype(
int).reset_index().rename(columns={
col: 'count'}) for col in x6.columns],
keys=x6.columns, axis=1)
print(topn[topn.columns[-6:]])
The second approach involves the use of pd.tseries.resample.TimeGrouper
, to resample the date creation yearly A
while grouping the data by year and tag
import pandas as pd
n = 10
debiandf = pd.read_csv("filename")
debiandf = debiandf.reset_index()
debiandf["CreationDate"] = pd.to_datetime(
debiandf["CreationDate"], format="%Y-%m-%d %H:%M:%S")
debiandf.set_index(['index', 'CreationDate'], inplace=True)
df = debiandf["Tags"].str.extractall(r'\<(.*?)\>').unstack()
df.columns = df.columns.droplevel(0)
# take all columns in df, convert each column to a df with
# columns Year(CreationDate) and tag and stack them on top of eachother
# the resulting x will be a 2 column dataframe
x = pd.concat(
[pd.DataFrame(df[x], columns=['tag']).reset_index(
).rename(columns={'CreationDate': 'Period'}
) for x in df.columns.tolist()]).drop(
'index', axis=1).set_index('Period')
# group the data by the tags and the creationDate resampled to yearly
x6 = x.groupby([pd.tseries.resample.TimeGrouper('A'), 'tag']
)['tag'].agg({'count': len}).rename(
index=lambda x: str(
x.year) if type(x) != str else x).reset_index().pivot_table(
index=['tag'], columns='Period').xs('count', axis=1, drop_level=True)
topn = pd.concat([pd.DataFrame(x6[col].nlargest(n)).astype(
int).reset_index().rename(columns={
col: 'count'}) for col in x6.columns],
keys=x6.columns, axis=1)
print(topn[topn.columns[-6:]])
Note that the second approach is more flexible and if you decide to look at a resample rate, i.e. look at quarterly data rather than yearly, all you have to change will be the resample rates and the format of the period columns.
A sample output for both snippets is shown below, the output contains the top 10 tags for each year:
Period 2014 2015 2016
tag count tag count tag count
0 ubuntu 957 ubuntu 854 ubuntu 1010
1 linux 428 linux 419 linux 409
2 debian 318 debian 343 debian 339
3 apache-2.2 195 apache-2.2 120 ubuntu-14.04 91
4 ubuntu-12.04 70 ubuntu-14.04 50 apache-2.2 76
5 nginx 59 nginx 41 nginx 67
6 mysql 38 ssh 38 apache-2.4 46
7 networking 36 networking 37 networking 44
8 ssh 33 mysql 29 mysql 31
9 domain-name-system 16 apache-2.4 28 ssh 27
pandas
followers on this board. There's a lot more interest Stackoverflow - but SO wants a stackoverflow.com/help/mcve. – hpaulj Dec 30 '16 at 22:29