My problem is a bit hard to explain. I'm analyzing an Apache log file; the following is one line from it.

112.135.128.20 - [13/May/2013:23:55:04 +0530] "GET /SVRClientWeb/ActionController HTTP/1.1" 302 2 "https://www.example.com/sample" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10B329" GET /SVRClientWeb/ActionController - HTTP/1.1 www.example.com

Some parts from my code:

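# df is the DataFrame parsed from the log; map Apache format codes to readable names
df = df.rename(columns={'%>s': 'Status', '%b': 'Bytes Returned',
                        '%h': 'IP', '%l': 'Username', '%r': 'Request',
                        '%t': 'Time', '%u': 'Userid',
                        '%{Referer}i': 'Referer', '%{User-Agent}i': 'Agent'})
df.index = pd.to_datetime(df.pop('Time'))
# count hits per (IP, Agent) pair and show the 20 largest counts
test = df.groupby(['IP', 'Agent']).size()
test = test.sort_values()
print(test[-20:])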

I read the log file into a DataFrame and get the following output with hit counts and user agents.

IP               Agent                                                                                                 
74.86.158.106    Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)                                     369
203.81.107.103   Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20100101 Firefox/21.0                                          388
173.199.120.155  Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/)                                         417
124.43.84.242    Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31      448
112.135.196.223  Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36      454
124.43.155.138   Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0                                   461
124.43.104.198   Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20100101 Firefox/21.0                                          467

Then I want to:

  1. Get the 3 IPs with the highest hit counts and find how frequently each one occurs (for example, the time difference between successive hits from that IP).
  2. Find out whether a single IP has different agents.

Could you at least explain how to solve the above problems?


1 Answer

Accepted answer

To do the first part you could just sort the DataFrame (by count) and take the top three rows:

In [11]: df.sort_values('Count', ascending=False).head(3)
Out[11]:
                IP                                              Agent  Count
6   124.43.104.198  Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20...    467
5   124.43.155.138  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) G...    461
4  112.135.196.223  Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3...    454
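
The `Count` column used here is not created by the code in the question. A minimal sketch of one way to get it, assuming the per-pair hit counts come from the question's `groupby(...).size()` and are flattened with `reset_index(name='Count')`. The `log` frame and its IPs and agents are made-up stand-ins, and `nlargest` is used as an alternative to sorting and taking the head:

```python
import pandas as pd

# Hypothetical stand-in for the parsed log: one row per request.
log = pd.DataFrame({
    "IP": ["10.0.0.1"] * 3 + ["10.0.0.2"] * 2 + ["10.0.0.3"] + ["10.0.0.4"] * 4,
    "Agent": ["A"] * 3 + ["B"] * 2 + ["C"] + ["D"] * 4,
})

# Hits per (IP, Agent) pair, flattened into columns IP, Agent, Count.
counts = log.groupby(["IP", "Agent"]).size().reset_index(name="Count")

# The three rows with the highest counts, largest first.
top3 = counts.nlargest(3, "Count")
print(top3)
```

If the same IP can appear with several agents, grouping by `IP` alone before counting would give per-IP totals instead of per-pair totals.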

To test whether there are multiple rows (Agents) for a single IP you can use groupby:

In [12]: g = df.groupby('IP')

In [13]: repeated = g.count().Count != 1

In [14]: repeated
Out[14]:
IP
112.135.196.223    False
124.43.104.198     False
124.43.155.138     False
124.43.84.242      False
173.199.120.155    False
203.81.107.103     False
74.86.158.106      False
Name: Count, dtype: bool

In [15]: repeated[repeated]
Out[15]: Series([], dtype: bool)

There are none in this example.
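
Counting rows per IP works here because the frame has one row per (IP, Agent) pair. As an alternative sketch, `nunique` counts the distinct agents per IP directly and should also work on the raw per-request frame. The sample rows below are made up for illustration:

```python
import pandas as pd

# Hypothetical per-request rows; 10.0.0.1 appears with two different agents.
log = pd.DataFrame({
    "IP": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2"],
    "Agent": ["Firefox", "Chrome", "Firefox", "Safari", "Safari"],
})

# Number of distinct agents seen for each IP.
agents_per_ip = log.groupby("IP")["Agent"].nunique()

# Keep only the IPs that used more than one agent.
multi_agent = agents_per_ip[agents_per_ip > 1]
print(multi_agent)
```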

Thanks Andy, you already know most of the things about my project :). So how about taking the time difference for the head(3) IPs? E.g. 124.43.104.198 first occurs at 06.05.02 and there is another hit at 06.10.03. Could you explain this for just one IP? – Nilani Algiriyage 8 mins ago
That's a little trickier and, tbh, I'm not 100% sure exactly what you want. I think it would make more sense as a separate question (then you can go into some more detail) :) – Andy Hayden 5 mins ago
OK, fine. Thanks very much! :) – Nilani Algiriyage 3 mins ago
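
One possible reading of the time-difference follow-up, sketched under assumptions: the frame is indexed by request time (as in the question's `df.index = pd.to_datetime(...)` line), and "frequency" means the gap between successive hits from one IP. The IPs and timestamps below are made up:

```python
import pandas as pd

# Hypothetical hits, indexed by request time as in the question's code.
log = pd.DataFrame(
    {"IP": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"]},
    index=pd.to_datetime([
        "2013-05-13 06:05:02",
        "2013-05-13 06:07:00",
        "2013-05-13 06:10:03",
        "2013-05-13 06:20:03",
    ]),
)

# Select one IP, order its hits by time, and take successive differences.
hits = log[log["IP"] == "10.0.0.1"].sort_index()
gaps = hits.index.to_series().diff().dropna()

print(gaps)         # gap before each hit after the first
print(gaps.mean())  # average gap for this IP
```

The same `diff` could be applied per IP with a `groupby` if all of the top IPs are needed at once.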
