Would switching my cuts below to sed improve performance? I am trying to get a per-date count of requests for the last two weeks from a server log. The script runs, but slowly, finishing in around 14 minutes. The file is 8,867,820 lines long and around 1.9 GB. I would guess grep, sed, or awk could do this more efficiently, but my initial attempts failed and I resorted to cut.
Is my piping and redirection causing unnecessary delay, or is this simply an issue of processing a large file?
# Log is in common log format
# host ident authuser date request status bytes
# 127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
#initialize check date and end date
ckdate=$(date --date='1 day ago' +%s)
#enddate=$(date --date='1 fortnight ago' +%s)
enddate=$(date --date='3 days ago' +%s) #test date for shorter term
#feed log in reverse into loop
# tac prints the log backwards (newest entry is at the bottom of the file)
# ckdate holds the epoch seconds of the date on the line most recently read
tac "/etc/httpd/logs/access_log" | \
while IFS= read -r line && (( ckdate >= enddate ))
do
#send line into output, cutting out the date-time field (-f4),
# keeping the date only (-d: -f1), and reformatting it as YYYY-MM-DD
echo "$line" | cut -d ' ' -f4 | tr -d '[' \
| cut -d: -f1 | tr '/' '-' | xargs -I{} date -d '{}' +'%Y-%m-%d'
#update the check date from the current line, as seconds since 1970
ckdate=$(echo "$line" | cut -d ' ' -f4 | tr -d '[' \
| cut -d: -f1 | tr '/' '-' | xargs -I{} date -d '{}' +'%s')
done | sort | uniq -c | head -n -1 | head -n 2
#put output into a sorted list with uniq counts, and trim the partial boundary days
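For comparison, here is what a single-pass awk version of the counting might look like. The loop above forks several processes (cut, tr, xargs, date) for every one of the ~8.9 million lines, which is the likely bottleneck, whereas awk can do the field splitting and counting in one process. This is a self-contained sketch run against a tiny hypothetical sample log (sample.log stands in for the real access_log, and the cutoff is a fixed YYYY-MM-DD string compared lexicographically, so no per-line date calls are needed):

```shell
# Hypothetical sample standing in for /etc/httpd/logs/access_log
cat > sample.log <<'EOF'
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326
127.0.0.1 user-identifier frank [10/Oct/2000:14:01:02 -0700] "GET /b.gif HTTP/1.0" 200 1024
127.0.0.1 user-identifier frank [11/Oct/2000:09:00:00 -0700] "GET /c.gif HTTP/1.0" 200 512
EOF

# For real use this would be: cutoff=$(date --date='14 days ago' +%Y-%m-%d)
cutoff='2000-01-01'

counts=$(tac sample.log | awk -v cutoff="$cutoff" '
BEGIN {
    # map month abbreviations to numbers once, instead of calling date per line
    n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m)
    for (i = 1; i <= n; i++) mon[m[i]] = i
}
{
    # $4 looks like "[10/Oct/2000:13:55:36"; strip "[" and split on "/" and ":"
    split(substr($4, 2), t, "[/:]")
    d = sprintf("%s-%02d-%s", t[3], mon[t[2]], t[1])   # YYYY-MM-DD
    if (d < cutoff) exit   # input is reversed, so only older lines remain
    cnt[d]++
}
END { for (day in cnt) print cnt[day], day }' | sort -k2)

echo "$counts"
```

On the sample this prints "2 2000-10-10" followed by "1 2000-10-11". ISO-format date strings sort correctly as plain text, which is why a lexicographic comparison against the cutoff works and no epoch conversion is needed inside the loop.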