Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems.
I'm looking for some assistance optimizing a Bro network log parsing script. Here's the background:

I have a large amount of bro logs, but I'm only interested in querying IPs within my scope (multiple variable length subnets).

So I have a text file with regex patterns to match the IP ranges I'm looking for: scope.txt:

/^10\.0\.0\.([8-9]|[1-3][0-9]|4[0-5])$/
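As a quick sanity check of that pattern (with the surrounding /.../ delimiters dropped, and using grep -E on a few sample strings rather than awk on real logs), only last octets 8 through 45 match:

```shell
# Feed candidate IPs through the pattern; only 10.0.0.8-10.0.0.45 should pass.
printf '%s\n' 10.0.0.7 10.0.0.8 10.0.0.45 10.0.0.46 |
grep -E '^10\.0\.0\.([8-9]|[1-3][0-9]|4[0-5])$'
# prints 10.0.0.8 and 10.0.0.45
```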

(scope.txt contains up to 20 more lines of regex patterns for other IP ranges.)

findInScope.sh:

#!/bin/sh
for file in /data/bro_logs/2016-11-26/conn.*.log.gz
do
    echo "$file"
    base=${file##*/}    # $file is a full path; use just the file name under /tmp
    : > "/tmp/$base"
    for nets in $(cat scope.txt)
    do
        echo "$nets"
        # each $nets is a /.../ regex literal spliced into the awk program
        zcat "$file" | bro-cut -d | awk '$3 ~ '"$nets"' || $5 ~ '"$nets" >> "/tmp/$base"
    done
    sort "/tmp/$base" | uniq > ~/"$base"
    rm "/tmp/$base"
done

Some more background: each hour of original Bro conn logs is about 100 MB, so my current script takes about 10-20 minutes to parse through one hour of log data. One day of logs can therefore take up to 3 hours.

I thought about a single awk statement with 40 or's, but decided against it because I want a separate scope.txt file, so I can use the same script with different scopes of IP ranges.

I also tried zcat on multiple conn.log files at once (i.e. zcat conn.*.log.gz), but the output file ended up being over 1 GB, and I wanted to keep the hourly logs intact.

Looking forward to some ideas, thank you all.

-Ivan


You should gain a lot by passing the log file just once through awk. This means combining all the regexps into one. If you don't want to do this in your scope.txt file, then do it before calling awk. For example:

sed <scope.txt 's|^/\^|(|; s|\$/$|)|; $!s/$/|/' | tr -d '\n' >pattern

zcat "$file" | bro-cut -d |
awk '
BEGIN{ getline pat <"pattern"; pat = "^(" pat ")$" }
$3 ~ pat || $5 ~ pat
' > ~/"${file##*/}"

The sed replaces the /^ and $/ surrounding each regexp line with an enclosing () pair, adds a | at the end of every line but the last, and puts the result all on one line in the file pattern. This file is therefore all the patterns or-ed together. The surrounding ^(...)$ is added in the awk BEGIN statement, which reads the pattern file into the variable pat.
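For example, with a hypothetical two-line scope.txt (file names here are just illustrative), the sed pipeline produces a single alternation:

```shell
# Build a two-pattern scope file and combine it into one line of alternatives.
cd "$(mktemp -d)"
printf '%s\n' '/^10\.0\.0\.1$/' '/^192\.168\.1\.[0-9]+$/' > scope.txt
sed <scope.txt 's|^/\^|(|; s|\$/$|)|; $!s/$/|/' | tr -d '\n' > pattern
cat pattern; echo
# prints (10\.0\.0\.1)|(192\.168\.1\.[0-9]+)
```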

The above replaces your inner for loop, and the sort|uniq.

Not quite; ^(one)|(two)|(three)|(four)$ only anchors one at the left, four at the right, and two and three not at all. You want ^(one|two|three|four)$ if all the individual patterns are anchored-both like the example given; otherwise leave the anchors (if present) within each alternative (unchanged) and don't add any. – dave_thompson_085 5 hours ago
Oops, you are right. Thanks. – meuh 5 hours ago

The simplest answer is to use scope.txt, very slightly modified, as a pattern file, and use zcat | grep (or just zgrep) to get the lines you need.

First, modify your scope file to change:

/^10\.0\.0\.([8-9]|[1-3][0-9]|4[0-5])$/

into:

(^|[^0-9.])(10\.0\.0\.([8-9]|[1-3][0-9]|4[0-5]))($|[^0-9.])

To do this easily you can use:

sed -e 's:^/\^:(^|[^0-9.])(:' -e 's:\$/$:)($|[^0-9.]):' scope.txt > grepscope.txt
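To see why the added (^|[^0-9.]) and ($|[^0-9.]) guards matter, one can test the transformed pattern against near-miss strings (standalone grep on sample strings here, not the actual logs):

```shell
# The guards stop matches inside longer tokens such as 110.0.0.9 or 10.0.0.99.
printf '%s\n' 110.0.0.9 10.0.0.9 10.0.0.99 |
grep -E '(^|[^0-9.])(10\.0\.0\.([8-9]|[1-3][0-9]|4[0-5]))($|[^0-9.])'
# prints only 10.0.0.9
```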

Then, do your search:

zgrep -Ehf grepscope.txt /data/bro_logs/2016-11-26/conn.*.log.gz | less

Or, since you want the output for each file stored separately:

for f in /data/bro_logs/2016-11-26/conn.*.log.gz; do
    zgrep -Ehf grepscope.txt "$f" | sort -u > ~/"${f##*/}"
done

Note also that the "for" loop variable $f contains the entire path to each file in turn. Directing output to ~/"$f" would refer to subdirectories (~/data/bro_logs/2016-11-26) that probably don't exist in your home directory; to avoid those errors, we strip off everything up to the final slash and use just the base name of each log file.
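The parameter expansion doing that stripping can be seen in isolation (the path below is just an illustrative value):

```shell
# ${f##*/} deletes the longest prefix matching "*/", leaving the base name.
f=/data/bro_logs/2016-11-26/conn.00.log.gz
echo "${f##*/}"
# prints conn.00.log.gz
```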


The flags to zgrep bear mentioning:

-E specifies extended regex, so that the parentheses in your patterns don't need to be escaped.

-h suppresses printing the filename as a prefix to each matching line. (You can omit this in the for loop version, since by default grep only prints the filename when searching more than one file, as in the first command I specified—but it doesn't hurt anything to keep it in both versions.)

-f allows you to specify a pattern file. This is exactly what you need, according to your question: grep -f lets you use multiple search patterns taken from a file, without constructing an awk command with huge numbers of "or"s.
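A minimal illustration of grep -Ef with a pattern file (the file name and patterns here are invented for the demo): a line is printed if any pattern in the file matches it.

```shell
# Two patterns in a file; grep -Ef matches a line against all of them at once.
cd "$(mktemp -d)"
printf '%s\n' '10\.0\.0\.1' '192\.168\.1\.[0-9]+' > grepscope.txt
printf '%s\n' 10.0.0.1 10.0.0.2 192.168.1.77 | grep -Ef grepscope.txt
# prints 10.0.0.1 and 192.168.1.77
```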


sort | uniq can generally be replaced by sort -u, unless you need to use some of uniq's option flags. In this case you don't, so I've used the simpler form sort -u.
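The equivalence is easy to check on a small sample:

```shell
# For plain deduplication, sort -u and sort | uniq give the same result.
printf '%s\n' b a b c a | sort | uniq
printf '%s\n' b a b c a | sort -u
# both print a, b, c on separate lines
```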

@iruvar, thanks for the catch! I was consulting a BSD grep man page (on OS X). grep -z or grep -Z on BSD is equivalent to zgrep on Linux; however zgrep works in either place, so would be the preferred way to call for unzipping. GNU grep -z, as you point out, modifies grep's behavior to use null-delimiting input rather than newline delimited. I've updated the answer to correct this. – Wildcard 2 hours ago
