Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I have written a shell script to process my huge data files (each around 7,000,000 lines, roughly a week's data per file). Below is a sample of my data file (i.e., input file) structure:

808 836 204 325 148 983
908 836 203 326 148 986
8 835 204 325 149 984
108 835 204 325 148 984
208 836 204 326 149 984
308 834 203 325 149 985
408 836 204 326 149 983
508 834 203 325 149 985
20130402,123358 0 $GPRMC,123358.000,A,5056.3056,N,00622.5644,E,0.00,0.00,020413,,,A*67
608 834 203 325 150 985
708 834 204 326 150 986
808 836 204 325 151 983
908 835 204 325 153 984
8 816 202 325 153 973
108 836 204 324 156 984
208 835 204 325 157 983
308 834 202 324 158 985
408 835 203 325 158 985
508 836 203 324 160 984
20130402,123359 0 $GPRMC,123359.000,A,5056.3056,N,00622.5644,E,0.01,0.00,020413,,,A*67
608 835 204 325 162 986
708 836 204 324 164 983
808 835 202 324 165 986
908 836 204 324 167 983
8 836 202 324 168 985
108 835 203 325 170 986
208 836 203 324 171 983

I have an instrument whose counter provides data every 0.1 second, while the GPS provides a measurement every 1 second. For every GPS measurement, I want to extract the 5 instrument lines above it and the 5 lines below it. I don't need all 6 elements of each instrument record: I only require the 5th element of those 10 surrounding lines for each GPS record. In addition, from the GPS record I want to extract the date (1st element), time (2nd element), latitude and longitude. So, from the sample above, I should obtain the following:

20130402 123358 5056.3056 00622.5644 148 149 149 149 149 150 150 151 153 153
20130402 123359 5056.3056 00622.5644 156 157 158 158 160 162 164 165 167 168

To extract and arrange the data as shown above, I initially wrote MATLAB and IDL code. Then I wrote the shell script below:

#!/bin/bash
clear
# Program to read *.bin files
# Data directory
DATAPATH='/home/xyz/datasets/2013_04_09/'

# Output data directory
OUTPATH='/home/abc/data_1/'
count=1; 

# Read the files sequentially
for file in $DATAPATH*.bin; do
  INFILE=$file;      # Input file   
  INFILENAME=`echo $INFILE | awk -F'/' '{print $7}'`
  SUFFIX="1.txt"
  OUTFILE="$OUTPATH${INFILENAME:0:18}$SUFFIX"   # Output file  
  TEMPFILE="${OUTPATH}tempresult_sed.txt"
  awk '{if(length($0) >= 79) print NR,",",$0}' $INFILE | sed 's/ /,/g' > $TEMPFILE
  lines=`cat $TEMPFILE | awk -F, '{print $1}'`
  lat=`cat $TEMPFILE | awk -F, '{print $10}'`
  lon=`cat $TEMPFILE | awk -F, '{print $12}'`
  date=`cat $TEMPFILE | awk -F, '{print $4}'`
  time=`cat $TEMPFILE | awk -F, '{print $5}'`
  array_lines=($lines)
  array_time=($time)
  array_lat=($lat)
  array_lon=($lon)
  array_date=($date)
  count=${#array_lines[@]}  # Number of data records
 for i in `seq 1 $count`; do 
    idx=$(($i-1))
    echo ${array_lines[$idx]} ${array_date[$idx]} ${array_time[$idx]} ${array_lat[$idx]}  ${array_lon[$idx]} `sed $((${array_lines[$idx]}-5))","$((${array_lines[$idx]}-1))"!d" < $INFILE | awk '{print $5}'` `sed $((${array_lines[$idx]}+1))","$((${array_lines[$idx]}+5))"!d" < $INFILE | awk '{print $5}'`
  done > $OUTFILE
  rm -f $TEMPFILE  # Remove the temporary file
  let count++;
done

In the script above, for cross-checking, I have included the input file's line number (${array_lines[$idx]}) in the output. I started running the script on a server. After more than two days, not even a single input file (with 7,000,000 lines) has been completed; only around 1.5 million lines have been written to my OUTFILE so far. Pulling all the GPS lines (i.e., strings of length 80) into a TEMPFILE takes only a minute, but extracting the 5th element from the 5 lines above and below each GPS record and arranging them as specified is what takes so long.

I would really appreciate suggestions or corrections to the code above so that it runs faster. I already posted this script in a previous question, but for a different problem (http://stackoverflow.com/questions/17612290/how-to-extract-lines-in-between-a-file-read-a-part-of-a-string-with-linux-shel), so please don't worry about cross-posting. How can I extract the data shown above from this input file structure in much less time?


migrated from stackoverflow.com Jul 24 '13 at 8:19

This question came from our site for professional and enthusiast programmers.

Why did you select the tags python and perl while your script is using awk and bash? Are you open to a solution in those languages instead? –  Maxime Jul 18 '13 at 9:54
You write "Here in this forum", but StackOverflow is not a forum. Apart from that, I would recommend rewriting the whole thing, e.g. in Perl. –  Slaven Rezic Jul 18 '13 at 9:54
Sorry, but shell script isn't the right tool if you want performance. You'll end up spending all your time forking to call commands like awk and sed. Try a real programming language such as Perl / Python / C. –  hivert Jul 18 '13 at 9:56
Madhavan: it is very simple. People want to help and ironically point out that your question is not a good question. You did not isolate a precise question. Sometimes, one has to ask generally, like you, but you also have not been able to write up your problem in a way so that it is easy for people to follow and understand. Finally, your comment "don't try commenting unless you have intention to help me solve this problem!" is just an affront against all the people here that are willing to help (basically everyone). –  Jan-Philip Gehrcke Jul 18 '13 at 10:32

3 Answers

The details of your question are entirely unclear to me. I have understood, however, that you have a huge input file which you need to parse and translate into some output. The input is so large that it becomes important to do the parsing and translation in an efficient way.

As you have already seen, spawning subprocesses for every single part of line processing is entirely inefficient (this is what you do in your shell script, when you pipe things around between sed and awk). Also, it seems to me that you read certain parts of the input data multiple times.

You need to use a high-level programming language such as Python (which I really recommend here) and then use an idiom such as

with open('input.txt') as f:
    for line in f:
        your_process_function_for_a_line(line)

This way, the data is processed while reading the file. You can create some data structures before starting this loop, e.g. a dictionary in which to store your output, and then populate this data structure in the loop outlined above. Or, better still, you could write your output file while reading the input file. The idiom would look like this:

with open('output.txt', 'w') as outfile:
    with open('input.txt') as infile:
        for line in infile:
            # Process the input line and generate a line for the
            # output file; process_line stands for whatever
            # transformation you need (returning None skips the line).
            outline = process_line(line)
            if outline:
                outfile.write(outline)
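To make that loop concrete for the file format in the question, a one-pass version might look like the sketch below. This is my own minimal illustration, not tested against the real data files: `process` is a hypothetical name, and it assumes every GPS line is preceded and followed by at least five counter lines, as in the sample.

```python
# Minimal sketch: single pass over the file, keeping the 5th field of
# the last five counter lines in a deque, and completing each GPS
# record once five further counter lines have been seen.
from collections import deque

def process(infile, outfile):
    prev = deque(maxlen=5)   # 5th field of the most recent counter lines
    pending = []             # [fields, values_still_needed] per GPS record
    for line in infile:
        parts = line.split()
        if not parts:
            continue
        if ',' in parts[0]:  # GPS record: "date,time 0 $GPRMC,..."
            date, time = parts[0].split(',')
            g = parts[2].split(',')           # the $GPRMC sentence
            lat, lon = g[3], g[5]
            pending.append([[date, time, lat, lon] + list(prev), 5])
        else:                # counter record: six numbers, keep the 5th
            fifth = parts[4]
            prev.append(fifth)
            still_open = []
            for rec in pending:
                rec[0].append(fifth)
                rec[1] -= 1
                if rec[1] == 0:
                    outfile.write(' '.join(rec[0]) + '\n')
                else:
                    still_open.append(rec)
            pending = still_open
```

Run over the sample input in the question, this produces exactly the two output lines shown there.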

This should be a lot speedier: it only has to read each line in each .bin file a single time:

for infile in "$DATAPATH"/*.bin; do 
    outfile="$OUTPATH/$(basename "$infile" ".bin")1.txt" 
    awk -F'[ ,]' ' 
        NF==16 { 
            printf "%s %s %s %s ", $1,$2,$7,$9
            printf "%s %s %s %s %s ", prev5,prev4,prev3,prev2,prev1
            for (i=1; i<=5; i++) { getline; printf "%s ", $5 } 
            print ""                                                      
            next                
        }                           
        { prev5=prev4; prev4=prev3; prev3=prev2; prev2=prev1; prev1=$5 } 
    ' <"$infile" >"$outfile" 
done 

You could also see if this perl is faster than the awk:

perl -F'/[\s,]/' -lane '
    if (@F == 16) {
        @fields = ($F[0], $F[1], $F[6], $F[8], @prev);
        do { $_ = <>; push @fields, (split)[4] } for (1..5);
        print join(" ", @fields)
    } 
    else { push @prev, $F[4]; shift @prev if @prev > 5 }
'

Awk can do this.

/,/ {
        split($1,x,",");
        split($3,y,",");
        line = "";
        for(i = n - 5; i < n; i++) {
                if(i >= 0) line = line " " a[i]
        }
        printf "%s %s %s %s%s", x[1],x[2],y[4],y[6],line;
        z=5;
        next
}
{
        a[n++] = $5;
        if(z-- > 0) {
                printf " %s",$5
                if(z == 0) print ""
        }       
}          
