Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I have written a shell script to process my huge data files (each around 7,000,000 lines, roughly a week's data per file). Below is a sample of my data file (i.e., input file) structure:

808 836 204 325 148 983
908 836 203 326 148 986
8 835 204 325 149 984
108 835 204 325 148 984
208 836 204 326 149 984
308 834 203 325 149 985
408 836 204 326 149 983
508 834 203 325 149 985
20130402,123358 0 $GPRMC,123358.000,A,5056.3056,N,00622.5644,E,0.00,0.00,020413,,,A*67
608 834 203 325 150 985
708 834 204 326 150 986
808 836 204 325 151 983
908 835 204 325 153 984
8 816 202 325 153 973
108 836 204 324 156 984
208 835 204 325 157 983
308 834 202 324 158 985
408 835 203 325 158 985
508 836 203 324 160 984
20130402,123359 0 $GPRMC,123359.000,A,5056.3056,N,00622.5644,E,0.01,0.00,020413,,,A*67
608 835 204 325 162 986
708 836 204 324 164 983
808 835 202 324 165 986
908 836 204 324 167 983
8 836 202 324 168 985
108 835 203 325 170 986
208 836 203 324 171 983

I have an instrument whose counter provides data every 0.1 second, while the GPS provides a measurement every 1 second. For every GPS measurement, I want to extract the 5 instrument lines above it and the 5 lines below it. I don't need all 6 elements of each instrument record: I only require the 5th element of those 10 surrounding lines for each GPS record. In addition, from the GPS record I want to extract the date (1st element), time (2nd element), latitude and longitude. So, from the sample above, I should obtain the following:

20130402 123358 5056.3056 00622.5644 148 149 149 149 149 150 150 151 153 153
20130402 123359 5056.3056 00622.5644 156 157 158 158 160 162 164 165 167 168

To extract and arrange the data as shown above, I initially wrote MATLAB and IDL code. Then I wrote the shell script below:

#!/bin/bash
clear
# Program to read *.bin files
# Data directory
DATAPATH='/home/xyz/datasets/2013_04_09/'

# Output data directory
OUTPATH='/home/abc/data_1/'
count=1; 

# Read the files sequentially
for file in $DATAPATH*.bin; do
  INFILE=$file;      # Input file   
  INFILENAME=`echo $INFILE | awk -F'/' '{print $7}'`
  SUFFIX="1.txt"
  OUTFILE="$OUTPATH${INFILENAME:0:18}$SUFFIX"   # Output file  
  TEMPFILE="${OUTPATH}tempresult_sed.txt"
  awk '{if(length($0) >= 79) print NR,",",$0}' $INFILE | sed 's/ /,/g' > $TEMPFILE
  lines=`cat $TEMPFILE | awk -F, '{print $1}'`
  lat=`cat $TEMPFILE | awk -F, '{print $10}'`
  lon=`cat $TEMPFILE | awk -F, '{print $12}'`
  date=`cat $TEMPFILE | awk -F, '{print $4}'`
  time=`cat $TEMPFILE | awk -F, '{print $5}'`
  array_lines=($lines)
  array_time=($time)
  array_lat=($lat)
  array_lon=($lon)
  array_date=($date)
  count=${#array_lines[@]}  # Number of data records
 for i in `seq 1 $count`; do 
    idx=$(($i-1))
    echo ${array_lines[$idx]} ${array_date[$idx]} ${array_time[$idx]} ${array_lat[$idx]}  ${array_lon[$idx]} `sed $((${array_lines[$idx]}-5))","$((${array_lines[$idx]}-1))"!d" < $INFILE | awk '{print $5}'` `sed $((${array_lines[$idx]}+1))","$((${array_lines[$idx]}+5))"!d" < $INFILE | awk '{print $5}'`
  done > $OUTFILE
  rm -f $TEMPFILE  # Remove the temporary file
  let count++;
done

In the script above, for cross-checking, I have included the input file's line number (${array_lines[$idx]}) in the output. I started running the script on a server. After more than two days, not even a single input file (with 7,000,000 lines) has been completed; only around 1.5 million lines have been written to my OUTFILE so far. Pulling all the GPS lines (i.e., strings of length 80) into a TEMPFILE takes only a minute, but extracting the 5th element from the 5 lines above and below each GPS record and arranging them as specified is what takes so long.

I would really appreciate suggestions or corrections to the code above so that it runs faster. I already posted this script in a previous question, but for a different problem (http://stackoverflow.com/questions/17612290/how-to-extract-lines-in-between-a-file-read-a-part-of-a-string-with-linux-shel), so please don't worry about cross-posting. How can I extract the data shown above from this input file structure in much less time?


migrated from stackoverflow.com Jul 24 '13 at 8:19

This question came from our site for professional and enthusiast programmers.

Why did you select the tags python and perl while your script is using awk and bash? Are you open to a solution in those languages instead? –  Maxime Jul 18 '13 at 9:54
You write "Here in this forum", but StackOverflow is not a forum. Apart from that, I would recommend rewriting the whole thing, e.g. in Perl. –  Slaven Rezic Jul 18 '13 at 9:54
Sorry, but shell script isn't the right tool if you want performance. You'll end up spending all your time forking to call commands like awk and sed. Try a real programming language such as Perl / Python / C. –  hivert Jul 18 '13 at 9:56
Madhavan: it is very simple. People want to help and ironically point out that your question is not a good question. You did not isolate a precise question. Sometimes, one has to ask generally, like you, but you also have not been able to write up your problem in a way so that it is easy for people to follow and understand. Finally, your comment "don't try commenting unless you have intention to help me solve this problem!" is just an affront against all the people here that are willing to help (basically everyone). –  Jan-Philip Gehrcke Jul 18 '13 at 10:32

3 Answers

The details of your question are entirely unclear to me. I have understood, however, that you have a huge input file which you need to parse and translate into some output. The input is so large that it becomes important to do the parsing and translation in an efficient way.

As you have already seen, spawning subprocesses for every single part of line processing is entirely inefficient (this is what you do in your shell script, when you pipe things around between sed and awk). Also, it seems to me that you read certain parts of the input data multiple times.

You need to use a high-level programming language such as Python (which I really recommend here) and then use an idiom such as

with open('input.txt') as f:
    for line in f:
        your_process_function_for_a_line(line)

This way, the data is processed while reading the file. You can create some data structures before starting this loop, e.g. a dictionary in which to store your output, and then populate this data structure in the loop outlined above. Or, better still, you could write your output file while reading the input file. The idiom would look like this:

with open('output.txt', 'w') as outfile:
    with open('input.txt') as infile:
        for line in infile:
            # Process the input line and generate a line for the
            # output file; process_line stands for whatever
            # transformation you need (returning None skips the line).
            outline = process_line(line)
            if outline:
                outfile.write(outline)
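To make that loop concrete for the file format in the question, a one-pass version might look like the sketch below. This is my own minimal illustration, not tested against the real data files: `process` is a hypothetical name, and it assumes every GPS line is preceded and followed by at least five counter lines, as in the sample.

```python
# Minimal sketch: single pass over the file, keeping the 5th field of
# the last five counter lines in a deque, and completing each GPS
# record once five further counter lines have been seen.
from collections import deque

def process(infile, outfile):
    prev = deque(maxlen=5)   # 5th field of the most recent counter lines
    pending = []             # [fields, values_still_needed] per GPS record
    for line in infile:
        parts = line.split()
        if not parts:
            continue
        if ',' in parts[0]:  # GPS record: "date,time 0 $GPRMC,..."
            date, time = parts[0].split(',')
            g = parts[2].split(',')           # the $GPRMC sentence
            lat, lon = g[3], g[5]
            pending.append([[date, time, lat, lon] + list(prev), 5])
        else:                # counter record: six numbers, keep the 5th
            fifth = parts[4]
            prev.append(fifth)
            still_open = []
            for rec in pending:
                rec[0].append(fifth)
                rec[1] -= 1
                if rec[1] == 0:
                    outfile.write(' '.join(rec[0]) + '\n')
                else:
                    still_open.append(rec)
            pending = still_open
```

Run over the sample input in the question, this produces exactly the two output lines shown there.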

This should be a lot speedier: it only has to read each line in each .bin file a single time:

for infile in "$DATAPATH"/*.bin; do 
    outfile="$OUTPATH/$(basename "$infile" ".bin")1.txt" 
    awk -F'[ ,]' ' 
        NF==16 { 
            printf "%s %s %s %s ", $1,$2,$7,$9
            printf "%s %s %s %s %s ", prev5,prev4,prev3,prev2,prev1
            for (i=1; i<=5; i++) { getline; printf "%s ", $5 } 
            print ""                                                      
            next                
        }                           
        { prev5=prev4; prev4=prev3; prev3=prev2; prev2=prev1; prev1=$5 } 
    ' <"$infile" >"$outfile" 
done 

You could also see if this perl is faster than the awk:

perl -F'/[\s,]/' -lane '
    if (@F == 16) {
        @fields = ($F[0], $F[1], $F[6], $F[8], @prev);
        do { $_ = <>; push @fields, (split)[4] } for (1..5);
        print join(" ", @fields)
    } 
    else { push @prev, $F[4]; shift @prev if @prev > 5 }
'

Awk can do this.

/,/ {
        split($1,x,",");
        split($3,y,",");
        line = "";
        for(i = n - 5; i < n; i++) {
                if(i >= 0) line = line " " a[i]
        }
        printf "%s %s %s %s%s", x[1],x[2],y[4],y[6],line;
        z=5;
        next
}
{
        a[n++] = $5;
        if(z-- > 0) {
                printf " %s",$5
                if(z == 0) print ""
        }       
}          
