I have written a shell script to process my huge data files (each has around 7,000,000 lines, roughly a week's data in a single file). Below is a sample of the input file structure:
808 836 204 325 148 983
908 836 203 326 148 986
8 835 204 325 149 984
108 835 204 325 148 984
208 836 204 326 149 984
308 834 203 325 149 985
408 836 204 326 149 983
508 834 203 325 149 985
20130402,123358 0 $GPRMC,123358.000,A,5056.3056,N,00622.5644,E,0.00,0.00,020413,,,A*67
608 834 203 325 150 985
708 834 204 326 150 986
808 836 204 325 151 983
908 835 204 325 153 984
8 816 202 325 153 973
108 836 204 324 156 984
208 835 204 325 157 983
308 834 202 324 158 985
408 835 203 325 158 985
508 836 203 324 160 984
20130402,123359 0 $GPRMC,123359.000,A,5056.3056,N,00622.5644,E,0.01,0.00,020413,,,A*67
608 835 204 325 162 986
708 836 204 324 164 983
808 835 202 324 165 986
908 836 204 324 167 983
8 836 202 324 168 985
108 835 203 325 170 986
208 836 203 324 171 983
I have an instrument whose counter delivers a record every 0.1 second, while the GPS delivers a measurement every 1 second, interleaved with the instrument records. For every GPS record, I want to extract the 5 instrument lines above it and the 5 instrument lines below it. I don't need all 6 elements of the instrument records: I require only the 5th element of those 10 lines. In addition, from the GPS record I want to extract the date (1st element), the time (2nd element), the latitude and the longitude. So, from the sample above, I should obtain the following:
20130402 123358 5056.3056 00622.5644 148 149 149 149 149 150 150 151 153 153
20130402 123359 5056.3056 00622.5644 156 157 158 158 160 162 164 165 167 168
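As a small illustration of the GPS part of this extraction, here is a sketch that pulls the date, time, latitude and longitude out of one GPS record. The function name parse_gps is hypothetical, and the field positions assume the record layout shown in the sample above, split on both commas and blanks.

```shell
#!/bin/bash
# Sketch only: parse_gps is a hypothetical helper, not part of the script below.
# It reads GPS records on standard input and prints date, time, lat, lon.
parse_gps() {
    # With commas and blanks as separators the fields are:
    # $1=date  $2=time  $3=0  $4=$GPRMC  $5=UTC time  $6=status
    # $7=latitude  $8=N/S  $9=longitude
    awk -F'[, ]+' '{ print $1, $2, $7, $9 }'
}

gps_line='20130402,123358 0 $GPRMC,123358.000,A,5056.3056,N,00622.5644,E,0.00,0.00,020413,,,A*67'
printf '%s\n' "$gps_line" | parse_gps
# prints: 20130402 123358 5056.3056 00622.5644
```

These four values are the leading columns of each desired output line; the ten instrument values follow them.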
To extract and arrange the data as shown above, I initially wrote MATLAB and IDL codes. Then I wrote a shell script as below:
#!/bin/bash
clear
# Program to read *.bin files
# Data directory
DATAPATH='/home/xyz/datasets/2013_04_09/'
# Output data directory
OUTPATH='/home/abc/data_1/'
count=1;
# Read the files sequentially
for file in $DATAPATH*.bin; do
    INFILE=$file; # Input file
    INFILENAME=`echo $INFILE | awk -F'/' '{print $7}'`
    SUFFIX="1.txt"
    OUTFILE="$OUTPATH${INFILENAME:0:18}$SUFFIX" # Output file
    TEMPFILE="${OUTPATH}tempresult_sed.txt"
    awk '{if(length($0) >= 79) print NR,",",$0}' $INFILE | sed 's/ /,/g' > $TEMPFILE
    lines=`cat $TEMPFILE | awk -F, '{print $1}'`
    lat=`cat $TEMPFILE | awk -F, '{print $10}'`
    lon=`cat $TEMPFILE | awk -F, '{print $12}'`
    date=`cat $TEMPFILE | awk -F, '{print $4}'`
    time=`cat $TEMPFILE | awk -F, '{print $5}'`
    array_lines=($lines)
    array_time=($time)
    array_lat=($lat)
    array_lon=($lon)
    array_date=($date)
    count=${#array_lines[@]} # Number of data records
    for i in `seq 1 $count`; do
        idx=$(($i-1))
        echo ${array_lines[$idx]} ${array_date[$idx]} ${array_time[$idx]} ${array_lat[$idx]} ${array_lon[$idx]} `sed $((${array_lines[$idx]}-5))","$((${array_lines[$idx]}-1))"!d" < $INFILE | awk '{print $5}'` `sed $((${array_lines[$idx]}+1))","$((${array_lines[$idx]}+5))"!d" < $INFILE | awk '{print $5}'`
    done > $OUTFILE
    rm -f $TEMPFILE # Remove the temporary file
    let count++;
done
In the above script, for cross-checking, I have included the line number from the input file via ${array_lines[$idx]}. I started running the script on a server. More than two days have passed, and not even a single input file (with 7,000,000 lines) has been completed; so far only around 1.5 million lines have been written to my OUTFILE. Pulling all the GPS lines (i.e., strings of length = 80) and writing them to TEMPFILE takes only 1 minute, while extracting the 5th element from the 5 lines above and below each GPS record and arranging it as specified takes very long.
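A likely reason for the slowdown is that each sed call with an "N,M!d" range reads $INFILE from the first line to the last, and the inner loop starts two such calls for every GPS record. One possible direction, shown here only as a sketch, is a single awk pass that keeps the last five instrument values in a small ring buffer. The function name extract_gps_windows is hypothetical. The sketch assumes that GPS records can be recognised by the string $GPRMC and that at least five instrument lines separate consecutive GPS records. Unlike the script above, it does not print the input line number.

```shell
#!/bin/bash
# Sketch only: extract_gps_windows is a hypothetical name.
# Usage: extract_gps_windows INFILE > OUTFILE
extract_gps_windows() {
    awk '
    /\$GPRMC/ {
        # GPS record: flush a record that never received its 5 following lines.
        if (pending != "") print pending
        split($0, f, /[, ]+/)        # f[1]=date f[2]=time f[7]=lat f[9]=lon
        pending = f[1] " " f[2] " " f[7] " " f[9]
        # Append the 5th column of the (up to) 5 preceding instrument lines.
        for (i = (n > 5 ? n - 4 : 1); i <= n; i++)
            pending = pending " " buf[i % 5]
        need = 5                     # instrument lines still to collect
        next
    }
    {
        # Instrument record: remember its 5th column in the ring buffer.
        n++
        buf[n % 5] = $5
        if (need > 0) {
            pending = pending " " $5
            if (--need == 0) { print pending; pending = "" }
        }
    }
    END { if (pending != "") print pending }
    ' "$1"
}
```

With this approach each input file is read once, so the cost grows with the number of lines rather than with the number of lines times the number of GPS records.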
I am looking for someone who can suggest improvements or corrections to the above code so that the computation becomes faster. I have already posted this script in a previous post, but for a different query (http://stackoverflow.com/questions/17612290/how-to-extract-lines-in-between-a-file-read-a-part-of-a-string-with-linux-shel), so please don't worry about cross-posting. Please suggest how I can extract the data as shown above from my input file structure in much less computation time for such huge files.
awk
sed
. Try out a real programming language such as Perl / Python / C. – hivert Jul 18 '13 at 9:56