
I am trying to read a binary file with dynamic-length records, whose data looks like this:

field1field2field3field4vector1vector2

Here, field3 defines how many vectors follow.

For example, if field3 is 2, then vector1 and vector2 would be present, and if the value is 3, then vector1, vector2, and vector3 would be present.

Let's say fields 1-4 are fixed with a combined length of 2652 bytes, and each vector is 301 bytes long. field3 is 3 bytes starting at offset 2396. So a record with 3 vectors, for example, is 2652 + 3 × 301 = 3555 bytes long.

I have written the code below, which works but performs very badly.

my $string;
my $rep_factor;
my $size;

open (FILE, $ARGV[0]) or die $!;
my $re = 2396;
my $rec = 0;
while (<FILE>) {
   seek(FILE,$re,0);
   read(FILE,$rep_factor,2);
   my $rep_fact = undefined2defined(convert2ascii_decimal($rep_factor,0));
   $size = ($rep_fact * 301) + 2652;
   seek(FILE,$rec,0);
   read FILE,$string,$size;

   filewrite ($ARGV[1], recordparse($string));
   $rec = $size + $rec;
   $re = $size + 2396;
}
    
Could you post helper functions convert2ascii_decimal and undefined2defined as well? –  200_success Jan 12 at 18:01
    
It would also help to know what filewrite and recordparse do. –  Brythan Jan 12 at 18:53
    
“field3 is 3 bytes from position 2396” Do you mean field3 is 3 bytes long at offset 2396 within the record? But your code reads only 2 bytes into $repfactor. Is there a practical maximum for the value of field3 that your code can rely on? –  Borodin Feb 19 at 11:43

1 Answer

open (FILE, $ARGV[0]) or die $!;

It won't really make a difference in this script, but the standard is to use a lexical variable rather than a bareword filehandle.

open (my $fh, $ARGV[0]) or die $!;

This gives you the benefits of lexical scoping.
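While you're at it, the three-argument form of open with an explicit mode is generally preferred, and since the file is binary, binmode avoids any newline translation. A sketch (the error message wording is my own):

open( my $fh, '<', $ARGV[0] ) or die "Cannot open $ARGV[0]: $!";
binmode $fh;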

while (<FILE>) {

What this says is: while there are still lines in <FILE>, do the things in the block. It splits the input at the line-ending marker, which is normally a linefeed. It would normally be used like this:

while ( my $line = <$fh> ) {
    chomp $line;

Then you could break the line up into parts with unpack.

    my ( $field1and2, $vector_count, $field4, $vector_field ) = unpack( 'A2396 A3 A253 A*', $line );

Note that I may have the format wrong. It's easier if one can test against a valid file. You may not even need that much:

    my ( undef, $vector_field ) = unpack( 'A2652 A*', $line );

It seems like you only use the vectors portion.
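If recordparse does need the individual vectors, the vector portion could be split the same way, since each vector is a fixed 301 bytes (untested; lowercase a keeps trailing spaces and nulls intact, which matters for binary data):

    my @vectors = unpack '(a301)*', $vector_field;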

Instead, you use the while loop just to keep iterating. However, that wouldn't normally work unless records were terminated with linefeeds.

open (my $fh, $ARGV[0]) or die $!;
while ( my $line = <$fh> ) {
    chomp $line;

    filewrite ($ARGV[1], recordparse($line));
}

See how that's simpler? No more seeks. This doesn't even require unpack, as it looks like you pass the whole line to recordparse.

One more thing:

filewrite ($ARGV[1], recordparse($string));

I'm not sure what this function does, but it looks like it takes a file name and writes to it. This suggests that it opens the file each time. Better would be to open the file once and pass a filehandle to it.
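I don't know filewrite's signature, so this is only a sketch of the idea; it assumes filewrite can be changed to take a filehandle instead of a filename:

open( my $out, '>', $ARGV[1] ) or die "Cannot open $ARGV[1]: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    filewrite( $out, recordparse($line) );    # handle opened once, reused for every record
}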


If you were getting multiple records with the original code but there are no linefeeds in the file, then I suspect that the seek and read commands are resetting the line input operator.

Here's an alternate version that reads the entire file at once (it assumes there are no linefeed bytes in the data).

open (my $fh, $ARGV[0]) or die $!;
my $file = <$fh>;
while ( '' ne $file ) {
    my $rep_factor = substr $file, 2396, 3;
    my $vector_count = undefined2defined(convert2ascii_decimal($rep_factor, 0));
    my $size = ($vector_count * 301) + 2652;

    my $record = substr $file, 0, $size;
    $file = substr $file, $size;

    filewrite ($ARGV[1], recordparse($record));
}

This uses substr rather than unpack because it's easier to specify varying lengths with substr. The first substr call asks for $size bytes starting at the beginning; the second returns everything after the first $size bytes.

I'm a little worried that there is some maximum length that the <> operator returns, but I don't know what it might be. Since I don't have an example file, I'm not testing the code at all.
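If that limit (or a stray linefeed byte in the data) ever becomes a problem, one alternative that avoids <> entirely is to read each record piecewise with read, using the fixed 2652-byte header to find the vector count. This is untested and still relies on your undefined2defined and convert2ascii_decimal helpers:

open( my $fh, '<', $ARGV[0] ) or die "Cannot open $ARGV[0]: $!";
binmode $fh;

while ( read( $fh, my $header, 2652 ) == 2652 ) {
    # 3-byte vector count at offset 2396 within the fixed part
    my $rep_factor   = substr $header, 2396, 3;
    my $vector_count = undefined2defined( convert2ascii_decimal( $rep_factor, 0 ) );

    # read the variable part and reassemble the full record
    read( $fh, my $vectors, $vector_count * 301 );
    filewrite( $ARGV[1], recordparse( $header . $vectors ) );
}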

    
Since this is a binary file, there are no line terminators. If I do wc -l <file>, it gives me 1 record. –  ashish Jan 13 at 2:56
    
I tried the code, but it's only returning 1 record in the output. –  ashish Jan 13 at 2:57
    
@ashish I don't have time tonight, but I'll take another look in the morning. This would be easier with a copy of the file. Note that you read the entire file into memory for the first iteration of the loop. You should either process that string, or stop creating it. I'm not sure which is more efficient. I continue to believe that there is an unpack based solution. –  Brythan Jan 13 at 4:15
