Configuring fields to be extracted by a data processor

Question

I solved this problem using closures with Perl and I wonder if/how I could/should have done it differently?

Background:

User creates config file describing properties of some source data file that would be processed:

Type of file delimited (and the delim) or fixed width
How many fields to expect
Which fields are to be pulled out

Examples on the last (if delimited):

Fields: FIRST_NAME 2
Fields: LAST_NAME 3
Fields: ADDRESS 4

or (fixed width):

Fields: FIRST_NAME 58,30
Fields: LAST_NAME  88,30
Fields: ADDRESS    118,50

My Perl reads the config, and prepares ahead the action to take once I start reading the source data file.

Internally, I'm creating preparing my master hash (per the config), which I'll loop on for each record when reading the source.

Now to handle the said examples, I have this code, if the source file would be delimited:

sub ret_col_sub
{
    my ($col_name, $col_pos) = @_

    warn qq($col_name col_pos='$col_pos' is not a number) if $col_pos =~ /\D/;

    $col_pos--;

    # Return closure to caller, which will be called later when we're reading the source data file
    return sub
    {
        my $rec = shift;

        return $rec->[$col_pos];
    };
}

or fixed width:

sub ret_substr_sub
{
    my ($col_name, $substr_params) = @_;

    my ($offset, $len) = $substr_params =~ /(\d+)\s*,\s*(\d+)/;

    # The offset might be ZERO, check for length ... but $len must always be P O S I T I V E !
    warn qq($col_name substr_params='$substr_params' is missing the offset and length needed by substr) if not length $offset or not $len;

    $offset--;
    $len--;

    # Return closure to caller, which will be called later when we're reading the source data file
    return sub
    {
        my $rec = shift;

        return substr $rec, $offset, $len;
    };
}

Is there a better or another way of doing this, or any other suggestions?

Borodin · Accepted Answer · 2016-03-04 20:28:30Z

This seems like a desperate attempt to find a use for closures, and I believe your problem would be solved much more simply by making $rec an object, perhaps even a tied object that lets you overload indexing

Some observations:

It makes the API much less useful when the calling code has to know which form of data it is dealing with anyway. ret_col_sub returns a closure that expects a simple string for its parameter, while ret_substr_sub returns a closure that requires an array reference. All of that should be hidden from the caller
I can see that ret_col_sub won't compile
Checking that a string doesn't match \D isn't sufficient to ensure that it is numeric, as an empty string will pass that test
It is correct to rebase the substring offset, but a length is a length and that should be left untouched
The postfixed ++ and -- operators have become popular because of the name of the C++ language, but unless you are using the value returned by tge operator you should be using tge prefixed version. It is wrong to rely on the compiler to optimise out the storage of the unused original value
As has been said, your style contravenes common standards, in particular those laid out in perlstyle which are the most universal

Mithaldu · Accepted Answer · 2016-03-04 04:41:45Z

Syntactically that looks completely fine, though without seeing the context in which you use these, that's a little hard to judge.

Also be aware that \d \D count, i.e. Arabic numerals as decimals, so you may want [0-9].

Stylistically they could be improved a bit, since the braces use a snowflake style (i.e. not Perl::Tidy standard (though that is fine if workplace mandated and enforced by a perltidy config file)), and the closure at the end being a bit more wordy than strictly necessary.

I'd style this code as follows:

sub ret_col_sub {
    my ( $col_name, $col_pos ) = @_;
    warn qq($col_name col_pos='$col_pos' is not a number) if $col_pos =~ /[^0-9]/;
    $col_pos--;

    # Return closure to caller, which will be called later when we're reading the source data file
    return sub { shift->[$col_pos] };
}

As an afterthought, I'm also questioning the need to create custom file formats, instead of just using JSON/YAML/WhateverElseIsOnCPAN.

Stack Exchange Network

Configuring fields to be extracted by a data processor

2 Answers 2

Your Answer

Hot Network Questions

Configuring fields to be extracted by a data processor

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Related

Hot Network Questions