I am looking for a Perl (5.8.8) script for CSV parsing that follows the CSV standard
(see Wikipedia or RFC 4180 for details).
Speed matters, but it is more important that the code uses no libraries at all.
This is what I have so far:
#!/usr/bin/perl
use strict;
use warnings;
sub csv {
no warnings 'uninitialized';
my ($x, @r) = (pop, ()); my $s = $x ne '';
$x =~ s/\G(?:(?:\s*"((?:[^"]|"")*)"\s*)|([^",\n]*))(,|\n|$)/{
push @r, $1.$2 if $1||$2||$s; $s = $3; ''}/eg;
$r[$_] =~ s/""/"/g for 0..@r-1;
$x? undef : @r;
}
my @test = csv( '"one",two,,"", "three,,four", five ," si""x",,7, "eight",' .
' 9 ten,, ' . "\n" . 'a,b,,c,,"d' . "\n" . 'e,f",g,' );
(!defined $test[0])? die : print "|$_|\n" for @test;
The same code, with comments:
#!/usr/bin/perl
use strict;
use warnings;
sub csv {
no warnings 'uninitialized';
# lets us use an undefined value (an unset $1 or $2) as an empty string without a warning
my ($x, @r) = (pop, ()); my $s = $x ne '';
# function argument (input string) goes to $x
# result array @r = ()
# variable $s indicates if we (still) have something to parse
$x =~ s/\G(?:(?:\s*"((?:[^"]|"")*)"\s*)|([^",\n]*))(,|\n|$)/{
# match a double-quoted element or a non-quoted element
# a double-quoted element may be surrounded by spaces \s*, which are ignored;
# its content is any run of characters in which every double-quote
# character is doubled ([^"]|"")*
# a non-quoted element is any run of characters other than double-quote,
# comma and new-line ([^",\n]*)
# either kind of element must be followed by a comma, a new-line character
# or the end of the input (,|\n|$)
push @r, $1.$2 if $1||$2||$s;
# push the element to the @r result array; an empty element is pushed
# only while $s says there is still input to parse
$s = $3;
# do we (still) have something to parse?
''
# replace match with empty string, so at the end we can check if all is done
}/eg;
# /e = execute { ... } for each match, /g = repeatedly
$r[$_] =~ s/""/"/g for 0..@r-1;
# collapse each doubled double-quote ("") into a single double-quote
$x? undef : @r;
# if $x is not empty, then CSV syntax error occurred and function returns undef
# otherwise function returns array with all matches
}
my @test = csv( '"one",two,,"", "three,,four", five ," si""x",,7, "eight",' .
' 9 ten,, ' . "\n" . 'a,b,,c,,"d' . "\n" . 'e,f",g,' );
# simple test
(!defined $test[0])? die : print "|$_|\n" for @test;
# die if csv returns an error, otherwise print elements surrounded with pipe char |
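The pattern that the comments walk through can also be written with the x flag, so that each part carries its own note. The following is only a sketch for studying the pattern in isolation: `$FIELD` and `first_field` are hypothetical names that are not part of the script, and the helper looks at the first element of a string only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as in csv(), spread out with the x flag.
# Comments inside the pattern avoid sigils and slashes, because Perl would
# interpolate the former and treat the latter as the closing delimiter.
my $FIELD = qr/
    \G                                    # start where the last match ended
    (?:
        \s* " ( (?: [^"] | "" )* ) " \s*  # capture 1: quoted element, outer spaces dropped
      |
        ( [^",\n]* )                      # capture 2: unquoted element, kept verbatim
    )
    (,|\n|$)                              # capture 3: delimiter, empty at end of input
/x;

# Hypothetical helper: parse only the first element of the given text.
# Returns (value, delimiter), or an empty list if no valid element starts there.
sub first_field {
    my ($text) = @_;
    return unless $text =~ $FIELD;
    my ($value, $sep) = (defined $1 ? $1 : $2, $3);
    $value =~ s/""/"/g;                   # undo the doubled-quote escape
    return ($value, $sep);
}

for my $in ('" si""x",7', ' five ,x', ' "three,,four" ,x', '"open') {
    my @got = first_field($in);
    printf "%-20s => %s\n", $in, @got ? "|$got[0]|" : 'no match';
}
```

The last input has an opening quote without a closing one, so neither alternative can match at the start of the string and the helper returns an empty list.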
The test call to csv() produces the following output:
|one|
|two|
||
||
|three,,four|
| five |
| si"x|
||
|7|
|eight|
| 9 ten|
||
| |
|a|
|b|
||
|c|
||
|d
e,f|
|g|
||
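Note that this output is one flat list: the new-line between the two input records leaves no trace, so `| |` is followed directly by `|a|`. If record boundaries matter, one possible direction is to collect the elements per record. The sketch below assumes a hypothetical `csv_rows` function built on the same pattern; unlike csv(), it also assumes that a new-line at the very end of the input terminates the last record instead of starting an empty one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of csv(): returns a reference to an array of records,
# each record being an array reference of elements, or undef on a syntax error.
sub csv_rows {
    my ($text) = @_;
    my (@rows, @row);
    return \@rows if $text eq '';
    pos($text) = 0;
    # same element pattern as csv(); the c flag keeps pos() after a failed match
    while ($text =~ /\G(?:\s*"((?:[^"]|"")*)"\s*|([^",\n]*))(,|\n|$)/gc) {
        my ($value, $sep) = (defined $1 ? $1 : $2, $3);
        $value =~ s/""/"/g;                  # undo the doubled-quote escape
        push @row, $value;
        next if $sep eq ',';                 # more elements in this record
        push @rows, [@row];                  # new-line or end of input closes it
        @row = ();
        last if pos($text) == length $text;  # assumption: a final new-line adds no record
    }
    return if pos($text) != length $text;    # unparsed remainder: syntax error
    return \@rows;
}

my $rows = csv_rows(qq{one,"two, 2"\n"x\ny",3\n});
print join('|', @$_), "\n" for @$rows;
```

Returning a reference keeps the error case (undef) distinct from an empty input, which yields a reference to an empty array.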
Any improvements would be appreciated.
l (el) as a variable because it looks like a 1 (one). – Apprentice Queue Mar 28 '12 at 6:08