Take the 2-minute tour ×
Stack Overflow is a question and answer site for professional and enthusiast programmers. It's 100% free, no registration required.

Can anyone help me with Regex ? I got an Java Program, that reads in .csv files to load it into a Database.

Currently ist uses Pattern csvPattern = Pattern.compile("\\s*(\"[^\"]*\"|[^|]*)\\s*,?");

but wenn I do matcher = csvPattern.matcher(line); to read the files line by line I only get null-values. The files have the following format (many morre lines, some with comma in it, '|' as a seperator and at the end of each line):

abstact of the first file:

0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |

second:

|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665|121.65|AUTOMOBILE|l accounts. blithely ironic theodolites integrate boldly: caref|
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364|7498.12|AUTOMOBILE| deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov|
4|Customer#000000004|XxVSJsLAGtn|4|14-128-190-5944|2866.83|MACHINERY| requests. final, regular ideas sleep final accou|

third:

5|Supplier#000000005|Gcdm2rJRzl5qlTVzc|11|21-151-690-3663|-283.84|. slyly regular pinto bea|
6|Supplier#000000006|tQxuVm7s7CnK|14|24-696-997-4969|1365.79|final accounts. regular dolphins use against the furiously ironic decoys. |
7|Supplier#000000007|s,4TicNGB4uO6PaSqNBUq|23|33-990-965-2201|6820.35|s unwind silently furiously regular courts. final requests are deposits. requests wake quietly blit|
8|Supplier#000000008|9Sq4bBH2FQEmaFOocY45sRTxo6yuoG|17|27-498-742-3860|7627.85|al pinto beans. asymptotes haggl|
9|Supplier#000000009|1KhUgZegwM3ua7dsYmekYBsK|10|20-403-398-8662|5302.37|s. unusual, even requests along the furiously regular pac|

fourth:

1|2|3325|771.64|, even theodolites. regular, final theodolites eat after the carefully pending foxes. furiously regular deposits sleep slyly. carefully bold realms above the ironic dependencies haggle careful|
1|2502|8076|993.49|ven ideas. quickly even packages print. pending multipliers must have to are fluff|
1|5002|3956|337.09|after the fluffily ironic deposits? blithely special dependencies integrate furiously even excuses. blithely silent theodolites could have to haggle pending, express requests; fu|
1|7502|4069|357.84|al, regular dependencies serve carefully after the quickly final pinto beans. furiously even deposits sleep quickly final, silent pinto beans. fluffily reg|

fifth:

1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |

sixth:

134823|saddle midnight thistle honeydew lime|Manufacturer#4|Brand#43|STANDARD BURNISHED BRASS|44|WRAP CAN|1857.82|ges. furiously ir|
134824|coral red indian thistle sandy|Manufacturer#5|Brand#55|PROMO BURNISHED COPPER|29|LG JAR|1858.82|final p|
134825|saddle purple orchid cornsilk medium|Manufacturer#4|Brand#44|PROMO POLISHED NICKEL|21|LG CASE|1859.82|nal accounts us|
134826|turquoise sky lime cornsilk peach|Manufacturer#1|Brand#11|SMALL BURNISHED TIN|25|SM CAN|1860.82| haggle|

seventh:

0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
1|AMERICA|hs use ironic, even requests. s|

eighth:

4|136777|O|32151.78|1995-10-11|5-LOW|Clerk#000000124|0|sits. slyly regular warthogs cajole. regular, regular theodolites acro|
5|44485|F|144659.20|1994-07-30|5-LOW|Clerk#000000925|0|quickly. bold deposits sleep slyly. packages use slyly|

(the csv were created using the DBGen-tool from TPC fpr tpc-h, in case you wonder)

I hope you understand what I need and can help me out on this. Thank you very much!

EDIT: Using String.split("|");' sure seems obvious, but the thing is, that the programm I'm working with is quite complex and relies on the regex.pattern and regex.matcher at various parts. So since I'm not very familiar with the program and java itself, the only solution for me is to use the given code and just replace the regular expression by one that works for me.

EDIT2: the thing is I'm trying to use this TPC-H implementation from OLTP-Bench: https://github.com/ben-reilly/oltpbench/blob/master/src/com/oltpbenchmark/benchmarks/tpch/TPCHLoader.java#L347

where the problematic line is 347. It's a full implemetation of the TPC-H database benchmark, but without a data generator. So I use the dbgen tool provided by the TPC to generate the csv files. I can't get in contact with the developer sadly.

share|improve this question
2  
Is there a reason you are using regex instead of String.split()? –  Mike Elofson Jun 24 at 13:57
    
Is there a reason why you don't just use a CSV parser? –  assylias Jun 24 at 14:09
    
Can you explain what part of splitting requires regex? For example, it is common for some tools to export CSV files without properly escaping values, and then you're stuck with a mess where you can't just do a straight split on delimiters. –  MxyL Jun 24 at 14:50
    
no sorry. I can't since I realy don't know what I'm doing : ( I updated the post –  user3332816 Jun 24 at 15:04
add comment

2 Answers

up vote 0 down vote accepted

Given your source, you could probably just replace the comma with a pipe, since from the comments, all that pattern does is split the string on a delimiter (except the ones in double quotes)

eg: from

\\s*(\"[^\"]*\"|[^,]*)\\s*,?

to

\\s*(\"[^\"]*\"|[^|]*)\\s*\\|?

As for your number exception, you need to debug the way you're calling the CSV loader.

I've never used that tool before, but if you look at line 352

for (int i = 0; i < types.length; ++i) {

Now look at the switch block that starts at line 362: it defines the types that each field should be casted to.

switch(types[i]) {
    case DOUBLE:
        prepStmt.setDouble(i+1, Double.parseDouble(field));
        break;
...

This type of conversion is likely going to cause issues if you don't properly specify the types.

share|improve this answer
    
This is the right idea, with two caveats: First, MxyL undid some of the escaping in the string... Secondly, a zero-width string would match this expression. Since all of your samples end with a pipe, I would eliminate the final ?. Final answer: \\s*(\"[^\"]*\"|[^|]*)\\s*\|. –  dcsohl Jun 24 at 15:55
    
I put the escapes back in, but I think that makes it less readable. –  MxyL Jun 24 at 16:06
    
That's one of the woes of doing them in Java... I would have been fine if you had stripped the escapes out of the first line, FWIW. I just thought it confusing that you seemingly suggested changing the escaped version into the non-escaped version. –  dcsohl Jun 24 at 16:16
    
@user3332816 Your issue is not related to regex anymore. You need to debug your code. –  MxyL Jun 24 at 17:07
    
@dcsohl I see what you mean. Yes the consistency would be confusing. –  MxyL Jun 24 at 18:19
show 1 more comment

I would suggest using String.split("|");. This will give you an array of strings representing the text around your line.

If however, you really want to use a regex:

Pattern csvPattern = Pattern.compile("\\s*(\\d*)\\|(([^|]+)\\|)+");

This one should match a number(or none), a pipe, and then a repeating pattern of Something|

share|improve this answer
    
thaks for your suggestion! sadly it does not work for me. sorry, that I wasn't clear enough. I edited my original post. –  user3332816 Jun 24 at 14:39
    
woah theres a lot of possibilities for what can be inbetween those pipes. Lemme take another look –  Adam Yost Jun 24 at 14:42
add comment

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.