Java regex of string

Question

I want to parse strings to extract fields from them. The strings come from a dataset and have the following format (the -> represents a tab, and the * represents a space):

Date(yyyymmdd)->Date(yyyymmdd)->*City,*State*-->Description

I am only interested in the 1st date and the State. I tried regex like this:

String txt="19951010    19951011     Red City, WI                 Description";

    String re1="(\\d+)";    // Integer Number 1
    String re2=".*?";   // Non-greedy match on filler
    String re3="(?:[a-z][a-z]+)";   // Uninteresting: word
    String re4=".*?";   // Non-greedy match on filler
    String re5="(?:[a-z][a-z]+)";   // Uninteresting: word
    String re6=".*?";   // Non-greedy match on filler
    String re7="((?:[a-z][a-z]+))"; // Word 1

    Pattern p = Pattern.compile(re1+re2+re3+re4+re5+re6+re7,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(txt);
    if (m.find())
    {
        String int1=m.group(1);
        String word1=m.group(2);
        System.out.print("("+int1.toString()+")"+"("+word1.toString()+")"+"\n");
    }

It works fine if the city has two words (Red City): the State is extracted properly. But if the City has only one word, it does not work. I can't figure it out. I don't need to use regex and am open to any other suggestions. Thanks.

Jason C · Accepted Answer · 2013-11-17 03:01:57Z

Problem:

Your problem is that each component of your current regex essentially matches a number or [a-z] word, separated by anything that isn't [a-z], which includes commas. So your parts for a two word city are:

Input: 
  19951010 19951011 Red City, WI Description

Your components:
  String re1="(\\d+)";    // Integer Number 1
  String re2=".*?";   // Non-greedy match on filler
  String re3="(?:[a-z][a-z]+)";   // Uninteresting: word
  String re4=".*?";   // Non-greedy match on filler
  String re5="(?:[a-z][a-z]+)";   // Uninteresting: word
  String re6=".*?";   // Non-greedy match on filler
  String re7="((?:[a-z][a-z]+))"; // Word 1

What they match:
  re1: "19951010"
  re2: " 19951011 "
  re3: "Red" (stops at non-letter, e.g. whitespace)
  re4: " "
  re5: "City" (stops at non-letter, e.g. the comma)
  re6: ", " (stops at word character)
  re7: "WI"

But with a one-word city:

Input: 
  19951010 19951011 Pittsburgh, PA Description

What they match:
  re1: "19951010"
  re2: " 19951011 "
  re3: "Pittsburgh" (stops at non-letter, e.g. the comma)
  re4: ", " (stops at word character)
  re5: "PA" (stops at non-letter, e.g. whitespace)
  re6: " " (stops at word character)
  re7: "Description" (but you want this to be the state)

Solution:

You should do two things. First, simplify your regex a bit; you are going kind of crazy specifying greedy vs. reluctant, etc. Just use greedy patterns. Second, think about the simplest way to express your rules.

Your rules really are:

Date
A bunch of characters that aren't a comma (including second date and city name).
A comma.
State (one word).

So build a regex that sticks to that. You can, as you are doing now, take a shortcut by skipping the second number, but note that you do lose support for cities that start with numbers (which probably won't happen). Also, you don't care about anything after the state. So, e.g.:

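String re1 = "(\\d+)";   // match first number
String re2 = "[^,]*";    // skip everything that's not a comma
String re3 = ",";        // skip the comma
String re4 = "[\\s]*";   // skip whitespace
String re5 = "([a-z]+)"; // match letters (state)

String regex = re1 + re2 + re3 + re4 + re5;
// Compile case-insensitively, as in your original code, so [a-z] matches "WI".
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);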
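
Putting those pieces together, a minimal runnable sketch might look like the following. The class name DateStateRegex and the helper parse() are made up for illustration, and the sample inputs assume the tab and space layout described in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateStateRegex {
    // Same five components as above; CASE_INSENSITIVE lets [a-z] match "WI".
    private static final Pattern LINE = Pattern.compile(
            "(\\d+)" + "[^,]*" + "," + "[\\s]*" + "([a-z]+)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns {date, state}, or null if the line does not fit.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "19951010\t19951011\t Red City, WI \t\tDescription",
            "19951010\t19951011\t Pittsburgh, PA \t\tDescription"
        };
        for (String in : inputs) {
            String[] r = parse(in);
            System.out.println("(" + r[0] + ")(" + r[1] + ")");
        }
        // prints (19951010)(WI) and then (19951010)(PA)
    }
}
```

Because [^,]* cannot cross a comma, the match always stops at the first comma in the line, regardless of how many words the city has.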

There are other options as well, but I personally find regular expressions to be very straightforward for things like this. You could use various combinations of split(), as other posters have detailed. You could directly look for commas and whitespace with indexOf() and pull out substrings. You could even convince a Scanner or perhaps a StringTokenizer or StreamTokenizer to work for you. However, regular expressions exist to solve problems like this and are a good tool for the job.
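
For comparison, the indexOf() and substring route could be sketched roughly like this. The class DateStateIndexOf and its helpers are hypothetical names, and the sketch assumes the line starts with the first date and contains at least one comma.

```java
class DateStateIndexOf {
    // Index of the first whitespace character at or after 'from', or s.length() if none.
    static int nextWhitespace(String s, int from) {
        int i = from;
        while (i < s.length() && !Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return i;
    }

    // Hypothetical helper: returns {date, state}, or null if the line has no comma.
    static String[] parse(String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            return null;
        }
        // The first date runs from the start of the line to the first whitespace.
        String date = line.substring(0, nextWhitespace(line, 0));
        // The state is the first run of non-whitespace after the comma.
        int start = comma + 1;
        while (start < line.length() && Character.isWhitespace(line.charAt(start))) {
            start++;
        }
        String state = line.substring(start, nextWhitespace(line, start));
        return new String[] { date, state };
    }

    public static void main(String[] args) {
        String[] r = parse("19951010\t19951011\t Pittsburgh, PA \t\tDescription");
        System.out.println(r[0] + " " + r[1]); // prints "19951010 PA"
    }
}
```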

Here is an example with StringTokenizer:

StringTokenizer t = new StringTokenizer(txt, " \t");
String date = t.nextToken();
t.nextToken(); // skip second date
t.nextToken(","); // change delimiter to comma and skip city
t.nextToken(" \t"); // back to whitespace and skip comma
String state = t.nextToken();

Still, I feel a regex expresses the rules more cleanly.

By the way, for future debugging, it sometimes helps to just print out all of the capture groups; this can give you insight into what is matching what. A good technique is to put every component of your regex in a capture group temporarily, then print them all out.
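
As a sketch of that debugging technique, the following wraps each of the seven original components in its own capture group and prints what each one matched for the one-word city. GroupDump and groups() are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDump {
    // Returns every capture group of the first match, or an empty array if there is no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // The original seven components, each temporarily wrapped in a capture group.
        String regex = "(\\d+)(.*?)([a-z][a-z]+)(.*?)([a-z][a-z]+)(.*?)([a-z][a-z]+)";
        String[] g = groups(regex, "19951010 19951011 Pittsburgh, PA Description");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + (i + 1) + ": \"" + g[i] + "\"");
        }
    }
}
```

The last group comes out as "Description" rather than "PA", which shows directly where the original pattern goes wrong.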

Ryan Saxe · Answer 2 · 2013-11-17 02:49:34Z

There's no need to be so complex with this. You can split on the comma, then on whitespace!

// s is your string
String[] first = s.split("\\s*,\\s*");          // split around the comma
String[] firstHalf = first[0].split("\\s+");    // dates and city words
String[] secondHalf = first[1].split("\\s+");   // state and description words
String date = firstHalf[0];
String state = secondHalf[0];

Now you have your date and your state! Do with them what you want.
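
A self-contained version of this split approach might look like the sketch below. The class DateStateSplit and the parse() helper are hypothetical; the trim() call and the limit of 2 on the first split are additions that guard against leading whitespace and commas in the description.

```java
class DateStateSplit {
    // Hypothetical helper: returns {date, state}, or null if the line has no comma.
    static String[] parse(String line) {
        // Split once, at the first comma, dropping whitespace around it.
        String[] halves = line.trim().split("\\s*,\\s*", 2);
        if (halves.length < 2) {
            return null;
        }
        String date = halves[0].split("\\s+")[0];   // first token before the comma
        String state = halves[1].split("\\s+")[0];  // first token after the comma
        return new String[] { date, state };
    }

    public static void main(String[] args) {
        String[] inputs = {
            "19951010    19951011     Red City, WI                 Description",
            "19951010    19951011     Pittsburgh, PA               Description"
        };
        for (String in : inputs) {
            String[] r = parse(in);
            System.out.println(r[0] + ", " + r[1]);
        }
    }
}
```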

answered Nov 17 '13 at 2:33 by Ryan Saxe · edited Nov 17 '13 at 2:49 by Jason C

This has the same issue that the OP's regex does, where it fails if the city was not two words. What you would want to do is split the string at the comma, then split each of those strings at whitespace, then take the first token in each of those. – Jason C Nov 17 '13 at 2:40

fair enough. I was basing this simply on the placements of the "*"s and "->"s in the question – Ryan Saxe Nov 17 '13 at 2:41

Now this should work fine! Thank you for pointing this out, I should have read the question more carefully! – Ryan Saxe Nov 17 '13 at 2:44

hwnd · Answer 3 · 2013-11-17 03:11:24Z

You can do this by using the split() method.

String s = "19951010    19951011     Red City, WI                 Description";
String[] parts = s.split("(?<![^\\dA-Z,])\\s+");
System.out.println(parts[0] + ", " + parts[3]);

Regular expression:

(?<!           look behind to see if there is not:
 [^\dA-Z,]     any character except: digits (0-9), 'A' to 'Z', ','
)              end of look-behind
\s+            whitespace (\n, \r, \t, \f, and " ") (1 or more times)
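
To see what the look-behind does, it can help to print every part the split produces. The sketch below (LookbehindSplit and parts() are hypothetical names) applies the same pattern to a two-word and a one-word city.

```java
import java.util.Arrays;

class LookbehindSplit {
    // Splits on whitespace that follows a digit, an uppercase letter, a comma,
    // or the start of the string; whitespace after a lowercase letter is kept.
    static String[] parts(String s) {
        return s.split("(?<![^\\dA-Z,])\\s+");
    }

    public static void main(String[] args) {
        String twoWord = "19951010    19951011     Red City, WI                 Description";
        String oneWord = "19951010    19951011     Pittsburgh, PA               Description";
        System.out.println(Arrays.toString(parts(twoWord)));
        System.out.println(Arrays.toString(parts(oneWord)));
    }
}
```

The space inside "Red City" follows a lowercase letter, so the city stays in one piece and the state lands at index 3 in both cases. The flip side is that a multi-word city in which a word other than the last ends in a capital letter or a digit would be split early and shift the indexes.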

If you decide to continue going the route of matching, you could use the following.

String s = "19951010    19951011     Red City, WI                 Description";
Pattern p = Pattern.compile("^(\\d+)[^,]*,\\s*([^\\s]+)");
Matcher m = p.matcher(s);
while (m.find()) {
  System.out.println(m.group(1) + ", " + m.group(2));
}

Regular expression:

^             the beginning of the string
(             group and capture to \1:
 \d+          digits (0-9) (1 or more times)
)             end of \1
[^,]*         any character except: ',' (0 or more times)
,             ','
\s*           whitespace (\n, \r, \t, \f, and " ") (0 or more times)
(             group and capture to \2:
 [^\s]+       any character except: whitespace (\n, \r, \t, \f, and " ") (1 or more times)
)             end of \2
