Problem:
The problem is that each component of your current regex matches either a number or an [a-z] word, and anything that isn't [a-z] (commas included) is treated as filler between them. So with a two-word city, the components match as follows:
Input:
19951010 19951011 Red City, WI Description
Your components:
String re1="(\\d+)"; // Integer Number 1
String re2=".*?"; // Non-greedy match on filler
String re3="(?:[a-z][a-z]+)"; // Uninteresting: word
String re4=".*?"; // Non-greedy match on filler
String re5="(?:[a-z][a-z]+)"; // Uninteresting: word
String re6=".*?"; // Non-greedy match on filler
String re7="((?:[a-z][a-z]+))"; // Word 1
What they match:
re1: "19951010"
re2: " 19951011 "
re3: "Red" (stops at non-letter, e.g. whitespace)
re4: " "
re5: "City" (stops at non-letter, e.g. the comma)
re6: ", " (stops at word character)
re7: "WI"
But with a one-word city:
Input:
19951010 19951011 Pittsburgh, PA Description
What they match:
re1: "19951010"
re2: " 19951011 "
re3: "Pittsburgh" (stops at non-letter, e.g. the comma)
re4: ", "
re5: "PA" (stops at non-letter, e.g. whitespace)
re6: " " (stops at word character)
re7: "Description" (but you want this to be the state)
Solution:
You should do two things. First, simplify your regex a bit; you are going kind of crazy specifying greedy vs. reluctant, etc. Just use greedy patterns. Second, think about the simplest way to express your rules.
Your rules really are:
- Date
- A bunch of characters that aren't a comma (including second date and city name).
- A comma.
- State (one word).
So build a regex that sticks to that. You can, as you are doing now, take a shortcut by skipping the second number, but note that you do lose support for cities that start with numbers (which probably won't happen). You also don't care about the city, so there is no need to capture it. So, e.g.:
String re1 = "(\\d+)"; // match first number
String re2 = "[^,]*"; // skip everything that's not a comma (second date and city)
String re3 = ","; // skip the comma
String re4 = "\\s*"; // skip whitespace
String re5 = "([a-z]+)"; // match letters (state); assumes Pattern.CASE_INSENSITIVE, as the matches above imply
String regex = re1 + re2 + re3 + re4 + re5;
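To make that concrete, here is a minimal runnable sketch of the regex above applied to both sample lines. The class name DateStateExtractor and the parse() helper are just names I made up for illustration, and I am assuming the pattern is compiled with Pattern.CASE_INSENSITIVE (otherwise [a-z] would not match "WI").

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class DateStateExtractor {
    // Same components as above, concatenated. CASE_INSENSITIVE is an assumption
    // so that [a-z] also matches upper-case state codes.
    private static final Pattern LINE = Pattern.compile(
            "(\\d+)" + "[^,]*" + "," + "\\s*" + "([a-z]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns {date, state}, or null if the line does not fit the rules. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] inputs = {
            "19951010 19951011 Red City, WI Description",
            "19951010 19951011 Pittsburgh, PA Description"
        };
        for (String input : inputs) {
            String[] parts = parse(input);
            // Prints "19951010 / WI" and then "19951010 / PA"
            System.out.println(parts[0] + " / " + parts[1]);
        }
    }
}
```

The one-word and two-word cities are handled identically because nothing in the pattern counts words; everything up to the comma is skipped in one go.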
There are other options as well, but I personally find regular expressions to be very straightforward for things like this. You could use various combinations of split(), as other posters have detailed. You could directly look for commas and whitespace with indexOf() and pull out substrings. You could even convince a Scanner or perhaps a StringTokenizer or StreamTokenizer to work for you. However, regular expressions exist to solve problems like this and are a good tool for the job.
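For comparison, the indexOf() route might look roughly like the sketch below. IndexOfParser and parse() are made-up names, and the code assumes the fields are separated by plain spaces and that the first comma in the line is the one after the city.

```java
// Hypothetical helper class, for illustration only.
class IndexOfParser {
    /** Returns {date, state}, or null if there is no space or no comma in the line. */
    static String[] parse(String line) {
        int firstSpace = line.indexOf(' ');
        int comma = line.indexOf(',');
        if (firstSpace < 0 || comma < 0) {
            return null;
        }
        // The date is everything before the first space.
        String date = line.substring(0, firstSpace);
        // The state is the first space-delimited word after the comma.
        String afterComma = line.substring(comma + 1).trim();
        int end = afterComma.indexOf(' ');
        String state = end < 0 ? afterComma : afterComma.substring(0, end);
        return new String[] { date, state };
    }
}
```

It works, but the rules are now spread across index arithmetic instead of being stated in one pattern.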
Here is an example with StringTokenizer:
// txt is the input line; StringTokenizer lives in java.util
StringTokenizer t = new StringTokenizer(txt, " \t");
String date = t.nextToken();
t.nextToken(); // skip second date
t.nextToken(","); // change delimiter to comma and skip city
t.nextToken(" \t"); // back to whitespace and skip the comma itself
String state = t.nextToken();
Still, I feel a regex expresses the rules more cleanly.
By the way, for future debugging, it sometimes helps to just print out all of the capture groups; this can give you insight into what is matching what. A good technique is to temporarily put every component of your regex in a capture group, then print them all out.
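As a rough sketch of that technique, here is your original regex with every component wrapped in its own capture group. GroupDebugger and groups() are names I invented for the example, and Pattern.CASE_INSENSITIVE is assumed, since [a-z] would not match "Pittsburgh" without it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical debugging helper, for illustration only.
class GroupDebugger {
    /** Returns the text matched by each capture group in order, or null if there is no match. */
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // The original seven components, each temporarily turned into a capture group.
        String regex = "(\\d+)" + "(.*?)" + "([a-z][a-z]+)" + "(.*?)"
                + "([a-z][a-z]+)" + "(.*?)" + "([a-z][a-z]+)";
        String[] parts = groups(regex, "19951010 19951011 Pittsburgh, PA Description");
        for (int i = 0; i < parts.length; i++) {
            // Quotes make whitespace-only matches visible.
            System.out.println("re" + (i + 1) + ": \"" + parts[i] + "\"");
        }
    }
}
```

For the Pittsburgh line, the last group prints "Description" rather than "PA", which shows the problem at a glance.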