Java regex for matching multiple keys in a string

Question

Consider an input string like

Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5

and the regular expression

\b(TWO|FOUR)=([^ ]*)\b

Using this regular expression, the following code can extract the 2 specific key-value pairs out of the 5 total ones (i.e., only some predefined key-value pairs should be extracted).

  public static void main(String[] args) throws Exception {
    String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
    String regex = "\\b(TWO|FOUR)=([^ ]*)\\b";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
      System.out.println("\t" + matcher.group(1) + " = " + matcher.group(2));
    }
  }

More specifically, the main() method above prints

TWO = 2
FOUR = 4

but every time find() is invoked, the whole regular expression is evaluated for the part of the string remaining after the latest match, left to right.

Also, if the keys are not mutually distinct (or, if a regular expression with overlapping matches was used in the place of each key), there will be multiple matches. For instance, if the regex becomes

\b(O.*?|T.*?)=([^ ]*)\b

the above method yields

ONE = 1
TWO = 2
THREE = 3

If the regex was not fully re-evaluated but each alternative part was somehow examined once (or, if an appropriately modified regex was used), the output would have been

ONE = 1
TWO = 2

So, two questions:

Is there a more efficient way of extracting a selected set of unique keys and their values, compared to the original regular expression?
Is there a regular expression that can match every alternative part of the OR (|) sub-expression exactly once and not evaluate it again?

Just to understand where you're coming from... Is this question out of intellectual curiosity, or are you actually worried that evaluating each branch of the OR is too slow for you? — zx81, Jul 19 at 4:01
All of that and more, with a real problem as the starting point. Namely, extracting a predefined set of keys (and their values) from long strings that have numerous other key-value pairs that should be ignored an interesting and not rare use case. — PNS, Jul 19 at 4:03

zx81 · Accepted Answer · 2014-07-19 04:29:32Z

up vote 2 down vote accepted

Java Returns a Match Position: You can Use Dynamically-Generated Regex on Remaining Substrings

With the understanding that it can be generalized to a more complex and useful scenario, let's take a variation on your first example: \b(TWO|FOUR|SEVEN)=([^ ]*)\b

You can use it like this:

Pattern regex = Pattern.compile("\\b(TWO|FOUR|SEVEN)=([^ ]*)\\b");
Matcher regexMatcher = regex.matcher(yourString);
if (regexMatcher.find()) {
    String theMatch = regexMatcher.group();
    String FoundToken =  = regexMatcher.group(1);
    String EndPosition = regexMatcher.end();
}

You could then:

Test the value contained by FoundToken
Depending on that value, dynamically generate a regex testing for the remaining possible tokens. For instance, if you found FOUR, your new regex would be \\b(TWO|SEVEN)=([^ ]*)\\b
Using EndPosition, apply that regex to the end of the string.

Discussion

This approach would serve your goal of not re-evaluating parts of the OR that have already matched.
It also serves your goal of avoiding duplicates.
Would that be faster? Not in this simple case. But you said you are dealing with a real problem, and it will be a valid approach in some cases.

edited Jul 19 at 4:29

answered Jul 19 at 4:14

zx81
22.4k51751

Yes, regeneration of the regular expression for the remainder of the parsed string should work, as you suggest. Please edit to make the code compilable (foundToken and endPosition need to be declared somewhere). It would be a bit more exciting if there was a way of doing this in "one pass", but +1 for the contribution anyway. Thanks. :-) – PNS Jul 19 at 4:22

Yeah, I had three implicit String something = null; off-screen, outside the code block. :) Just added String to the three results. – zx81 Jul 19 at 4:25

Sure. Just being pedantic. :-) – PNS Jul 19 at 4:30

FYI: Changed the TWO|FOUR example to TWO|FOUR|SEVEN so the dynamic nature of the regexes used is clearer to a later reader. Also expanded discussion of the approach. – zx81 Jul 19 at 4:30

Read it, thanks. – PNS Jul 19 at 4:31

| show 5 more comments

asked	20 days ago
viewed	63 times
active	16 days ago

current community

your communities

more stack exchange communities

Java regex for matching multiple keys in a string

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged java regex pattern-matching matcher or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Java regex for matching multiple keys in a string

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged java regex pattern-matching matcher or ask your own question.

Related

Hot Network Questions