Take the 2-minute tour ×
Stack Overflow is a question and answer site for professional and enthusiast programmers. It's 100% free, no registration required.

Consider an input string like

Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5

and the regular expression

\b(TWO|FOUR)=([^ ]*)\b

Using this regular expression, the following code can extract the 2 specific key-value pairs out of the 5 total ones (i.e., only some predefined key-value pairs should be extracted).

  public static void main(String[] args) throws Exception {
    String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
    String regex = "\\b(TWO|FOUR)=([^ ]*)\\b";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
      System.out.println("\t" + matcher.group(1) + " = " + matcher.group(2));
    }
  }

More specifically, the main() method above prints

TWO = 2
FOUR = 4

but every time find() is invoked, the whole regular expression is evaluated for the part of the string remaining after the latest match, left to right.

Also, if the keys are not mutually distinct (or, if a regular expression with overlapping matches was used in the place of each key), there will be multiple matches. For instance, if the regex becomes

\b(O.*?|T.*?)=([^ ]*)\b

the above method yields

ONE = 1
TWO = 2
THREE = 3

If the regex was not fully re-evaluated but each alternative part was somehow examined once (or, if an appropriately modified regex was used), the output would have been

ONE = 1
TWO = 2

So, two questions:

  1. Is there a more efficient way of extracting a selected set of unique keys and their values, compared to the original regular expression?
  2. Is there a regular expression that can match every alternative part of the OR (|) sub-expression exactly once and not evaluate it again?
share|improve this question
    
Just to understand where you're coming from... Is this question out of intellectual curiosity, or are you actually worried that evaluating each branch of the OR is too slow for you? –  zx81 Jul 19 at 4:01
    
All of that and more, with a real problem as the starting point. Namely, extracting a predefined set of keys (and their values) from long strings that have numerous other key-value pairs that should be ignored an interesting and not rare use case. –  PNS Jul 19 at 4:03

1 Answer 1

up vote 2 down vote accepted

Java Returns a Match Position: You can Use Dynamically-Generated Regex on Remaining Substrings

With the understanding that it can be generalized to a more complex and useful scenario, let's take a variation on your first example: \b(TWO|FOUR|SEVEN)=([^ ]*)\b

You can use it like this:

Pattern regex = Pattern.compile("\\b(TWO|FOUR|SEVEN)=([^ ]*)\\b");
Matcher regexMatcher = regex.matcher(yourString);
if (regexMatcher.find()) {
    String theMatch = regexMatcher.group();
    String FoundToken =  = regexMatcher.group(1);
    String EndPosition = regexMatcher.end();
} 

You could then:

  • Test the value contained by FoundToken
  • Depending on that value, dynamically generate a regex testing for the remaining possible tokens. For instance, if you found FOUR, your new regex would be \\b(TWO|SEVEN)=([^ ]*)\\b
  • Using EndPosition, apply that regex to the end of the string.

Discussion

  • This approach would serve your goal of not re-evaluating parts of the OR that have already matched.
  • It also serves your goal of avoiding duplicates.
  • Would that be faster? Not in this simple case. But you said you are dealing with a real problem, and it will be a valid approach in some cases.
share|improve this answer
    
Yes, regeneration of the regular expression for the remainder of the parsed string should work, as you suggest. Please edit to make the code compilable (foundToken and endPosition need to be declared somewhere). It would be a bit more exciting if there was a way of doing this in "one pass", but +1 for the contribution anyway. Thanks. :-) –  PNS Jul 19 at 4:22
    
Yeah, I had three implicit String something = null; off-screen, outside the code block. :) Just added String to the three results. –  zx81 Jul 19 at 4:25
    
Sure. Just being pedantic. :-) –  PNS Jul 19 at 4:30
    
FYI: Changed the TWO|FOUR example to TWO|FOUR|SEVEN so the dynamic nature of the regexes used is clearer to a later reader. Also expanded discussion of the approach. –  zx81 Jul 19 at 4:30
    
Read it, thanks. –  PNS Jul 19 at 4:31

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.