Consider an input string like
Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5
and the regular expression
\b(TWO|FOUR)=([^ ]*)\b
Using this regular expression, the following code extracts 2 specific key-value pairs out of the 5 present in the input (i.e., only a predefined set of keys should be extracted).
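import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueExtractor { // class name is arbitrary
    public static void main(String[] args) {
        String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
        // Group 1 is the key (TWO or FOUR); group 2 is the value, up to the next space.
        String regex = "\\b(TWO|FOUR)=([^ ]*)\\b";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);
        // Each find() call scans forward for the next match.
        while (matcher.find()) {
            System.out.println("\t" + matcher.group(1) + " = " + matcher.group(2));
        }
    }
}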
More specifically, the main() method above prints
TWO = 2
FOUR = 4
However, every time find() is invoked, the whole regular expression is evaluated again, left to right, against the part of the string remaining after the latest match.
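To make the resume behaviour concrete, here is a small sketch that records where each match starts and ends. The class name FindOffsets and the method matchSpans are illustrative names, not part of the code above. It relies only on Matcher.start(), Matcher.end() and Matcher.group(), and shows that each find() call picks up scanning at the end offset of the previous match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: FindOffsets and matchSpans are hypothetical names.
class FindOffsets {

    // Returns one "start-end:text" entry per successful find() call.
    static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            // start() is inclusive and end() is exclusive; the next find()
            // resumes scanning at end().
            spans.add(matcher.start() + "-" + matcher.end() + ":" + matcher.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
        for (String span : matchSpans("\\b(TWO|FOUR)=([^ ]*)\\b", input)) {
            System.out.println(span);
        }
    }
}
```

Printing the spans for the original regex gives one entry for TWO=2 and one for FOUR=4, with the second match starting somewhere after the first one ends.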
Also, if the keys are not mutually distinct (or if a sub-expression that can match several different keys is used in place of each literal key), the same alternative can match more than once. For instance, if the regex becomes
\b(O.*?|T.*?)=([^ ]*)\b
the above method yields
ONE = 1
TWO = 2
THREE = 3
If the regex were not fully re-evaluated on each call, but each alternative were somehow tried only until its first match (or if an appropriately modified regex were used), the output would be
ONE = 1
TWO = 2
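The desired output can at least be emulated outside the regex engine. The following is only a sketch under stated assumptions, not a pure-regex solution: it gives each alternative its own capturing group, remembers in Java which alternatives have already produced a match, and stops once all of them have. The names OncePerAlternative and firstMatchPerAlternative are hypothetical, and the pattern is still evaluated in full on every find() call, so this filters results rather than avoiding re-evaluation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: OncePerAlternative and firstMatchPerAlternative are hypothetical names.
class OncePerAlternative {

    // Groups 1 and 2 are the two alternatives; group 3 is the value.
    private static final Pattern PATTERN =
            Pattern.compile("\\b(?:(O.*?)|(T.*?))=([^ ]*)\\b");
    private static final int ALTERNATIVES = 2;

    static List<String> firstMatchPerAlternative(String input) {
        List<String> pairs = new ArrayList<>();
        Set<Integer> used = new HashSet<>();
        Matcher matcher = PATTERN.matcher(input);
        // Stop searching as soon as every alternative has matched once.
        while (used.size() < ALTERNATIVES && matcher.find()) {
            for (int group = 1; group <= ALTERNATIVES; group++) {
                // group(n) is null when that alternative did not take part in the match;
                // used.add() returns false if this alternative has already matched.
                if (matcher.group(group) != null && used.add(group)) {
                    pairs.add(matcher.group(group) + " = " + matcher.group(ALTERNATIVES + 1));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
        firstMatchPerAlternative(input).forEach(System.out::println);
    }
}
```

With the sample input, the loop ends after ONE and TWO have been found, so THREE is never reported.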
So, two questions:
- Is there a more efficient way of extracting a selected set of unique keys and their values, compared to the original regular expression?
- Is there a regular expression that can match every alternative part of the OR (|) sub-expression exactly once and not evaluate it again?