
Given a String such as

String input = "one two three four five six seven";

Is there a regex that works with String.split() to grab (up to) two words at a time, such that:

String[] pairs = input.split("some regex");
System.out.println(Arrays.toString(pairs));

results in this:

[one two, three four, five six, seven]

Note: This question is about the split regex. It is not about "finding a work-around" or other "making it work another way" solutions.

5  
Why? Is this a puzzle or a real problem? – Stephen C May 10 at 15:40
2  
It's a puzzle... but it interested me enough to ask it, because look-behinds must be bounded in length, so it seems like a non-trivial problem. – Bohemian May 10 at 15:42
3  
+1. Very interesting question. – Maroun Maroun May 10 at 15:46
3  
Java's look-behind is one of the strangest beasts. In .NET, you can freely look behind for variable length. In PCRE, you can only look behind for fixed length. In Java, due to a bug/feature in the implementation of + and *, you can sometimes match a variable-length pattern: stackoverflow.com/questions/1536915/… – nhahtdh May 12 at 9:37

5 Answers

up vote 44 down vote accepted

Is this what you are looking for?
(You can replace \\w with \\S to include all non-space characters, but for this example I will keep \\w, since the regex is easier to analyze with \\w\\s than with \\S\\s.)

String input = "one two three four five six seven";
String[] pairs = input.split("(?<!\\G\\w+)\\s");
System.out.println(Arrays.toString(pairs));

output:

[one two, three four, five six, seven]

\G matches the position where the previous match ended, and (?<!regex) is a negative look-behind.

In split we are trying to

  1. find spaces -> \\s
  2. that are not preceded -> (?<!negativeLookBehind)
  3. by some word -> \\w+
  4. which starts right after the previously matched (space) -> \\G
  5. so the whole look-behind content is -> \\G\\w+.

The only thing that confused me at first was how this works for the first space, since we want that space to be ignored. The important point is that, before any match has been found, \\G matches the start of the String, like ^.

So before the first iteration, the regex in the negative look-behind behaves like (?<!^\\w+), and since the first space does have ^\\w+ before it, it can't be a match for split. The next space does not have this problem, so it is matched, and information about it (the position where it ends in the input String) is stored in \\G and used later in the next negative look-behind.

So for the 3rd space, the regex checks whether the previously matched space \\G followed by a word \\w+ sits right before it. Since this test succeeds, the negative look-behind rejects it and this space is not matched. The 4th space doesn't have this problem, because the space before it is not the one stored in \\G (it has a different position in the input String).
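To make the walk-through above easier to check, here is a minimal sketch that lists which space positions the regex accepts as delimiters. The class name PairSplitTrace and the helper delimiterIndexes are made up for this illustration; only the regex comes from the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex is taken from the answer.
final class PairSplitTrace {
    static final Pattern EVERY_SECOND_SPACE = Pattern.compile("(?<!\\G\\w+)\\s");

    // Collects the index of every space the regex accepts as a delimiter.
    static List<Integer> delimiterIndexes(String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = EVERY_SECOND_SPACE.matcher(input);
        while (m.find()) {
            indexes.add(m.start());
        }
        return indexes;
    }

    public static void main(String[] args) {
        String input = "one two three four five six seven";
        // Spaces sit at indexes 3, 7, 13, 18, 23 and 27; every second one is accepted.
        System.out.println(delimiterIndexes(input)); // [7, 18, 27]
        System.out.println(Arrays.toString(EVERY_SECOND_SPACE.split(input)));
        // [one two, three four, five six, seven]
    }
}
```

The spaces at 3, 13 and 23 are the ones that have \\G\\w+ directly before them, so the look-behind rejects them.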


Also, if someone would like to split on, let's say, every 3rd space, you can use this form (based on @maybeWeCouldStealAVan's answer, which was deleted when I posted this fragment of the answer):

input.split("(?<=\\G\\w{1,100}\\s\\w{1,100}\\s\\w{1,100})\\s")

Instead of 100 you can use any bigger value, as long as it is at least the length of the longest word in the String.
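As a rough sketch of how the same idea extends to any group size, the look-behind can be assembled in a loop. The class GroupSplitRegex, the method everyNthSpaceRegex and its parameters are hypothetical names for this illustration; for 3 words and a bound of 100 it produces exactly the regex shown above.

```java
import java.util.Arrays;

// Hypothetical helper; builds the bounded look-behind form for N words per group.
final class GroupSplitRegex {
    static String everyNthSpaceRegex(int wordsPerGroup, int maxWordLength) {
        // One bounded "word", e.g. \w{1,100}
        String word = "\\w{1," + maxWordLength + "}";
        StringBuilder regex = new StringBuilder("(?<=\\G").append(word);
        // Each additional word in the group is preceded by one whitespace character.
        for (int i = 1; i < wordsPerGroup; i++) {
            regex.append("\\s").append(word);
        }
        return regex.append(")\\s").toString();
    }

    public static void main(String[] args) {
        String input = "one two three four five six seven";
        String regex = everyNthSpaceRegex(3, 100);
        System.out.println(Arrays.toString(input.split(regex)));
        // [one two three, four five six, seven]
    }
}
```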

1  
+1 Very nice!!! – Bohemian May 10 at 16:07
You should probably change \w+ to \S+ in case the notional "words" aren't in fact words. Also, could you add a detailed description/explanation of why this works? It's a great regex, it would be good to make sure everyone understands it thoroughly too. – Bohemian May 10 at 21:29
1  
As an incentive, if you add a good explanation within the next hour, I'll throw in a +50 bounty bonus! :) (actually, that's rubbish - I'm awarding the bounty anyway - you deserve it, because I for one learned something) – Bohemian May 10 at 22:03
@Bohemian Could you check if explanation in my updated answer is sufficient? – Pshemo May 10 at 22:12
That's a great explanation. You rock! – Bohemian May 10 at 23:04

This will work, but maximum word length needs to be set in advance:

String input = "one two three four five six seven eight nine ten eleven";
String[] pairs = input.split("(?<=\\G\\S{1,30}\\s\\S{1,30})\\s");
System.out.println(Arrays.toString(pairs));

I like Pshemo's answer better, being shorter and usable on arbitrary word lengths, but this (as @Pshemo pointed out) has the advantage of being adaptable to groups of more than 2 words.
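To illustrate why the maximum word length matters, here is a small sketch with a deliberately tiny bound of 5 instead of 30. The class name BoundedLookBehindLimit and the sample inputs are invented for this demonstration.

```java
import java.util.Arrays;

// Hypothetical demo; same regex shape as above, but with a bound of 5 characters.
final class BoundedLookBehindLimit {
    static final String REGEX = "(?<=\\G\\S{1,5}\\s\\S{1,5})\\s";

    static String[] splitPairs(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        // Every word fits within the bound, so pairing works.
        System.out.println(Arrays.toString(splitPairs("one two three four five")));
        // [one two, three four, five]

        // "extraordinary" is longer than 5 characters, so the look-behind can no
        // longer reach back to \G and no further space is accepted after it.
        System.out.println(Arrays.toString(splitPairs("one two extraordinary four five six")));
        // [one two, extraordinary four five six]
    }
}
```

Once a word exceeds the bound, \\G stays at the end of the last successful match, so the rest of the input comes back as one piece.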

I'm giving you a +1, but it doesn't answer the question of having arbitrarily long words. At least you got something working though. – Bohemian May 10 at 16:07
2  
+1 for answer that can be adapted easily to any number of words that should be grouped. – Pshemo May 10 at 19:41

You can try this:

[a-z]+\s[a-z]+

Updated:

([a-z]+\s[a-z]+)|[a-z]+


Updated:

 String pattern = "([a-z]+\\s[a-z]+)|[a-z]+";
 String input = "one two three four five six seven";

 Pattern splitter = Pattern.compile(pattern);
 String[] results = splitter.split(input);

 for (String pair : results) {
     System.out.println("Output = \"" + pair + "\"");
 }
Will this grab the seven not matched with a 2nd word pair? – Walls May 10 at 15:32
I updated, now it should match – Alex May 10 at 15:36
6  
This does not answer the question. Your regex matches the target content, but split() requires a regex to match the separators. Your regex does not work (with split()) – Bohemian May 10 at 15:41
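A quick sketch of the point made in the comment above: the same pattern behaves very differently under split() and under Matcher.find(). The class ContentRegexDemo and its method names are hypothetical; the pattern is the one from this answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing split() and find() for a content-matching regex.
final class ContentRegexDemo {
    static final Pattern CONTENT = Pattern.compile("([a-z]+\\s[a-z]+)|[a-z]+");

    // split() treats every match as a delimiter and returns what lies between them.
    static String[] viaSplit(String input) {
        return CONTENT.split(input);
    }

    // find() returns the matched content itself.
    static List<String> viaFind(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = CONTENT.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "one two three four five six seven";
        // Only the leftovers between matches survive: an empty string and three spaces.
        System.out.println(Arrays.toString(viaSplit(input))); // [,  ,  ,  ]
        System.out.println(viaFind(input)); // [one two, three four, five six, seven]
    }
}
```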

The regexp passed to split is used to find delimiters, not contents.

Just split the string into word tokens and append pairs.
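A minimal sketch of that approach, assuming words are separated by whitespace. The class PairJoiner and the method pairs are names invented here, and this is a plain work-around rather than a split regex.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: tokenize on whitespace, then glue the tokens back together in pairs.
final class PairJoiner {
    static List<String> pairs(String input) {
        List<String> result = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return result; // no words at all
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i += 2) {
            if (i + 1 < words.length) {
                result.add(words[i] + " " + words[i + 1]);
            } else {
                result.add(words[i]); // odd word left over at the end
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairs("one two three four five six seven"));
        // [one two, three four, five six, seven]
    }
}
```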

1  
That doesn't answer the question – Bohemian May 10 at 15:32

You can try this:

(\w+\s\w+)|\w+
This won't work. – Maroun Maroun May 10 at 15:48
1  
This does not answer the question. Your regex matches the target content, but split() requires a regex to match the separators. Your regex does not work with split() – Bohemian May 10 at 15:49
This has the same issue as Alex's answer - it matches the pattern not the space between patterns. It does not work with split. – Boris the Spider May 10 at 15:49
