
I know JavaScript regular expressions have native lookaheads but not lookbehinds.

I want to split a string at points either beginning with any member of one set of characters or ending with any member of another set of characters.

Split before ໃ, ໄ, ໂ, ເ, ແ. Split after ະ.

In: ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ

Out: ເລື້ອຍໆມະ ຫັດສະ ຈັນ ເອກອັກຄະ ລັດຖະ ທູດ

I can achieve the "split before" part using zero-width lookahead:

'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ'.split(/(?=[ໃໄໂເແ])/)

["ເລື້ອຍໆມະຫັດສະຈັນ", "ເອກອັກຄະລັດຖະທູດ"]

But I can't think of a general approach to simulating zero-width lookbehind.

I'm splitting strings of arbitrary Unicode text, so I don't want to substitute in special markers in a first pass: I can't guarantee that any given marker string is absent from my input.


3 Answers

Accepted answer (score: 2)

Instead of splitting, you may consider using the match() method.

var s = 'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ',
    r = s.match(/(?:(?!ະ).)+?(?:ະ|(?=[ໃໄໂເແ]|$))/g);

console.log(r); //=> [ 'ເລື້ອຍໆມະ', 'ຫັດສະ', 'ຈັນ', 'ເອກອັກຄະ', 'ລັດຖະ', 'ທູດ' ]
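To make the pattern easier to read, here is a minimal sketch that assembles the same regex from its parts. The helper name `chunk` and its parameters are hypothetical, not part of the original answer, and it assumes neither argument contains regex metacharacters.

```javascript
// Hypothetical helper (illustration only).
// `before`: characters to split before, used as a character-class body.
// `after`:  the single character to split after.
// Assumption: neither argument contains regex metacharacters.
function chunk(str, before, after) {
  var re = new RegExp(
    '(?:(?!' + after + ').)+?' +   // lazily consume characters other than `after`
    '(?:' + after +                // end the piece right after `after` ...
    '|(?=[' + before + ']|$))',    // ... or just before a `before` character or the end
    'g'
  );
  return str.match(re) || [];      // match() returns null when nothing matches
}

var chunks = chunk('ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ', 'ໃໄໂເແ', 'ະ');
console.log(chunks);
```

One caveat with any match-based approach: text the pattern cannot match is silently dropped. Here `.` does not match line terminators, so input containing newlines would lose them.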
Much cleaner than my answer. +1. –  Mark Reed Aug 29 '14 at 3:08

You could try matching rather than splitting:

> var re = /((?:(?!ະ).)+(?:ະ|$))/g;
undefined
> var str = "ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ"
undefined
> var m;
undefined
> while ((m = re.exec(str)) != null) {
... console.log(m[1]);
... }
ເລື້ອຍໆມະ
ຫັດສະ
ຈັນເອກອັກຄະ
ລັດຖະ
ທູດ

Then split each element of the resulting array again, this time using the lookahead.
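A minimal sketch of that second pass, folded into the same exec() loop. The variable name `pieces` is an assumption added for illustration.

```javascript
var re = /((?:(?!ະ).)+(?:ະ|$))/g;
var str = 'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ';
var pieces = [];
var m;

while ((m = re.exec(str)) !== null) {
  // First pass: m[1] is a run of text ending in ະ (or at the end of the string).
  // Second pass: split that run before any of the "split before" characters.
  pieces = pieces.concat(m[1].split(/(?=[ໃໄໂເແ])/));
}

console.log(pieces);
```

With the input from the question this should yield the six pieces listed under "Out".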

    
This doesn't give the correct results... –  hwnd Aug 29 '14 at 3:03
    
RegExp.prototype.exec is inconsistently implemented across some browsers on the web; String.prototype.match is generally preferred. –  aduch Aug 29 '14 at 3:04

@hwnd May I know in which cases the above regex would fail? –  Avinash Raj Aug 29 '14 at 3:05

If you use parentheses in the delimiter regex, the captured text is included in the returned array. So you can just split on /(ະ)/ and then append each odd-indexed member of the resulting array to the preceding even-indexed member. Example:

"ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ".split(/(ະ)/).reduce(function(arr,str,index) {
   if (index%2 == 0) {
     arr.push(str);             // even index: text between delimiters starts a new piece
   } else {
     arr[arr.length-1] += str;  // odd index: captured delimiter, appended to the previous piece
   }
   return arr;
 },[])

Result: ["ເລື້ອຍໆມະ", "ຫັດສະ", "ຈັນເອກອັກຄະ", "ລັດຖະ", "ທູດ"]

You can do another pass to split on the lookahead:

"ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ".split(/(ະ)/).reduce(function(arr,str,index) {
   if (index%2 == 0) {
     arr.push(str);             // even index: text between delimiters starts a new piece
   } else {
     arr[arr.length-1] += str;  // odd index: captured delimiter, appended to the previous piece
   }
   return arr;
 },[]).reduce(function(arr,str){return arr.concat(str.split(/(?=[ໃໄໂເແ])/));},[]);

Result: ["ເລື້ອຍໆມະ", "ຫັດສະ", "ຈັນ", "ເອກອັກຄະ", "ລັດຖະ", "ທູດ"]
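The reduce step does not depend on the delimiter's length, so it can be wrapped in a small function and reused with an alternation. This is only a sketch: the name `splitAfter` is hypothetical, and it assumes the regex has exactly one capturing group wrapping the whole delimiter.

```javascript
// Hypothetical helper (illustration only): split `str` after each delimiter match.
// Assumption: `delimiter` has exactly one capturing group around the whole pattern,
// so split() returns text and delimiters at alternating indices.
function splitAfter(str, delimiter) {
  return str.split(delimiter).reduce(function (arr, piece, index) {
    if (index % 2 === 0) {
      arr.push(piece);               // text between delimiters
    } else {
      arr[arr.length - 1] += piece;  // delimiter, appended to the preceding text
    }
    return arr;
  }, []);
}

// Delimiters of different lengths: "--" or "-".
var parts = splitAfter('a--b-c', /(--|-)/);
console.log(parts); //=> [ 'a--', 'b-', 'c' ]
```

If the input ends with a delimiter, split() yields a trailing empty string, so the last piece will be empty and may need filtering out.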

    
In a first pass before then doing a pass on the lookahead part? That's what I'm playing with right now (-: ... –  hippietrail Aug 29 '14 at 2:58
    
Yes. I added a working second pass to my answer. –  Mark Reed Aug 29 '14 at 3:07
    
There's one way in which this solution isn't general. If the "end" pattern can be a varying number of characters. This doesn't happen in my current iteration but may do so in the future, and more general solutions are more betterer (-: ... Then again I did specify "character" in my question. –  hippietrail Aug 29 '14 at 3:07
    
I don't see how that would matter. The split pattern could just as easily be an alternation ... whatever the actual delimiter is in each case, it will still be included and appended to the previous string. –  Mark Reed Aug 29 '14 at 3:09
    
You're right. I thought I saw something hard-coded on the length of ະ but not. –  hippietrail Aug 29 '14 at 3:15
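A note for later readers: ES2018 added lookbehind assertions to JavaScript regular expressions. On an engine that supports them (an assumption about your runtime; this was not an option when the answers above were written), both conditions should fit in a single split(). A minimal sketch:

```javascript
// Assumes an engine with ES2018 lookbehind support.
var input = 'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ';

// (?<=ະ)        zero-width position right after ະ
// (?=[ໃໄໂເແ])   zero-width position right before one of the leading vowels
var out = input.split(/(?<=ະ)|(?=[ໃໄໂເແ])/);

console.log(out);
```

Both assertions are zero-width, so no characters are consumed and no text is lost.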
