
I am creating a syntax highlighter, and I am using String.split to create tokens from an input string. The first issue is that String.split creates a large number of empty strings, which makes everything noticeably slower than it could otherwise be.

For example, "***".split(/(\*)/) -> ["", "*", "", "*", "", "*", ""]. Is there a way to avoid this?

Another issue is the expression precedence in the regular expression itself. Let's say I am trying to parse a C style multi-line comment. That is, /* comment */. Now let's assume the input string is "/****/". If I were to use the following regular expression, it would work, but produce a lot of extra tokens (and all those empty strings!).

/(\/\*|\*\/|\*)/

A better way is to read /*'s, */'s and then read all the rest of the *'s in one token. That is, the better result for the above string is ["/*", "**", "*/"]. However, when using the regular expression that should do this, I get bad results. The regular expression is like so: /(\/\*|\*\/|\*+)/.

However, the result of this expression is ["/*", "***", "/"]. I am guessing this is because the last alternative is greedy, so it consumes the * that the */ alternative needs.

The only solution I found was to add a negative lookahead, like this:

/(\/\*|\*\/|\*+(?!\/))/

This gives the expected result, but it is much slower than the previous expression, and the difference is noticeable on big strings.
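
For reference, here is a small sketch that runs the three expressions above against "/****/" and skips the empty strings, so the difference in tokens is easier to see. The helper name `tokens` is made up for illustration, and the third pattern is written with its capture group closed.

```javascript
// Hypothetical helper: split on a capturing pattern and drop empty strings.
function tokens(input, pattern) {
  return input.split(pattern).filter(function (s) { return s !== ""; });
}

const input = "/****/";

// One * per token: correct, but many small tokens.
const single = tokens(input, /(\/\*|\*\/|\*)/);

// Greedy \*+ takes the * that belongs to the closing */.
const greedy = tokens(input, /(\/\*|\*\/|\*+)/);

// The lookahead makes \*+ give back the last * when a / follows it.
const guarded = tokens(input, /(\/\*|\*\/|\*+(?!\/))/);

console.log(single);  // ["/*", "*", "*", "*/"]
console.log(greedy);  // ["/*", "***", "/"]
console.log(guarded); // ["/*", "**", "*/"]
```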

Is there a solution for either of these problems?

"****".split().join().split(''); I know this probably is not what you need, but it seems to work. –  james emanon Nov 12 '13 at 1:49

2 Answers

Generally for tokenizing you use match, not split:

> str = "/****/"
"/****/"
> str.match(/(\/\*)(.*?)(\*\/)/)
["/****/", "/*", "**", "*/"]

Also note how the non-greedy modifier ? solves the second problem.
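
A rough sketch of how this could extend to a whole input, assuming a comment should come out as a single token: with the g flag, match returns every match and never produces empty strings. Note that . does not match newlines, so [\s\S]*? is used here for comments that span lines. The names TOKEN and tokenize, and the choice of the other alternatives, are illustrative only.

```javascript
// Sketch: one alternation, most specific alternative first.
//   \/\*[\s\S]*?\*\/   a whole comment, matched lazily, across newlines
//   [*\/]              a stray * or / outside a comment
//   [^*\/]+            a run of anything else
const TOKEN = /\/\*[\s\S]*?\*\/|[*\/]|[^*\/]+/g;

function tokenize(source) {
  // With the g flag, match returns all matches (or null if there are none).
  return source.match(TOKEN) || [];
}

console.log(tokenize("/****/"));
// ["/****/"]
console.log(tokenize("a /* x\n** */ b*c"));
// ["a ", "/* x\n** */", " b", "*", "c"]
```

An unterminated /* falls through to the second alternative, so it is emitted as separate "/" and "*" tokens instead of swallowing the rest of the input.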

I am tokenizing gigantic strings; the C style multi-line comment was just one example. How would that regex fit with the rest (the rest being {}()[]+-/&^=! and so on)? In the meantime I just stopped grabbing all *'s with the greedy modifier, while the rest of the tokens do use it. The worst case is when there are many *'s, but since there are usually not many of them, I guess I can live with that. For the record, this is currently my regex: /[ \t]+|\[+|\]+|\{+|\}+|\(+|\)+|\++|\-+|\*|\/+|<+|>+|&+|\|+|=+|!+|\\/ –  user2503048 Nov 12 '13 at 1:00
    
@user2503048: well, there are quite a few syntax highlighters in javascript, maybe you can start by studying their source code? Your regexp looks sub-optimal. –  georg Nov 12 '13 at 1:23

Use lookahead to avoid empty matches:

arr = "***".split(/(?=\*)/);
//=> ["*", "*", "*"]

OR use filter(Boolean) to discard empty matches:

arr = "***".split(/(\*)/).filter(Boolean);
//=> ["*", "*", "*"]
The lookahead sadly doesn't let me get multiple things in the same token, e.g. "***" as one token in the above example. I used filter() previously, but the additional array traversal actually made the whole code slower than it is when doing one iteration while skipping the empty strings. –  user2503048 Nov 11 '13 at 23:54
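
One possible way to avoid both the empty strings and the extra traversal is a single exec loop that collects the matches and the text between them. This is only a sketch, not something proposed in the thread; the function name tokenize is hypothetical and the pattern passed in is assumed to carry the g flag.

```javascript
// Sketch: single pass over the input, no empty strings are ever created.
// Assumes `pattern` has the g flag so exec advances through the input.
function tokenize(input, pattern) {
  const tokens = [];
  let last = 0; // end of the previous match
  let m;
  pattern.lastIndex = 0;
  while ((m = pattern.exec(input)) !== null) {
    if (m[0] === "") {
      // Zero-length match: step forward so the loop cannot stall.
      pattern.lastIndex++;
      continue;
    }
    if (m.index > last) {
      tokens.push(input.slice(last, m.index)); // text between two matches
    }
    tokens.push(m[0]);
    last = pattern.lastIndex;
  }
  if (last < input.length) {
    tokens.push(input.slice(last)); // trailing text after the final match
  }
  return tokens;
}

console.log(tokenize("a***b", /\*+/g));                   // ["a", "***", "b"]
console.log(tokenize("/****/", /\/\*|\*\/|\*+(?!\/)/g));  // ["/*", "**", "*/"]
```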
    
Anu -- sorry to bother you with this, but can you explain, or point me to a resource that explains, the .filter(Boolean) concept? Are the empty strings treated as false, etc.? I'm slightly confused by your usage. –  james emanon Nov 12 '13 at 1:31
Empty strings evaluate as false in boolean expressions. –  user2503048 Nov 12 '13 at 1:34
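
A short sketch of what that means for filter(Boolean): filter passes each element to Boolean and keeps the ones for which it returns true. The variable names are only for illustration.

```javascript
const parts = "***".split(/(\*)/);
// parts is ["", "*", "", "*", "", "*", ""]

// Boolean("") is false and Boolean("*") is true, so empty strings are dropped.
const viaBoolean = parts.filter(Boolean);

// Equivalent explicit callback for an array of strings.
const viaCallback = parts.filter(function (s) { return s.length > 0; });

console.log(Boolean(""), Boolean("*")); // false true
console.log(viaBoolean);                // ["*", "*", "*"]
console.log(viaCallback);               // ["*", "*", "*"]
```

On arrays that hold other types, filter(Boolean) would also drop falsy values such as 0, null and undefined, so the two forms are only interchangeable for strings.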
    
@user2503048: Can you provide some example inputs and expected matches in your question? I am sure lookahead can still be used with those captures. –  anubhava Nov 12 '13 at 6:29
