JavaScript: avoiding empty strings with String.split, and regular expression precedence

Question

I am creating a syntax highlighter, and I am using String.split to create tokens from an input string. The first issue is that String.split produces a huge number of empty strings, which makes everything considerably slower than it needs to be.

For example, "***".split(/(\*)/) -> ["", "*", "", "*", "", "*", ""]. Is there a way to avoid this?

Another issue is the expression precedence in the regular expression itself. Let's say I am trying to parse a C style multi-line comment. That is, /* comment */. Now let's assume the input string is "/****/". If I were to use the following regular expression, it would work, but produce a lot of extra tokens (and all those empty strings!).

/(\/\*|\*\/|\*)/

A better way is to read /*'s, */'s and then read all the rest of the *'s in one token. That is, the better result for the above string is ["/*", "**", "*/"]. However, when using the regular expression that should do this, I get bad results. The regular expression is like so: /(\/\*|\*\/|\*+)/.

However, the result of this expression is ["/*", "***", "/"]. I am guessing this is because the last alternative is greedy, so it consumes the * that belongs to the closing */ before the other alternative gets a chance to match it.

The only solution I found was to add a negative lookahead, like this:

/(\/\*|\*\/|\*+(?!\/))/

This gives the expected result, but it is very slow compared to the other one, which matters for big strings.
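
For reference, the three patterns discussed above can be compared side by side on the same input. This is only a sketch: the names patterns, tokens and results are made up for the illustration, and the empty strings that split produces are dropped so that only the tokens themselves are shown.

```javascript
// Compare the three patterns from the question on the same input.
const input = "/****/";

const patterns = {
  single: /(\/\*|\*\/|\*)/,           // one * per token
  greedy: /(\/\*|\*\/|\*+)/,          // greedy run of *
  lookahead: /(\/\*|\*\/|\*+(?!\/))/, // run of * that may not end right before a /
};

// Split on the pattern, then drop the empty strings so only real tokens remain.
function tokens(re, text) {
  return text.split(re).filter((t) => t !== "");
}

const results = {};
for (const [name, re] of Object.entries(patterns)) {
  results[name] = tokens(re, input);
  console.log(name, JSON.stringify(results[name]));
}
// single    ["/*","*","*","*/"]
// greedy    ["/*","***","/"]
// lookahead ["/*","**","*/"]
```

In the lookahead version the run of asterisks first grabs all three, fails the lookahead because a / follows, and backtracks by one, which leaves the last * for the */ alternative.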

Is there a solution for either of these problems?

"****".split().join().split(''); I know this is probably not what you need, but it seems to work. – james emanon, Nov 12 '13 at 1:49

georg · Answer 1 · 2013-11-12 00:38:21Z

Generally for tokenizing you use match, not split:

> str = "/****/"
"/****/"
> str.match(/(\/\*)(.*?)(\*\/)/)
["/****/", "/*", "**", "*/"]

Also note how the non-greedy modifier ? solves the second problem.
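
To extend this to a whole input rather than a single comment, one option is match with the g flag, which returns every token in order and never produces empty strings. The following is a minimal sketch, not a complete highlighter: the TOKEN pattern and the tokenize function are assumptions made up for the illustration, and the pattern borrows the lookahead from the question to keep the last * for the closing */.

```javascript
// Hypothetical token pattern. Alternatives are tried left to right at each
// position, so the two-character comment delimiters come before the
// single-character fallbacks.
const TOKEN = /\/\*|\*\/|\*+(?!\/)|[^*\/]+|\//g;

function tokenize(text) {
  // With the g flag, match returns an array of all matches,
  // or null when nothing matches (for example on an empty string).
  return text.match(TOKEN) || [];
}

console.log(tokenize("/****/"));        // ["/*", "**", "*/"]
console.log(tokenize("/* a*b */ x/y")); // ["/*", " a", "*", "b ", "*/", " x", "/", "y"]
```

Every character is covered by some alternative, so joining the tokens gives back the original input.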

I am tokenizing gigantic strings, and the C style multi-line comment was just one example. How would that regex fit with the rest (the rest being {}()[]+-/&^=! and so on)? In the meantime I just stopped grabbing runs of *'s with the greedy quantifier, while the rest of the tokens still use it. The worst case is an input with many *'s, but since there are usually not many of them, I guess I can live with that. For the record, this is currently my regex: /[ \t]+|\[+|\]+|\{+|\}+|\(+|\)+|\++|\-+|\*|\/+|<+|>+|&+|\|+|=+|!+|\\/ – user2503048 Nov 12 '13 at 1:00

@user2503048: well, there are quite a few syntax highlighters in javascript, maybe you can start by studying their source code? Your regexp looks sub-optimal. – georg Nov 12 '13 at 1:23

anubhava · Answer 2 · 2013-11-11 23:42:29Z

Use a lookahead to avoid empty matches:

arr = "***".split(/(?=\*)/);
//=> ["*", "*", "*"]

OR use filter(Boolean) to discard empty matches:

arr = "***".split(/(\*)/).filter(Boolean);
//=> ["*", "*", "*"]
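
A short sketch of why both variants work, and where the lookahead version stops short. The variable names are made up for the illustration. filter(Boolean) calls Boolean on each element, and Boolean("") is false, so the empty strings are dropped. The lookahead matches a zero-width position before each *, so the string is cut there and nothing is consumed.

```javascript
// split with a capture group keeps the separators but also yields empty strings.
const parts = "***".split(/(\*)/);
// parts is ["", "*", "", "*", "", "*", ""]

// Boolean("") is false and Boolean("*") is true, so only "*" survives.
const kept = parts.filter(Boolean);

// A zero-width lookahead cuts before every *, without consuming anything.
const viaLookahead = "***".split(/(?=\*)/);

// Because the cut happens before every *, a run of * is never kept together,
// and text following the last * stays attached to it.
const mixed = "a**b".split(/(?=\*)/);

console.log(kept);         // ["*", "*", "*"]
console.log(viaLookahead); // ["*", "*", "*"]
console.log(mixed);        // ["a", "*", "*b"]
```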

Sadly, the lookahead doesn't let me get multiple characters into the same token, e.g. "***" as one token in the above example. I used filter() previously, but the additional array traversal actually made the whole code slower than doing a single iteration that skips the empty strings. – user2503048 Nov 11 '13 at 23:54

Anu, sorry to bother you with this, but can you explain, or point me to a resource that explains, the .filter(Boolean) concept? Are the empty strings treated as false? I'm slightly confused by your usage. – james emanon Nov 12 '13 at 1:31

Empty strings evaluate as false in boolean expressions. – user2503048 Nov 12 '13 at 1:34

@user2503048: Can you provide some example inputs and expected matches in your question. I am sure lookahead can still be used with those captures. – anubhava Nov 12 '13 at 6:29
