Take the 2-minute tour ×
Stack Overflow is a question and answer site for professional and enthusiast programmers. It's 100% free, no registration required.

I was reading this article today on two different regular expression algorithms.

According to the article old Unix tools like ed, sed, grep, egrep, awk, and lex, all use what's called the Thompson NFA algorithm in their regular expresssions...

However newer tools like Java, Perl, PHP, and Python all use a different algorithm for their regular expressions that are much, much slower.

This article makes no mention at all of Javascript's regex algorthim, (and yes I know there are various JS engines out there) but I was wondering if anybody knew which of those algorithms they use, and if maybe those algorithms should be swapped out for Thompson NFA.

share|improve this question

4 Answers 4

up vote 7 down vote accepted

The Javascript ECMA language description doesn't impose a requirement for the particular implementation of regular expressions, so that part of the question isn't well-formed. You're really wondering about the particular implementation in a particular browser.

The reason Perl/Python etc use a slower algorithm, though, is that the regex language defined isn't really regular expressions. A real regular expression can be expressed as a finite state machine, but the language of regex is context free. That's why the fashion is to just call it "regex" instead of talking about regular expressions.

Update

Yes, in fact javascript regex isn't content free regular. Consider the syntax using `{n,m}', that is, matches from n to m accepted regexs. Let d the difference d=|n-m|. The syntax means there exists a string uxdw that is acceptable, but a string uxk>dw that is not. It follows via the pumping lemma for regular languages that this is not a regular language.

(augh. Thinko corrected.)

share|improve this answer
1  
Actually, they're more than full regular expressions. "{n,m}", for example, can't be represented in an FSA for arbitrary n,m. –  Charlie Martin Apr 7 '09 at 21:51
1  
@Charlie Well done Charlie! Wow. –  leeand00 Apr 8 '09 at 3:39
1  
Your "update" about {n,m} is wrong. x{3,5} can be written as xxx|xxxx|xxxxx which is perfectly regular and handled perfectly well with a DFA engine. –  Jan Goyvaerts Apr 8 '09 at 9:41
3  
The unbounded x{3,} can be rewritten as xxxx* which is regular and can be implemented with a DFA with 4 states. Try it at osteele.com/tools/reanimator –  Jan Goyvaerts Apr 10 '09 at 6:38
1  
I think that the demonstration is bogus. Since n and m are fixed when the regex is defined, you always can find an integer p > 0 (that only depends on the regex) such that every string w of length at least p and that matches the regex, can be written as w = xyz, |y| > 0, |xy| ≤ p, for all i ≥ 0, xy{i}z matches the regex. In my view, it's the x(?=y) syntax that is not regular. –  boumbh Feb 5 at 9:24

Looking at Wikipedia I found out it's "Traditional NFA". A simple question deserves simple answer :)

share|improve this answer
    
Hey that's neat. thanks! –  leeand00 Apr 7 '09 at 21:08
    
That doesn't make it true. I don't think the javascript language is really finite state. –  Charlie Martin Apr 7 '09 at 21:09
    
Digging for truth or searching for answers? What is this site all about? –  Peter Perháč Apr 7 '09 at 21:11
    
@Charlie @MasterPeter Hmm...maybe that's why it isn't on the list of programs on Wikipedia...(and also why the list lacks browsers...) –  leeand00 Apr 7 '09 at 21:17
    
In the mean time, had a look at the syntax. It's definitely not finite state. –  Charlie Martin Apr 7 '09 at 21:53

Perl uses a memoized recursive backtracking search and, as of some improvements in 5.10, no longer blows up on perl -e '("a" x 100000) =~ /^(ab?)*$/;'. In recent tests I performed on an OS X box, Perl 5.10 outperformed awk, even in the cases where awk's algorithm was supposed to be better.

share|improve this answer
    
So that article is most likely outta date then... –  leeand00 Apr 7 '09 at 21:08

Though the ECMA standard does not specify the algorithm an ECMAScript implementation should use, the fact that the standard mandates that ECMAScript regular expressions must support backreferences (\1, \2, etc.) rules out the DFA and "Thompson NFA" implementations.

share|improve this answer
    
Backreferences rule out DFA (deterministic finite automaton), but there are other ways to solve the problem (e.g. recursive backtracking). Perl uses memoized backtracking recursion which removes a lot of the downsides to recursive backtracking (still eats a lot of memory on certain patterns though). –  Chas. Owens Apr 8 '09 at 11:56

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.