I have a huge list of regexes (>1,000 but <1,000,000) that I want to test against (many) single strings.
It is unlikely and unintended that more than one such expression would match a single string. I could just maintain a big list of individually compiled regexes and iterate over it for every input string. However, I have it in my head that I should be handing the problem over to the regex compiler to simplify the common substrings, since it can (at least theoretically) produce a very neat single DFA.
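For comparison, the naive baseline I'd be replacing might look like this (the rule data and the `first_match` helper name are made up for illustration):

```python
import re

# Naive baseline: one compiled pattern per rule, scanned in order.
# Each rule pairs a pattern with the message to report on a hit.
rules = [("foobar", "Hit a foobar"), ("baz", "Hit a baz")]
compiled = [(re.compile(pattern), message) for pattern, message in rules]

def first_match(s):
    # O(number of rules) separate regex scans per input string.
    for regex, message in compiled:
        hit = regex.search(s)
        if hit:
            return message, hit.group(0)
    return None
```
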
import re
import uuid

class multiregex(object):
    def __init__(self, rules):
        merge = []
        self._messages = {}
        for regex, text in rules:
            # Generate a unique, valid group name for each rule.
            name = "g" + str(uuid.uuid4()).replace('-', '')
            merge.append("(?P<%s>%s)" % (name, regex))
            self._messages[name] = text
        self._re = re.compile('|'.join(merge))

    def __call__(self, s):
        result = self._re.search(s)
        if result:
            groups = result.groupdict()
            # Return the first (and only expected) named group that matched.
            return next((self._messages[x], groups[x])
                        for x in groups if groups[x])

rules = [("foobar", "Hit a foobar"),
         ("f.*b.*r", "fbr"),
         ("foob.z", "Frobination"),
         ("baz", "Hit a baz"),
         ("b(ingo)?", "b with optional ingo")]

m = multiregex(rules)
tests = ["foobar", "foobaz", "foobazr", "b", "bingo"]
for text, hit in (m(x) for x in tests):
    print "Message: '%s' (because of '%s')" % (text, hit)
The code above works, but I have a few outstanding questions about it:
- Is it needlessly overcomplicating the whole thing, or is it pushing the problem off to code that's heavily researched and optimised?
- Is there a neater way of finding just the named capture group that matched than what I've done with `groupdict()`? Are there any more gotchas than the obvious one of two 'rules' each containing the same group name? e.g.:
rules = [("(?P<hello>foobar)", "Hit a foobar"), ("(?P<hello>foob.z)", "Frobination")]
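(On the first point, one possibility is `Match.lastgroup`, which gives the name of the last group that matched; for a flat alternation with one named group per rule, that is the rule that hit. A sketch with made-up patterns, assuming no rule contains nested named groups of its own, which could show up in `lastgroup` instead:)

```python
import re

# Flat alternation, one named group per branch (hypothetical rules).
pattern = re.compile("(?P<foo>foobar)|(?P<baz>baz)")

m = pattern.search("xx baz yy")
# lastgroup names the matching branch directly, no groupdict() scan.
assert m.lastgroup == "baz"
assert m.group(m.lastgroup) == "baz"

# The duplicate-name gotcha is at least caught at compile time:
# redefining a group name raises re.error.
try:
    re.compile("(?P<hello>foobar)|(?P<hello>foob.z)")
except re.error:
    duplicate_rejected = True
assert duplicate_rejected
```
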
(The issue of a single syntax error in one 'rule' killing the whole thing is easy enough to work around by validating the inputs at rule creation time.)
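That validation might be sketched as compiling each rule on its own and collecting the failures, e.g. (the `validate_rules` name and its return shape are my own invention):

```python
import re

def validate_rules(rules):
    # Compile each pattern individually so one bad rule is reported
    # (and can be skipped) instead of breaking the combined regex.
    good, bad = [], []
    for pattern, message in rules:
        try:
            re.compile(pattern)
        except re.error as exc:
            bad.append((pattern, str(exc)))
        else:
            good.append((pattern, message))
    return good, bad
```
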