I have a huge list of regexes (>1,000 but <1,000,000) that I want to test against (many) single strings.
It is unlikely and unintended that more than one such expression would match a single string. I could just maintain a big list of individually compiled regexes and iterate over it for every input string. However, I have it in my head that I should be handing the problem over to the regex compiler to simplify the common substrings, since it can (at least theoretically) produce a very neat single DFA.
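For comparison, the naive baseline I'd be replacing might look like this (the rule data and the `first_match` helper name are made up for illustration):

```python
import re

# Naive baseline: one compiled pattern per rule, scanned in order.
# Each rule pairs a pattern with the message to report on a hit.
rules = [("foobar", "Hit a foobar"), ("baz", "Hit a baz")]
compiled = [(re.compile(pattern), message) for pattern, message in rules]

def first_match(s):
    # O(number of rules) separate regex scans per input string.
    for regex, message in compiled:
        hit = regex.search(s)
        if hit:
            return message, hit.group(0)
    return None
```
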
import re
import uuid

class multiregex(object):
    def __init__(self, rules):
        merge = []
        self._messages = {}
        for regex, text in rules:
            # Generate a unique, valid group name for each rule.
            name = "g" + str(uuid.uuid4()).replace('-', '')
            merge.append("(?P<%s>%s)" % (name, regex))
            self._messages[name] = text
        self._re = re.compile('|'.join(merge))

    def __call__(self, s):
        result = self._re.search(s)
        if result:
            groups = result.groupdict()
            # Return the first (and only expected) named group that matched.
            return next((self._messages[x], groups[x])
                        for x in groups if groups[x])

rules = [("foobar", "Hit a foobar"),
         ("f.*b.*r", "fbr"),
         ("foob.z", "Frobination"),
         ("baz", "Hit a baz"),
         ("b(ingo)?", "b with optional ingo")]

m = multiregex(rules)
tests = ["foobar", "foobaz", "foobazr", "b", "bingo"]
for text, hit in (m(x) for x in tests):
    print "Message: '%s' (because of '%s')" % (text, hit)
The code above works, but I have a few outstanding questions about it:
- Is it needlessly overcomplicating the whole thing, or is it pushing the problem off to code that's heavily researched and optimised?
- Is there a neater way of finding just the named capture group that matched than what I've done with `groupdict()`? Are there any more gotchas than the obvious one of two 'rules' each containing the same group name? e.g.:
rules = [("(?P<hello>foobar)", "Hit a foobar"), ("(?P<hello>foob.z)", "Frobination")]
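(On the first point, one possibility is `Match.lastgroup`, which gives the name of the last group that matched; for a flat alternation with one named group per rule, that is the rule that hit. A sketch with made-up patterns, assuming no rule contains nested named groups of its own, which could show up in `lastgroup` instead:)

```python
import re

# Flat alternation, one named group per branch (hypothetical rules).
pattern = re.compile("(?P<foo>foobar)|(?P<baz>baz)")

m = pattern.search("xx baz yy")
# lastgroup names the matching branch directly, no groupdict() scan.
assert m.lastgroup == "baz"
assert m.group(m.lastgroup) == "baz"

# The duplicate-name gotcha is at least caught at compile time:
# redefining a group name raises re.error.
try:
    re.compile("(?P<hello>foobar)|(?P<hello>foob.z)")
except re.error:
    duplicate_rejected = True
assert duplicate_rejected
```
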
(The issue of a single syntax error in one 'rule' killing the whole thing is easy enough to work around by validating the inputs at rule creation time.)
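That validation might be sketched as compiling each rule on its own and collecting the failures, e.g. (the `validate_rules` name and its return shape are my own invention):

```python
import re

def validate_rules(rules):
    # Compile each pattern individually so one bad rule is reported
    # (and can be skipped) instead of breaking the combined regex.
    good, bad = [], []
    for pattern, message in rules:
        try:
            re.compile(pattern)
        except re.error as exc:
            bad.append((pattern, str(exc)))
        else:
            good.append((pattern, message))
    return good, bad
```
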