I have a small module that gets the lemma of a word and its plural form. It then searches through sentences looking for a sentence that contains both words (singular or plural) in either order. I have it working but I was wondering if there is a more elegant way to build this expression.

Note: Python2

import re

words = (("cell",), ("wolf", "wolves"))
string1 = "(?:" + "|".join(words[0]) + ")"
string2 = "(?:" + "|".join(words[1]) + ")"
pat = ".+".join((string1, string2)) + "|" + ".+".join((string2, string1))
# pat is now: "(?:cell).+(?:wolf|wolves)|(?:wolf|wolves).+(?:cell)"

Then the search:

pat = re.compile(pat)
for sentence in sentences:
    if len(pat.findall(sentence)) != 0:
        print sentence+'\n'
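
For reference, the two snippets above can be put together into one self-contained script. This is only a sketch: the sample sentences and the helper name `group` are made up for illustration, `re.escape` is added in case a word ever contains regex metacharacters, and `pat.search` is swapped in for `findall` since only a yes/no answer is needed. `print(...)` with a single argument behaves the same in Python 2 and Python 3.

```python
import re

# Hypothetical sample data, not from the original question
words = (("cell",), ("wolf", "wolves"))
sentences = [
    "The wolf has a cell.",
    "Wolves howl at night.",
    "Each cell of the wolves was tested.",
]


def group(forms):
    # Build a non-capturing alternation such as (?:wolf|wolves),
    # escaping each form so it is matched literally
    return "(?:" + "|".join(re.escape(f) for f in forms) + ")"


first, second = group(words[0]), group(words[1])
# Accept the two word groups in either order, with at least one character between them
pat = re.compile(first + ".+" + second + "|" + second + ".+" + first)

# search() returns a match object (truthy) or None, so no list of matches is built
matching = [s for s in sentences if pat.search(s)]
for s in matching:
    print(s)
```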

Alternatively, would this be a good solution?

words = (("cell",), ("wolf", "wolves"))
for sentence in sentences:
    sentence = sentence.lower()
    if any(word in sentence for word in words[0]) and any(word in sentence for word in words[1]):
        print sentence

2 Answers

You could use findall with a pattern like (cell)|(wolf|wolves) and check if every group was matched:

words = (("cell",), ("wolf","wolves"))
pat = "|".join(("({0})".format("|".join(forms)) for forms in words))
regex = re.compile(pat)
for sentence in sentences:
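    matches = regex.findall(sentence)
    # findall returns [] when nothing matches, and all() over an empty zip is True,
    # so require at least one match before checking the groups
    if matches and all(any(groupmatches) for groupmatches in zip(*matches)):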
        print sentence
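
A runnable sketch of the same idea, wrapped in a function so it can be tried on a few strings. The function name `has_every_group` and the sample strings are hypothetical. It assumes at least two word groups, so that `findall` returns tuples with one slot per group.

```python
import re

words = (("cell",), ("wolf", "wolves"))
# One capturing group per word set: (cell)|(wolf|wolves)
regex = re.compile("|".join("({0})".format("|".join(forms)) for forms in words))


def has_every_group(sentence):
    # Each element of matches is a tuple with one slot per group;
    # the slot is '' when that group did not take part in the match
    matches = regex.findall(sentence)
    # zip(*matches) turns the rows into one column per group;
    # the explicit emptiness check avoids all() being True on no matches
    return bool(matches) and all(any(column) for column in zip(*matches))


print(has_every_group("a wolf in a cell"))
print(has_every_group("only a cell here"))
print(has_every_group("nothing relevant"))
```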
    
A step further than me. Seems good to me. –  eyquem Dec 8 '13 at 20:17
    
Thanks! I will try this. –  Jesse Travis Dec 8 '13 at 21:42

You may find this way of writing it easier to read:

words = (('cell',), ('wolf','wolves'))

string1 = "|".join(words[0]).join(('(?:',')'))
print string1

string2 = "|".join(words[1]).join(('(?:',')'))
print string2

pat = "|".join((
    ".+".join((string1, string2)),
    ".+".join((string2, string1)),
))
print pat

My advice is also to use '.+?' instead of just '.+'. The non-greedy form saves the regex engine some work as it runs through the analysed string: it stops as soon as it encounters a match for the subpattern that follows, instead of consuming the rest of the string and backtracking.

Another advantage is that this style can be easily extended when there are several noun/plural pairs.
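
One possible way to sketch that extension is to generate every ordering of the word groups with `itertools.permutations`. The function name `build_pattern` is hypothetical, and `re.escape` is an addition not present in the code above. Note that the number of alternatives grows factorially with the number of groups, so this only suits a handful of them.

```python
import re
from itertools import permutations


def build_pattern(word_groups):
    # One non-capturing alternation per group, e.g. (?:wolf|wolves)
    groups = [
        "(?:" + "|".join(re.escape(word) for word in forms) + ")"
        for forms in word_groups
    ]
    # Join each ordering with a non-greedy gap, then accept any ordering
    return "|".join(".+?".join(order) for order in permutations(groups))


pat = build_pattern((("cell",), ("wolf", "wolves")))
print(pat)
```

With the two groups from the question this yields the same two-way alternation as before, only with `.+?` between the groups.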

    
Silly question but isn't ".+?" the same thing as ".*" ? –  Josay Dec 8 '13 at 19:50
@Josay No. See docs.python.org/2/library/re.html#regular-expression-syntax: .+ is greedy, .+? is non-greedy. With the pattern (?:cell).+(?:wolf|wolves) applied to '..cell....wolf.......', the regex engine matches cell, then .+ consumes all the subsequent characters, dots and wolf included, up to the end of the string. Only there does it find that nothing is left for (?:wolf|wolves) to match, so it has to backtrack and search again for that subpattern. –  eyquem Dec 8 '13 at 20:06
So the pattern (?:cell).+(wolf\d|wolves) will capture 'wolf2' in ',,cell,,wolf1,,,wolf2,,,', while (?:cell).+?(wolf\d|wolves) will capture 'wolf1'. –  eyquem Dec 8 '13 at 20:09
    
Thanks for the additional explanation! TIL. –  Josay Dec 8 '13 at 20:16
    
Thank you, I will use .+? –  Jesse Travis Dec 8 '13 at 21:43
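
The greedy versus non-greedy behaviour described in the comments above can be checked directly with the example string and the two patterns quoted there. Only the variable names are new.

```python
import re

text = ",,cell,,wolf1,,,wolf2,,,"

# Greedy: .+ runs to the end of the string, then backtracks to the LAST wolf
greedy = re.search(r"(?:cell).+(wolf\d|wolves)", text)
# Non-greedy: .+? grows one character at a time and stops at the FIRST wolf
lazy = re.search(r"(?:cell).+?(wolf\d|wolves)", text)

print(greedy.group(1))  # wolf2
print(lazy.group(1))    # wolf1
```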
