More Pythonic version of finding substring with one glob-like expression

Question

Giving the following example list the task is: find out whether the second expression is in the first one. * shall be considered as a wildcard for any character and one should be able to escape it \*
Hello,ell
This is good, is
CodeEval,C*Eval
Old,Young
should evaluate to
true
true
true
false

I wrote the following code which solves this problem. However, my solution seems very cumbersome and slow. Do you know how to improve it? (The task is for CodeEval and this solution is actually too slow. The actual solution which succeeded is much shorter but it does not check if the search_exprs are in the right order)

import sys
import re

def all_occ(ss, s):
    """ Finds All Occurences of substring ss in string s and returns
    for each occurence its last index
    """
    start = 0
    while True:
        try:
            start = s.index(ss,start)+len(ss)
        except ValueError:
            break
        yield start


def is_ss(s, ss):
    search_exprs = [x.replace('\*', '*') for x in re.split(r'(?<!\\)\*', ss)]

    # Get the list of all occurences of each search expression
    find_indices = [list(all_occ(sub, s)) for sub in search_exprs]

    # If any of the indices is not found, then the expression does not match
    if not any(find_indices):
        return "false"

    # Check if the expressions are in the right order
    start = 0
    for el in find_indices:
        start = min([e for e in el if e>start])
        if not start:
            return "false"

    return "true"

for line in open(sys.argv[1]).read().splitlines():
    if line:
        print is_ss(*line.split(','))

Here the actual solution that has been accepted by codeeval but would fail if the expressions were in the wrong order:

import sys
import re

def is_ss(s, ss):
    search_exprs = [x.replace('\*', '*') for x in re.split(r'(?<!\\)\*', ss)]

    if any(map(lambda x: x in s, search_exprs)):
        return "true"
    else:
        return "false"

for line in open(sys.argv[1]).read().splitlines():
    if line:
        print is_ss(*line.split(','))

Should one be able to escape escapes? – icktoofay Jun 21 '14 at 23:47 — icktoofay, Jun 21 '14 at 23:47

200_success · Accepted Answer · 2014-10-20 20:36:48Z

Language interpretation

Should one be able to escape escapes? – icktoofay Jun 21 at 23:47

No, one should not :) – ProfHase85 Jun 22 at 8:26

I agree with @icktoofay's suggestion. In the absence of any indication to the contrary, I would assume that any backslash in the globlike expression should cause the following character to be interpreted literally. That means that \\ (a pair of backslashes) should be interpreted as one literal backslash.

In general, any sane language that has an escaping mechanism must also have a way to let you specify a literal sequence of characters that looks like an escape. Examples:

String literals in C and Python: \\ is a literal backslash.
printf() format strings: %% produces a literal % character in the output.
HTML and XML: & is a literal &.
LDAP search filters: \5c matches a literal \.

If it were not possible to escape escapes, how would one search for a literal \* (backslash, asterisk) in the text? Based on norms established by C string literals, I would expect the globlike expression \\\* to accomplish that goal.

With that interpretation,

re.split(r'(?<!\\)\*', ss)

is insufficient for the task. The interpretation should depend on whether there are an even or odd number of backslashes preceding the asterisk.

Search technique

Anything other than a one-pass search is likely doomed to failure. If you search for literal substrings, you are likely to have problems handling overlapping sequences. Even if you fix the overlap problem, you would likely still have problems dealing with repeating substrings in the pattern. If there are multiple wildcards in the pattern, you would also need to support backtracking.

In short, I don't recommend that approach at all.

Driver

   for line in open(sys.argv[1]).read().splitlines():
       if line:
           print is_ss(*line.split(','))

I recommend using fileinput.input() so that the program will read its data from either stdin or a given filename.

Personally, I would also prefer not to read in the entire file at once, for scalability.

I would consider it good practice to limit split() to two elements. The problem is silent on how a third comma on the line should be interpreted, but I would treat it as a literal character in the globlike expression (even if it will always result in a match failure).

I would also decompose the problem differently. is_ss(s, ss) (confusingly named function and parameters, by the way) is responsible for interpreting ss as a pattern, performing the match, and returning "true" or "false". That's actually two or three separate tasks.

Suggested solution

Since matching individual substrings is a problematic strategy, and writing a pattern matcher from scratch seems like it would be tedious, I've chosen to translate globs into regular expressions. The trick is to escape everything except unescaped asterisks.

As a bonus, obtaining a regular expression converts the task to a solved problem. The performance characteristics of regular expression matching in Python are well known. Pushing the responsibility of performing the match to the main loop is also a benefit, in my opinion.

import fileinput
import itertools
import os
import re

def globlike_to_re(expr):
    def unescaped_globlike_to_re(expr):
        '''Convert a globlike expression to a regex pattern.  The expr
        parameter is guaranteed to contain no backslashes, unless it is the
        last character.  Therefore, any asterisk in expr means "match
        anything", and everything else is taken literally.'''
        return r'.*'.join(re.escape(literal) for literal in expr.split('*'))

    # [glob, literal, glob, literal, ..., glob]
    pairs = re.split(r'\\(.)', expr)

    lang = itertools.cycle([unescaped_globlike_to_re, re.escape])
    return re.compile(''.join(interp(expr) for interp, expr in zip(lang, pairs)))

for line in fileinput.input():
    haystack, needle = line.rstrip(os.linesep).split(',', 2)
    print 'true' if globlike_to_re(needle).search(haystack) else 'false'

Thanks a lot, learned very much from the code – ProfHase85 Oct 21 '14 at 9:25 — ProfHase85, Oct 21 '14 at 9:25

Cranium · Answer 2 · 2014-06-22 05:01:52Z

Your solution seems like an overkill for what is asked. You only need to answer if the second expression is in the first one. In your solution, you are searching for ALL occurrences, no wonder why it takes some extra time :)

For me, the most elegant solution would be like:

def is_ss(s, ss): # the signature can be improved for clarity :)
    return ss in s

Now, you know you can't do that because of the * cse, so you will need some regex to handle this. I think the function can be as "simple" as (not tested though):

def is_ss(s, ss):
    return re.match(re.sub(some_pattern, some_other_pattern, ss), s) is not None

About this piece of code:

re.match returns a MatchObject if it matches something or None if it doesn't.
re.sub is the regex-based replace. While it might be possible to use the string replace (with other functions), it will be clearer to do it this way.
I'm no regex expert (and it is 7am) so you'll have to figure out some_pattern and some_other_pattern used for the * case yourself, sorry!

Your aproach was the thing I tried in first place. It was even more cumbersome because: The 'some pattern' would mean to replace * not preceeded by ` by .*. (as *) is the right regex for this. now the problem is: What about all other characters that are possibly in my ss (i.e. ., [], \w`) and so on. They should be escaped as well. re.escape would not help, as it would escape everything. — ProfHase85, Jun 22 '14 at 8:33

t. fochtman · Answer 3 · 2014-10-20 20:06:38Z

Three modifications to the codeeval accepted script (numbered appropriately):

import sys
import re

# 1
def is_sorted(result):
    return all(result[i] <= result[i+1] for i in xrange(len(result)-1))

def is_ss(s, ss):
    search_exprs = [x.replace('\*', '*') for x in re.split(r'(?<!\\)\*', ss)]
    # 2
    result = [s.find(x) for x in search_exprs]

    # 3
    if -1 not in result and is_sorted(result):
        return True
    else:
        return False

for line in open(sys.argv[1]).read().splitlines():
    if line:
        print is_ss(*line.split(','))

Mod 1

This is just a function that checks if result is sorted. Who cares if result is sorted? Check Mod 2 for explanation.

Mod 2

s.find(x) returns the index of substring x in string s. If x doesn't occur in s -1 is returned. So referring back to Mod 1, if your search expressions appear sequentially in s then result is sorted.

Mod 3

A slight tweak to the conditional statement, n.b.d.

asked	2 years ago
viewed	191 times
active	2 years ago

current community

your communities

more stack exchange communities

More Pythonic version of finding substring with one glob-like expression

3 Answers 3

Language interpretation

Search technique

Driver

Suggested solution

Your Answer

Not the answer you're looking for? Browse other questions tagged python strings regex escaping or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

More Pythonic version of finding substring with one glob-like expression

3 Answers 3

Language interpretation

Search technique

Driver

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python strings regex escaping or ask your own question.

Related

Hot Network Questions