Longest common subsequence (LCS) for multiple strings

Question

How would you improve this code? Particularly the check and checkall functions?

I believe the time complexity of mlcs is \$O(|\Sigma|MN)\$, where \$\Sigma\$ is the alphabet, \$M\$ is the number of strings and \$N\$ is the length of each string. Is that right?

My analysis: candidates() performs \$O(|\Sigma|M)\$ operations and is called \$O(N)\$ times.

This is based on the reviewed code posted earlier at Multiple longest common subsequence (another algorithm):

def check(string, seq):
    i = 0
    j = 0
    while i < len(string) and j < len(seq):
        if string[i] == seq[j]:
            j += 1
        i += 1
    return len(seq) - j

def checkall(strings, seq):
    for x in strings:
        a = check(x, seq)
        if not a == 0:
            print(x, seq, a)
            return False
    return True

def mlcs(strings):
    """Return a long common subsequence of the strings.
    """
    if not strings:
        raise ValueError("mlcs() argument is an empty sequence")
    strings = list(set(strings)) # deduplicate
    alphabet = set.intersection(*(set(s) for s in strings))

    # indexes[letter][i] is list of indexes of letter in strings[i].
    indexes = {letter:[[] for _ in strings] for letter in alphabet}
    for i, s in enumerate(strings):
        for j, letter in enumerate(s):
            if letter in alphabet:
                indexes[letter][i].append(j)

    # Generate candidate positions for next step in search.
    def candidates():
        for letter, letter_indexes in indexes.items():
            candidate = []
            for ind in letter_indexes:
                if len(ind) < 1:
                    break
                q = ind[0]
                candidate.append(q)
            else:
                yield candidate

    result = []
    while True:
        try:
            # Choose the closest candidate position, if any.
            pos = None
            for c in candidates():
                if not pos or sum(c) < sum(pos):
                    pos = c
            letter = strings[0][pos[0]]
        except TypeError:
            return ''.join(result)
        for let, letter_indexes in indexes.items():
            for k, ind in enumerate(letter_indexes):
                ind = [i for i in ind if i > pos[k]]
                letter_indexes[k] = ind
        result.append(letter)

strings = []
# Alphabet for the strings.
sigma = ["a", "b", "c", "d"]
# Alphabet for the LCS.
sigmax = ["e", "f", "g"] 
import random
Nx = 67 # Length of LCS.
N = 128 # Length of strings. N >= Nx.
M = 128 # Number of strings.
x = ""
for _ in range(Nx):
    x += random.choice(sigmax)
for _ in range(M):    
    string = ""
    for _ in range(N):
        string += random.choice(sigma)
    indexes = list(range(N))
    random.shuffle(indexes)
    indexes = sorted(indexes[:len(x)])
    for j in range(len(x)):
        string = string[:indexes[j]]+x[j]+string[indexes[j]+1:]
    strings += [string]

#strings = ["abbab", "ababa", "abbba"]
#strings = ["abab", "baba", "abaa"]
#strings = ["bacda", "abcde", "decac"]
#strings = ["babbabbb", "bbbaabaa", "abbbabab", "abbababa"]
#strings = ["ab", "aba"]
#print("Strings:")
#print(strings)
l = mlcs(strings)
print("LCS:")
print(l, len(l), checkall(strings, l))

Joe Wallis · Accepted Answer · 2015-11-30 22:18:09Z

Conventions

It's highly recommended to put all imports at the top of your file. Finding import random halfway through the script means readers discover a dependency they did not expect.

It's also better to be safe than sorry, so you may want to wrap your global code in an if __name__ == '__main__': guard.

One-line docstrings by convention have both """ on the same line. You must be thinking of multi-line docstrings, where the last """ goes on a new line.
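As a rough illustration of the three conventions above, here is a minimal sketch of how the test script could be laid out. The names random_string and main are hypothetical, introduced only for this example; they are not part of the question's code.

```python
import random  # imports come first, so dependencies are visible up front


def random_string(alphabet, length):
    """Return a random string of the given length over alphabet."""
    return ''.join(random.choice(alphabet) for _ in range(length))


def main():
    # Hypothetical driver standing in for the question's global test code.
    strings = [random_string("abcd", 8) for _ in range(4)]
    print(strings)


if __name__ == '__main__':
    main()
```

With this layout, importing the file from another module defines the functions without running the test code.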

Code

`checkall`

Python has some strong readability conventions. For example, in checkall you apply the not operator to the result of an equality test. This is strange, as you could just use the not-equals operator, !=. But for an integer a, if a != 0: is true exactly when if a: is true, so it's better to just write if a:.

# Using two operators rather than one
if not a == 0:
    ...

# One operator, but still an explicit comparison with 0.
if a != 0:
    ...

# What you want to be doing:
if a:
    ...

I like the idea of writing declarative statements from functional programming, which can really help readability in checkall.

Stated in words, the algorithm is:
Go through the list of strings; if check returns a truthy (non-zero) value for any of them, return False, otherwise return True.

Which can be converted to:

def checkall(strings, seq):
    return not any(check(x, seq) for x in strings)
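One way to convince yourself that the one-liner matches the original loop is to run both side by side. This is only a sketch: checkall_loop is a hypothetical name for the question's version with the debugging print removed, and the test strings are taken from the commented-out examples in the question.

```python
def check(string, seq):
    """Return how many characters of seq were left unmatched in string."""
    i = 0
    j = 0
    while i < len(string) and j < len(seq):
        if string[i] == seq[j]:
            j += 1
        i += 1
    return len(seq) - j


def checkall_loop(strings, seq):
    # The question's loop, minus the debugging print.
    for x in strings:
        if check(x, seq):
            return False
    return True


def checkall(strings, seq):
    return not any(check(x, seq) for x in strings)


strings = ["abbab", "ababa", "abbba"]
for seq in ["aba", "abb", "bbb", ""]:
    assert checkall(strings, seq) == checkall_loop(strings, seq)
```

Both versions stop at the first failing string, since any short-circuits. The one thing the one-liner drops is the print of the failing string, so keep the loop if you rely on that output for debugging.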

`check`

This one is a bit 'out there', since it's harder to think backwards through an algorithm, or at least it is for me. Changing check to run backwards can improve readability: rather than testing n < len(...) you can just test n >= 0.

It's not that big a difference, so you may not want to bother.
But it will save a few cycles on all the len calls.
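A minimal sketch of what a backwards check might look like, assuming the only thing callers care about is whether the result is zero. The name check_backwards is hypothetical. Note that the non-zero values can differ from the forward version: this one counts the unmatched characters at the front of seq, while the original counts the unmatched characters at the end.

```python
def check_backwards(string, seq):
    """Return 0 if seq is a subsequence of string, else a positive count."""
    i = len(string) - 1
    j = len(seq) - 1
    # Walk both strings from the end; len() is only called once per string.
    while i >= 0 and j >= 0:
        if string[i] == seq[j]:
            j -= 1
        i -= 1
    # j + 1 characters at the front of seq were never matched.
    return j + 1
```

Since checkall only tests the result for truthiness, swapping this in should not change its answer.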

`mlcs`

Poor variable names such as s make reading a little harder. s is short for string, but it could just as well be super, special or soup. enumerate(string) is much easier to comprehend, and the name no longer gets lost among the other one-letter variables.

This led to me not quickly understanding what you are doing in the parts after candidates.

There are other problems: candidates can be inlined into the for loop rather than being a separate function, and you should try to reduce the 8 for-loops that the function has.
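As one possible way to fold candidates into the selection loop, here is a hedged sketch. The helper name closest_candidate is hypothetical, and it assumes indexes has the same shape as in the question: a dict mapping each letter to a list of per-string position lists.

```python
def closest_candidate(indexes):
    """Return (letter, positions) with the smallest position sum, or None."""
    best = None
    for letter, letter_indexes in indexes.items():
        # A letter is only usable if every string still has an occurrence left.
        if not all(letter_indexes):
            continue
        candidate = [ind[0] for ind in letter_indexes]
        if best is None or sum(candidate) < sum(best[1]):
            best = (letter, candidate)
    return best
```

Inside mlcs, the main loop could then call this helper and return ''.join(result) when it gets None, which would also remove the need for the try/except TypeError.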

Janne Karila · Answer 2 · 2015-12-01 20:19:22Z

Concerning check and checkall, I would change function names to describe what they are checking, and iterate over the characters using for loops:

def is_subsequence(string, seq):
    string_iter = iter(string)
    for seq_char in seq:
        for string_char in string_iter:
            if seq_char == string_char:
                break
        else: # ran out of string
            return False
    return True

def is_common_subsequence(strings, seq):
    return all(is_subsequence(string, seq) for string in strings)
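The inner loop in is_subsequence works because string_iter is a single shared iterator: each inner for resumes where the previous one stopped instead of restarting at the beginning of the string. A small standalone demonstration of that behaviour (not part of the answer's code):

```python
string_iter = iter("abcde")

# Consume characters until we find "b", then stop.
for char in string_iter:
    if char == "b":
        break

# The iterator remembers its position, so only the tail is left.
rest = list(string_iter)
print(rest)  # ['c', 'd', 'e']
```

This is why each character of string is examined at most once, no matter how long seq is.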

SuperBiasedMan · Answer 3 · 2015-11-30 22:17:13Z

You have a strange test for not zero. Instead of this

if not a == 0:

people usually use the not equal to operator, !=. You could alternatively use truthiness. In Python, an integer interpreted as a boolean is False for 0 and True for any other number, so you could use either of these cases:

if a != 0:
if a:
