0

I recive 2 strings. I have to implement a recursive function that return the best option for them to be similar by adding -. I give an exemple cause this is not good description

str1="sdfsdf"
str2="dffd"

the best option will be

str1="sdfsdf-"
str2="-df--fd"

how do I implement this is recursion?? I dont know how to start thinking on it.

4
  • 3
    en.wikipedia.org/wiki/Levenshtein_distance might be a good start Commented Nov 16, 2013 at 9:58
  • You can't add - to sdfsdf to get sdfdf-, you also have to remove an s. Can you clarify exactly what you mean? Commented Nov 16, 2013 at 10:20
  • This isn't an easy task, just saying. Commented Nov 16, 2013 at 10:27
  • oh I made mistake. str1=sdfsdf- str2=-df--fd Commented Nov 16, 2013 at 10:29

3 Answers 3

1

Actually, this is a variant of the classic Longest Common Subsequence problem. You can solve this basic dynamic programming by first finding the LCS, and then adding '-' to unify the two given string. Recursive algorithm follows a top-down approach but often leads to low efficiency, while the non-recursive version could solve the basic LCS problem bottom up and helps you understand the core part of dynamic programming.

def lcs(a, b):
    m, n = len(a), len(b)
    stable = [[[0,'-'] for j in xrange(n+1)] for i in xrange(m+1)]
    for i in xrange(1, m+1):
        for j in xrange(1, n+1):
            lu, l, u = stable[i-1][j-1][0], stable[i][j-1][0], stable[i-1][j][0]
            stable[i][j][0] = (lu + 1) if a[i-1] == b[j-1] else max(l,u)
            if a[i-1] == b[j-1]:
                stable[i][j][1] = '`'
            elif l > u:
                stable[i][j][1] ='<'
            else:
                stable[i][j][1] ='^'

    i, j = m, n
    similara, similarb = list(a), list(b)
    while i > 0 and j > 0:
        if stable[i][j][1] == '`':
            lcstring = a[i-1] + lcstring
            i -= 1
            j -= 1
        elif stable[i][j][1] == '<':
            similara.insert(i, '-')
            j -= 1
        elif stable[i][j][1] == '^':
            similarb.insert(j, '-')
            i -= 1
    while len(similara) < len(similarb):
        similara.insert(0,'-')
    while len(similarb) < len(similara):
        similarb.insert(0,'-')
    return stable[len(a)][len(b)][0], lcstring,''.join(similara), ''.join(similarb)

you can get the length of LCS, the LCS itself, the 2 converted strings by the above function

>>>print lcs('sdfsdf', 'dffd')
3, 'dfd', sd-fsdf, -dff-d-
0

It is exactilly the sequence comparision question in biology. The only difference is in DNA we only have A T C G and -, anyway the algo is the same. google it and you will find a lot of answers. It is one of the fundamental questions in Bioinfomatics and has been thoroughly studyed.

0

You can solve this using, for example, dynamic programming (although there are more efficient algorithms, some of which are described in the article on the Longest Common Subsequence Problem on wikipedia).

Here's a simple but inefficient recursive solution. Note that in the worst case it runs in time exponential to the size of the input strings.

def match(a, b, prefa='', prefb='', cost=0):
    if not a:
        return cost + len(b), prefa + '-' * len(b), prefb + b
    if not b:
        return cost + len(a), prefa + a, prefb + '-' * len(a)
    if a[0] == b[0]:
        return match(a[1:], b[1:], prefa + a[0], prefb + b[0], cost)
    return min([match(a[1:], b, prefa + a[0], prefb + '-', cost+1),
                match(a, b[1:], prefa + '-', prefb + b[0], cost+1)])

It outputs the cost (the number of '-'), as well as the diffed strings.

>>> print match('sdfsdf', 'dffd')
(4, 'sdf-sdf', '-dff-d-')

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.