Finding the longest common subsequence algorithm using hash table

Question

I've designed an algorithm to find the longest common subsequence. This is how it works:

Initially, i = 0.

1. Pick the first letter from the first string, starting from the ith letter.
2. Search the second string for that letter.
3. If it is not found, return to the first string, pick the next letter, and repeat steps 1 to 3 until a match is found in the second string.
4. Once a common letter is found in the second string, append it to common_subsequence.
5. Store its position in index.
6. Pick the next letter from the first string and do step 2, but this time start searching from index.
7. Repeat steps 3 to 6 until the end of string 1 or string 2 is reached.
8. If common_subsequence is longer than the longest common subsequence found so far, assign common_subsequence to lcs.
9. Increment i and repeat steps 1 to 9 until i equals the length of the first string.

Here is an example:

X = A, B, C, B, D, A, B
Y = B, D, C, A, B, A

First, it picks A from X.
It looks for A in Y.
Having found A, it appends it to common_subsequence.
It picks B from X.
It looks for B in Y, but this time starts searching just after the A it found.
It picks C. C does not occur in the rest of Y, so it picks the next letter in X, which is B.
...
...
...

The complexity of this algorithm is theta(n*m).

I implemented it in two ways. The second uses a hash table, but after implementing it I found that it is much slower than the first. I can't understand why.

Here is my implementation:

First algorithm:

import time
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if string is empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        start = 0 # start position in ystr
        for item in xstr[i:]:
            index = ystr.find(item, start) # position at the common letter
            if index != -1: # if common letter is found
                cs += item # add common letter to the cs
                start = index + 1
            if index == len(ystr) - 1: break # if reached to the end of ystr
        # updates lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

Second algorithm, using a hash table:

import time
from collections import defaultdict
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if strings are empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    location = defaultdict(list) # keeps track of items in the ystr
    i = 0
    for k in ystr:
        location[k].append(i)
        i += 1
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        index = -1
        reached_index = defaultdict(int)
        for item in xstr[i:]:
            for new_index in location[item][reached_index[item]:]:
                reached_index[item] += 1
                if index < new_index:
                    cs += item # add item to the cs
                    index = new_index
                    break
            if index == len(ystr) - 1: break # if reached to the end of ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

Winston Ewert · Accepted Answer · 2013-01-09 20:31:38Z

Firstly, your algorithm is incorrect. Try:

lcs("AAAABCC", "AAAACCB"): the LCS should be "AAAACC", but your algorithm finds "AAAAB".
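
For checking test cases like this one, it helps to have a reference result. Below is a minimal sketch of the textbook dynamic-programming LCS, not the code from the question; the name lcs_dp is my own. It takes O(n*m) time and space and returns one longest common subsequence.

```python
def lcs_dp(xstr, ystr):
    """Return one longest common subsequence of xstr and ystr (textbook DP)."""
    n, m = len(xstr), len(ystr)
    # table[i][j] holds the LCS length of xstr[:i] and ystr[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if xstr[i - 1] == ystr[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one subsequence.
    result = []
    i, j = n, m
    while i > 0 and j > 0:
        if xstr[i - 1] == ystr[j - 1]:
            result.append(xstr[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(result))
```

With this sketch, lcs_dp("AAAABCC", "AAAACCB") returns "AAAACC", one letter longer than the "AAAAB" found by the greedy search.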

Secondly, your algorithm is O(n^2*m), not O(n*m). Since you don't explain why you think it is theta(n*m), I can't guess where your analysis went wrong.

Your second version attempts to optimize the search through the string by using a list of pre-calculated positions. This means you don't have to scan past positions in the string that hold different characters. However, you lose the ability to skip all the positions before your starting index. For long strings with few distinct characters, you end up losing out.
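
One way to keep the position lists and still skip everything before the starting index is a binary search. The following is a sketch under my own assumptions, not the code from the question: the helper names build_positions and greedy_match are hypothetical, and it relies on the standard bisect module. It covers a single greedy pass (one value of i) and only speeds up the lookup; it does not fix the correctness problem described above.

```python
from bisect import bisect_right
from collections import defaultdict


def build_positions(ystr):
    """Map each character to the sorted list of indices where it occurs in ystr."""
    positions = defaultdict(list)
    for idx, ch in enumerate(ystr):
        positions[ch].append(idx)
    return positions


def greedy_match(xstr, positions, ylen):
    """One greedy pass: match the letters of xstr, left to right, against ystr."""
    cs = []
    index = -1  # position in ystr of the last matched letter
    for item in xstr:
        occurrences = positions.get(item, [])
        # Binary search for the first occurrence strictly after index,
        # instead of walking the list from its beginning.
        k = bisect_right(occurrences, index)
        if k < len(occurrences):
            cs.append(item)
            index = occurrences[k]
            if index == ylen - 1:
                break  # reached the end of ystr
    return ''.join(cs)
```

Each lookup then costs O(log k), where k is the number of occurrences of that letter in ystr, and no list slices are copied. For example, greedy_match("AAAABCC", build_positions("AAAACCB"), 7) still returns "AAAAB", the same answer as the original greedy search.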

I made a few changes to my algorithm, and now it passes your test case. I uploaded it here: pastebin.com/030Uhpcr. The only change is that it calls the function twice: first lcs(xstr, ystr) and then lcs(ystr, xstr). But I still think its complexity is theta(n*m), because it loops through the second string n times. – Sajjad Rastegar Jan 10 '13 at 13:46

@Rastegar, your algorithm is still incorrect. Try "AAAABCCD" and "AAAADCCB". As for complexity, ystr.find is called n*(n/2) times, or O(n^2). The complexity of ystr.find is O(m), thus the cost is O(n^2*m). It doesn't loop through the second string m times, because you've got two nested for loops there, not one. – Winston Ewert Jan 10 '13 at 14:14

OK, it seems my program does fail. But its complexity is theta(n*m), because ystr.find(item, start) doesn't search from the beginning of the string; it starts from start, just past where it found the common letter in the previous search. And after reaching the end of ystr, it exits the second loop. – Sajjad Rastegar Jan 10 '13 at 15:55

@Rastegar, OK, I missed a subtlety in your algorithm. I thought start was being reset more than it was. So yes, it appears to be theta(n*m), but that's all moot because it doesn't work. – Winston Ewert Jan 10 '13 at 18:11
