Largest common subsequence algorithm in C++

Question

I'm relatively new to C++ and am trying to learn good programming practices and habits from the beginning. The code I wrote finds the largest common subsequence between to strings (and output the resulting subsequence, not the length of the subsequence).

I was wondering if there were any improvements I could make to make this more efficient. I know that it is exponential run-time, and there is a more efficient approach via dynamic programming, but I am simply interested in improving this specific algorithm (i.e. the recursive approach). For instance, I know that maybe concatenating my strings via += may be expensive, or maybe there is a way to use references to avoid the copying of strings.

#include <string>
#include <iostream>

// utility function to return the largest (wrt length) of two strings
std::string& max(std::string &s1, std::string &s2)
{
    return s1.size() > s2.size() ? s1 : s2;
}

// Return the lowest common sequence of characters contained in the
// strings s1 and s2 (as a string)
// e.g. lcs("ACBEA", "ADCA") -> "ACA" (string objects, not const char *)
std::string lcs(std::string s1, std::string s2)
{
    // initialize the lcs to be an empty string
    std::string lcsStr;

    if (s1.empty() || s2.empty())
        return std::string(); // empty string constructor

    if (s1.at(s1.size() - 1) == s2.at(s2.size() - 1))
    {
        lcsStr.push_back(s1.at(s1.size() - 1));
        s1.pop_back();
        s2.pop_back();
        return lcs(s1, s2) += lcsStr;
    }

    else
        return max(
            lcs(s1.substr(0, s1.size() - 1), s2), lcs(s1, s2.substr(0, s2.size() - 1))
        ) += lcsStr;
}


int main()
{
    std::cout << "Enter two strings:" << std::endl;
    std::string s1, s2;
    std::cin >> s1 >> s2;

    std::cout << lcs(s1, s2) << std::endl;

    return 0;
}

Toby Speight · Accepted Answer · 2024-09-14 16:41:22Z

The function max() is easily confused with std::max() which has different behaviour (it returns the lexicographically later of the two arguments) and may be found by argument-dependent lookup from std::string. That's exactly what happens for me in lcs() where we attempt to call it with two rvalues. I suggest calling this function longest() instead to avoid that problem.

We need to be careful when we use it, because it accepts and returns mutable references. So we need to ensure that the referent outlives the result and that modifying through the reference will be acceptable.

The comment // Return the lowest common sequence of characters seems to have a mistake - shouldn't "lowest" be "longest"? It's slightly inaccurate to suggest that there's a single longest common sequence, since there may be several of the same length.

s1.at(s1.size() - 1) is more readable if written s1.back().

lcsStr only ever holds a single character (or none). So we could replace it with a char variable.

In main(), we prompt for two strings, but what we actually read is two words (i.e. non-empty and not containing whitespace).

We don't check whether the read was successful before continuing with possibly invalid data.

Here's a lightly-modified version that fixes the above issues:

#include <string>

// utility function to return the longest of two strings
std::string longest(std::string s1, std::string s2)
{
    return s1.size() < s2.size() ? s2 : s1;
}

// Return a longest common sequence of characters contained in the
// strings s1 and s2.
// e.g. lcs("ACBEA", "ADCA") -> "ACA"
std::string lcs(std::string s1, std::string s2)
{
    if (s1.empty() || s2.empty()) {
        return {};
    }

    if (s1.back() == s2.back())
    {
        char last = s1.back();
        s1.pop_back();
        s2.pop_back();
        return lcs(s1, s2) + last;
    } else {
        return longest(lcs(s1.substr(0, s1.size() - 1), s2),
                       lcs(s1, s2.substr(0, s2.size() - 1)));
    }
}

#include <cstdlib>
#include <iostream>

int main()
{
    std::cout << "Enter two words:" << std::endl;
    std::string s1, s2;
    std::cin >> s1 >> s2;
    if (!std::cin) {
        std::cerr << "Failed to read input\n";
        return EXIT_FAILURE;
    }

    std::cout << lcs(s1, s2) << '\n';
}

With those easy changes made, we can then look at performance. We may get slightly faster code using string views rather than copying full strings around, but our biggest problem is that the recursive invocation of lcs() twice in the else path causes our runtime to roughly double for each character we add to input. That means that although finding the longest common subsequence of ABCDEFGHIJKLM and NOPQRSTUVWXYZ runs in under a second, doing the same for ABCDEFGHIJKLMNOPQRSTUVWXYZ and abcdefghijklmnopqrstuvwxyz doesn't complete in the full minute I allow it (and would likely take around 100 minutes).

One simple improvement is to observe that there's no point even considering characters that don't appear in the other string:

    } else {
        while (s2.find(s1.back()) == s2.npos) {
            s1.pop_back();
            if (s1.empty()) { return {}; }
        }
        while (s1.find(s2.back()) == s1.npos) {
            s2.pop_back();
            if (s2.empty()) { return {}; }
        }
        return longest(lcs(s1.substr(0, s1.size() - 1), s2),
                       lcs(s1, s2.substr(0, s2.size() - 1)));
    }

This makes the problem case above (no characters in common) run effectively instantly, and reduces my next pathological input - ABCDEFGHIJKLMNOPQRSTUVWXYZ and ZYXWVUTSRQPONMLKJIHGFEDCBA from about an hour down to 5 seconds.

Stack Exchange Network

Largest common subsequence algorithm in C++

1 Answer 1

Your Answer

Hot Network Questions

Largest common subsequence algorithm in C++

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Related

Hot Network Questions