Largest sequence found in all strings

Question

**Note: I might put this on Code Golf too, but first I'd like to get this code reviewed.

So there's this programming contest my school will be holding soon, which is about solving problems in the shortest (writing) time possible. I want to participate, but I discovered that I suck with puzzles. So today I begun doing a few example problems from past competitions.

Problem Description

A genome is a sequence of characters (A,C,G or T). Sometimes, many genomes share the same sequence of characters:

ACCC GTT
A GTT AAC

Given a set of genomes, you will have to find the size of the largest character sequence that is repeated in all given genomes.

Input

The first line is an integer, which just tells you the number of genomes you will be given. The next lines are the genome strings themselves (one per line).

Example:

2
ACGGGCGTCGTCCCCGTCGTCGTATC
CTCGTCGTCCCCGTCGTCGTGTC

Output

Integer representing the size of the largest character sequence you found in all given genomes.

Example (based on previous input):

My Analysis

Not gonna lie: I'm not good with clever math voodoo. I just decided to create all possible sequences and then grab the largest from that is found in all genomes. The key to that was using std::string::find(), which I guess is not the brightest idea ever.

My Solution You may notice that my variables and such are in Spanish. Yeah.

#include <iostream>
#include <deque>

using namespace std;

int main() {
    // Read data
    int cantidad; cin >> cantidad; // Number of genomes
    deque<string> genomas;         // Deque of genomes
    string menorString;            // Smallest genome
    for (int i = 0; i < cantidad; ++i) {
        string in; cin >> in;      // Read a genome
        genomas.push_back(in);     // Send to deque
        // Check if this is the smallest genome so far
        if (i == 0) {
            menorString = in;
        }else {
            if (in.length() < menorString.length()){
                menorString = in;
            }
        }
    }

    // Create all possible sequences
    deque<string> sequences;       // Deque for sequences
    unsigned long size = menorString.length();
    // We will use the smallest genome found
    for (int a = 0; a < size; ++a) {
        // For each char in this genome...
        // We begin building a sequence using this char.
        string temp = string(1,menorString.at(a));
        // And store the sequence
        sequences.push_back(temp);
        for (int b = a + 1; b < size; ++b) {
            // For each character AFTER this character...
            // We append it to our current sequence...
            temp.append(1,menorString.at(b));
            // And store it.
            sequences.push_back(temp);
        }
    }

    // Now time to find the largest sequence found in ALL genomes.
    string largestSequence;
    for (int i = 0; i < sequences.size(); ++i) {
        // For each sequence...
        bool genomas_ok = true;
        for (int g = 0; g < genomas.size() && genomas_ok; ++g) {
            // We check all genomes...
            string genoma = genomas[g];
            // We search for the sequence to be within the genome
            if (genoma.find(sequences[i]) == string::npos) {
                // Sequence not found in genome. Discard this sequence.
                genomas_ok = false;
            }
        }
        if (genomas_ok) {
            // All genomes do have this sequence!
            // Check if is the largest one we got so far...
            if (i == 0) {
                largestSequence = sequences[i];
            }else{
                if (sequences[i].size() > largestSequence.size()) {
                    largestSequence = sequences[i];
                }
            }
        }
    }
    // Print size of largest genome.
    cout << largestSequence.size();
    return 0;
}

Does it work?

Well, it seems to work. It works for the example given above, and with a few of my own tests.

What I need

This code looks pretty large. Not good for a speed contest (I suppose). Can you help me figure out improvements for this? Namely, improvements that make the code smaller - not necessarily about efficiency (but it is appreciated as well!).

I suppose that commenting almost every line is not good for a contest either. Haha...

Matt Johnson · Answer 1 · 2013-05-04 02:19:11Z

The book programming perls has one of the most clever implemenations to find the longest repeated substring. He sorts an array of pointers to the substrings. This would be the place that I would start for an implementation.

/* From 'Programming Pearls' by Jon Bentley */
/* longdup.c -- Print longest string duplicated M times */

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int pstrcmp(char **p, char **q)
{   return strcmp(*p, *q); }

int comlen(char *p, char *q)
{   int i = 0;
    while (*p && (*p++ == *q++))
        i++;
    return i;
}

#define M 1
#define MAXN 5000000
char c[MAXN], *a[MAXN];

int main()
{   int i, ch, n = 0, maxi, maxlen = -1;
    while ((ch = getchar()) != EOF) {
        a[n] = &c[n];
        c[n++] = ch;
    }
    c[n] = 0;
    qsort(a, n, sizeof(char *), pstrcmp);
    for (i = 0; i < n-M; i++)
        if (comlen(a[i], a[i+M]) > maxlen) {
            maxlen = comlen(a[i], a[i+M]);
            maxi = i;
        }
    printf("%.*s\n", maxlen, a[maxi]);
    return 0;
}

asked	5 months ago
viewed	90 times
active	5 months ago

Largest sequence found in all strings

1 Answer

Your Answer

Not the answer you're looking for? Browse other questions tagged c++ learning contest-problem or ask your own question.

Largest sequence found in all strings

1 Answer

Your Answer

Sign up or login

Post as a guest

Not the answer you're looking for? Browse other questions tagged c++ learning contest-problem or ask your own question.

Related