Finding longest phrase algorithm

Question

I have this task to find the longest identical phrase which appears in 2 given files. With small files (~500KB each) it works just fine, ~10 secs to finish. But with larger files (~5MB each) it took ~43 mins to complete. The text in larger files are just copied and pasted over and over again. And I want to improve that if possible.

public static void LongestPhrase(string Book1, string Book2, ref string Phrase, ref int WIndex1, ref int WIndex2)
{
    string[] Words1 = Book1.Split('-');
    string[] Words2 = Book2.Split('-');
    string Fragment = string.Empty;
    int Counter1 = 0, Counter2 = 0;

    Dictionary<string, List<int>> Index1 = new Dictionary<string, List<int>>();
    Dictionary<string, List<int>> Index2 = new Dictionary<string, List<int>>();
    List<string> Repeated = new List<string>();
    List<string> Word1 = new List<string>();
    List<string> Word2 = new List<string>();

    foreach(string Nfo in Words1)
    {
        if (!Word1.Contains(Nfo.ToLower()))
            Word1.Add(Nfo.ToLower());
    }

    foreach (string Nfo in Words2)
    {
        if (!Word2.Contains(Nfo.ToLower()))
            Word2.Add(Nfo.ToLower());
    }

    foreach(string Nfo in Word1)
    {
        if (Word2.Contains(Nfo.ToLower()))
            Repeated.Add(Nfo.ToLower());
    }


    for (int i = 0; i < Words1.Length; i++)
    {
        if (Repeated.Contains(Words1[i].ToLower()))
        {
            if (!Index1.ContainsKey(Words1[i].ToLower()))
            {
                Index1.Add(Words1[i].ToLower(), new List<int>());
                Index1[Words1[i].ToLower()].Add(i);
            }
            else
                Index1[Words1[i].ToLower()].Add(i);
        }
    }

    for (int x = 0; x < Words2.Length; x++)
    {
        if (Repeated.Contains(Words2[x].ToLower()))
        {
            if (!Index2.ContainsKey(Words2[x].ToLower()))
            {
                Index2.Add(Words2[x].ToLower(), new List<int>());
                Index2[Words2[x].ToLower()].Add(x);
            }
            else
                Index2[Words2[x].ToLower()].Add(x);
        }
    }

    foreach (string Nfo in Repeated.OrderByDescending(a => a.Length))
    {

        foreach (int Inf in Index1[Nfo])
        {
            Counter1 = Inf;

            foreach (int Infs in Index2[Nfo])
            {
                Counter2 = Infs;
                int Start1 = 0, Start2 = 0;

                Fragment = Words1[Counter1];

                for (Start1 = Counter1 + 1, Start2 = Counter2 + 1; Start1 < Words1.Length && Start2 < Words2.Length; Start1++, Start2++)
                {
                    if (Words1[Start1].ToLower() == Words2[Start2].ToLower())
                        Fragment += " " + Words1[Start1];
                    else
                        break;
                }

                if (Fragment.Length > Phrase.Length)
                {
                    Phrase = Fragment;
                    WIndex1 = Counter1;
                    WIndex2 = Counter2;
                }
            }
        }
    }
}

static void Main(string[] args)
{
    var watch = Stopwatch.StartNew();
    Console.WriteLine("A");
    string Book1 = File.ReadAllText("Book1.txt", Encoding.GetEncoding(1257));
    string Book2 = File.ReadAllText("Book2.txt", Encoding.GetEncoding(1257));
    string UnfilteredBook1 = Regex.Replace(Book1, @"\s+", "-", RegexOptions.Singleline);
    string UnfilteredBook2 = Regex.Replace(Book2, @"\s+", "-", RegexOptions.Singleline);
    string Phrase = string.Empty; 
    int WordIndex1 = 0, WordIndex2 = 0;
    LongestPhrase(UnfilteredBook1, UnfilteredBook2, ref Phrase, ref WordIndex1, ref WordIndex2);

    watch.Stop();
    var elapsedMs = watch.ElapsedMilliseconds;
    Console.WriteLine("-> {0}", elapsedMs);
}

What I do in my function is this:

Make 2 lists from both books containing all different words
Make a list of words which are in both books
Find all positions of repeated words
Go through all repeated words through all found positions and search for phrase

I think 1-3 steps are all right and that 4th could be improved. I just don't have any ideas how. Should I change the approach to find longest phrase?

Dictionary<string, List<int>> is a bomb of inefficiency. Dictionaries are very nice for small sizes; but when dealing with relevant amounts of data they are not good. Needless to say that the List<int> part makes this version particularly aggressive. An almost-working-always way to notably improve the performance of a code dealing with relevant amounts of data is replacing all the dictionaries with more efficient collections (arrays); ideally, all the redundant (global) intermediate storage should be removed by properly analysing the algorithm. — varocarbas, 4 hours ago
Regex is quite slow and should be used only when strictly required (extraction of text following a complex pattern). Most of alternatives for simplistic actions (e.g., Split, Replace or LINQ methods) are notably faster. — varocarbas, 4 hours ago
If you are already consuming too much memory, you should avoid things like ReadAllText. The stream alternative (StreamReader in this case) might need some additional lines of code, but would be much safer and, in this context (where apparently memory is already being over-used), would contribute to improve the performance. — varocarbas, 4 hours ago
All these are advices at first sight (after a quick look at the code; without properly analysing/understanding it); also it seems that the part of the loops can be highly improved: too many consecutive different loops performing too similar actions doesn't seem right. I hope that any of this helps. — varocarbas, 4 hours ago

Avner Shahar-Kashtan · Answer 1 · 2015-11-20 22:09:44Z

The first thing that leaped to my eye as that you're using List<string> for huge strings and calling Contains on them. This is very bad for performance, because List<>'s search is an O(n) operation - to check if an item exists, it has to go linearly through the entire collection until it's found.

The data structure you want to be using is HashSet<string>, where checking for the existence of a given string is an O(1) operation, on average. It makes all set operations faster. Note that it can accept a StringComparer object to make it case insensitive, which saves you having to call ToLower on every words, which also slows you down - each ToLower call creates a new String object in memory, which, for large books, will cause a lot of memory pressure.

So the first part of your method can be expressed this way:

string[] Words1 = Book1.Split('-');
string[] Words2 = Book2.Split('-');

// load Book1
HashSet<string> uniqueRepeatedWords 
  = new HashSet<string>(Words1, StringComparer.CurrentCultureIgnoreCase); 

// keep only those in Book2 too.
uniqueRepeatedWords.IntersectWith(Words2);

Now the Find Positions loops, which can be made more memory-efficient, and mostly remove duplicate code to make things clearer:

// Notice I use a case-insensitive comparer instead of constant ToLower,
// and save the current word once to a local var instead of the 
// noisier and less clear array access every time.
// Additionally, instead of having the same line in both if and else,
// I just create the new List<int> if it doesn't exist, and 
// go back to the same incrementing code after.
var positionsInBook1 = 
   new Dictionary<string, List<int>>(StringComparer.CurrentCultureIgnoreCase);
for (int i = 0; i < Words1.Length; i++)
{
    var word = Words1[i];
    if (uniqueRepeatedWords.Contains(word))
    {
        if (!positionsInBook1 .ContainsKey(word))
        {
            positionsInBook1.Add(word, new List<int>());
        }
        positionsInBook1[word].Add(i);
    }
}

// Now do the same find-position code for Book2 - ideally, move
// the code to different method and call it twice, with different      
// parameters.    
var positionsInBook2 = FindWordPositions(Words2, uniqueRepeatedWords);

General notes

Your variable naming conventions are confusing. It's customary to name local variables in lowercase (words1, not Words1), and it's very confusing to have variables called Words1 and Word1, both of which being lists of words. I would name them more explicitly - allWordsInBook1, for instance. Similarly, Inf and Infs are almost identical and very confusion. posInBook1 and posInBook2 might be clearer.
I'd suggest not defining your variables at the top of the method, but closer to where you use them. In your code, when you start using Index1, for instance, you have to scroll back a page to remember what it is.
You're not actually incrementing Counter1 and Counter3 anywhere, are you? They're always identical to Inf and Infs respectively? In that case, they're just adding visual noise and cognitive weight. Start1/Start2 should be named counter1/counter2, since they're the ones that are incremented.
Again, every time you find yourself check for equality with ToLower, replace it with string1.Equals(string2, StringComparison.CurrentCultureIgnoreCase). When dealing with a huge number of strings inside nested loops, this can have a real effect on memory usage.
In your Main method, you're reading a file, using Regex to replace whitespace with a hyphen, then splitting on the hyphen inside the method. Why not save a stage (and a copy of the whole string) by splitting on the regex directly? Regex.Split can do it easily:

var wordsInBook1 = Regex.Split(Book1, @"\s+|-");

Never knew HashSet type, modified my code to use it and time was cut down almost twice. Seems like really good thing, I will read more about it. As for variables names, yes I know they are confusing I write like that because I'm changing my code, I will change them in the future once I get to final version of code. — Rog, 21 hours ago
@Rog I'm still adding bits to my answer, so check up for more changes. :) — Avner Shahar-Kashtan, 21 hours ago
Really thank you :) As for Regex.Split I honestly never thought of that -.- — Rog, 20 hours ago

Dmitry · Answer 2 · 2015-11-21 01:46:41Z

The only approach I can suggest is to use an LCP array.
Here is my implementation.

Let's create a class that will hold a result:

public sealed class LongestCommonPhraseInfo
{
    public readonly string[] CommonPhraseWords;
    public readonly int WordIndex1;
    public readonly int WordIndex2;

    public LongestCommonPhraseInfo(string[] commonPhraseWords, int wordIndex1, int wordIndex2)
    {
        CommonPhraseWords = commonPhraseWords;
        WordIndex1 = wordIndex1;
        WordIndex2 = wordIndex2;
    }

    public string Phrase
    {
        get { return String.Join(" ", CommonPhraseWords); }
    }
}

The main class that creates a suffix array and the corresponding LCP array. It contains a private nested class Suffix and a public method GetLongestCommonPhrase:

public sealed class SuffixArray
{
    private sealed class Suffix : IComparable<Suffix>
    {
        public static string[] Words { get; internal set; }
        public static int TextMargin { get; internal set; }

        public readonly int StartIndex;

        public Suffix(int startIndex)
        {
            StartIndex = startIndex;
        }

        public int CompareTo(Suffix other)
        {
            if (ReferenceEquals(this, other))
                return 0;

            int i = 0;
            while (true)
            {
                int cmp = String.CompareOrdinal(Words[i + StartIndex], Words[i + other.StartIndex]);
                if (cmp != 0)
                    return cmp;
                i++;
            }
        }

        private static bool Between<T>(T x, T a, T b)
            where T : IComparable<T>
        {
            return a.CompareTo(b) <= 0
                       ? x.CompareTo(a) >= 0 && x.CompareTo(b) <= 0
                       : x.CompareTo(a) <= 0 && x.CompareTo(b) >= 0;
        }

        public int Lcp(Suffix other)
        {
            if (!Between(TextMargin, StartIndex, other.StartIndex))
                return 0;

            int i = 0;
            while (true)
            {
                int cmp = String.CompareOrdinal(Words[i + StartIndex], Words[i + other.StartIndex]);
                if (cmp != 0)
                    return i;
                i++;
            }
        }
    }

    private const string Sentinel1 = "\x00";
    private const string Sentinel2 = "\x01";
    private readonly Suffix[] Suffixes;
    private readonly string[] Words;
    private readonly int TextMargin;
    private readonly int[] Lcps;
    private readonly int MaxLcp;
    private readonly int MaxLcpIndex;

    public SuffixArray(string[] words1, string[] words2)
    {
        // Concat the words arrays
        Words = new string[words1.Length + words2.Length + 2];
        words1.CopyTo(Words, 0);
        words2.CopyTo(Words, words1.Length + 1);
        Words[words1.Length] = Sentinel1;
        Words[Words.Length - 1] = Sentinel2;

        TextMargin = words1.Length;

        // Set up static properties of `Suffix` class for the current task
        // (Very bad approach)
        Suffix.Words = Words;
        Suffix.TextMargin = TextMargin;

        // Create a suffix array
        Suffixes = new Suffix[Words.Length - 1];
        for (int i = 0; i < Suffixes.Length; i++)
        {
            Suffixes[i] = new Suffix(i);
        }
        Array.Sort(Suffixes);

        // Create an Lcp array and find the maxLcp
        Lcps = new int[Suffixes.Length];
        Lcps[0] = -1;
        int maxLcp = -1;
        int maxLcpIndex = -1;
        for (int i = 1; i < Suffixes.Length; i++)
        {
            int lcp = Suffixes[i].Lcp(Suffixes[i - 1]);
            Lcps[i] = lcp;
            if (lcp > maxLcp)
            {
                maxLcp = lcp;
                maxLcpIndex = i;
            }
        }

        MaxLcp = maxLcp;
        MaxLcpIndex = maxLcpIndex;
    }

    public LongestCommonPhraseInfo GetLongestCommonPhrase()
    {
        // Set up static properties of `Suffix` class for the current task
        // (Very bad approach)
        Suffix.Words = Words;
        Suffix.TextMargin = TextMargin;

        Suffix words1;
        Suffix words2;

        if (Suffixes[MaxLcpIndex].StartIndex < Suffix.TextMargin)
        {
            words1 = Suffixes[MaxLcpIndex];
            words2 = Suffixes[MaxLcpIndex - 1];
        }
        else
        {                                    
            words1 = Suffixes[MaxLcpIndex - 1];
            words2 = Suffixes[MaxLcpIndex];
        }

        string[] tmp = new string[MaxLcp];
        Array.Copy(Suffix.Words, words1.StartIndex, tmp, 0, tmp.Length);

        return new LongestCommonPhraseInfo(tmp, words1.StartIndex, words2.StartIndex);
    }
}

Usage sample:

var book1 = File.ReadAllText("50503-0.txt").ToLower();
var book2 = File.ReadAllText("50511-0.txt").ToLower();

string[] textWords1 = Regex.Split(book1, @"\s+|-+", RegexOptions.Singleline);
string[] textWords2 = Regex.Split(book2, @"\s+|-+", RegexOptions.Singleline);

Stopwatch sw = new Stopwatch();
sw.Start();
var suffixArray = new SuffixArray(textWords1, textWords2);
var longestPhrase = suffixArray.GetLongestCommonPhrase();
sw.Stop();

Console.WriteLine("Common Phrase Words Count: {0}", longestPhrase.CommonPhraseWords.Length);
Console.WriteLine("Common Phrase: \"{0}...\"", longestPhrase.Phrase.Substring(0, 64));
Console.WriteLine("Time elapsed (ms): {0}", sw.ElapsedMilliseconds);

Output (on my PC):

Common Phrase Words Count: 2999
Common Phrase: "updated editions will replace the previous one the old editions ..."
Time elapsed (ms): 2202

asked	today
viewed	124 times
active	today

current community

your communities

more stack exchange communities

Finding longest phrase algorithm

2 Answers 2

Your Answer

Not the answer you're looking for? Browse other questions tagged c# algorithm strings time-limit-exceeded or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Finding longest phrase algorithm

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged c# algorithm strings time-limit-exceeded or ask your own question.

Related

Hot Network Questions