8
\$\begingroup\$

I have a function that loops through an array of strings containing approximately 3 million lines of strings. My problem is this loop is running too slowly and I want to know how to make it faster.

string[] words = { "Hello", "world", "!", "My", "name", "is", "Something", "."};
    List<MyClass> result = new List<MyClass>();

    for (int i = 0; i < words.Length; i++)
    {
        if (words[i] == "&&&")
        { }
        else if (words.Length > i + 2 && FileContains(words[i] + " " + words[i + 1] + " " + words[i + 2]))
        {
            result.Add(new MyClass(words[i] + " " + words[i + 1] + " " + words[i + 2]));
            words[i + 1] = "&&&";
            words[i + 2] = "&&&";
        }
        else if (words.Length > i + 1 && FileContains(words[i] + " " + words[i + 1]))
        {
            result.Add(new MyClass(words[i] + " " + words[i + 1]));
            words[i + 1] = "&&&";
        }
        else
            result.Add(new MyClass(words[i], Parent));

    }

And my FileContains function which loops through the array of strings:

public static bool FileContains(string Text)
{
    foreach (string line in ARRAYOFWORDS)  // Approximately 3 million strings
    {
        if (Text.ToLower() == line.Substring(0, line.LastIndexOf('|')))
        {
            return true;
        }
    }
    return false;
}
\$\endgroup\$
0

7 Answers 7

12
\$\begingroup\$

What is this for? It looks like a string search function. And you are storing the found results in an output.

As such, I recommend not using an array at all. You should make use of string.IndexOf, which uses Boyer Moyer Search Algorithm under the hood, very fast.

Instead just make one BIG string with line breaks, for example...

string stringToSearch = "some line here\nanother line here\nand another line here\netc etc etc";

int matchIndex = 0;
while (matchIndex > -1)
{
    matchIndex = stringToSearch.IndexOf("string to search for");
    if (matchIndex > -1)
    {
        //use string functions to get matched word
    }
}

Something like that for search a string, would be astronomically faster.

You can load a file into a string in one move. The TextReader object in the System.IO name space has a method called ReadToEnd which returns the entire file as a string with lines separated by new line characters "\n".

And, if you are dealing with a binary file, you can still search it string.IndexOf. Just convert the binary file into a string with ASCIIEncoding.GetString(someByteArray), then convert your search for bytes to a string the same way. Then search for the search bytes string in the converted file string.

\$\endgroup\$
0
21
\$\begingroup\$

Well you can optimize FileContains very simply - stop calling Text.ToLower() on every iteration!

In fact, at that point it's really simple to rewrite. Here's the code for the original question, which didn't have the Substring part:

public static bool FileContains(string text)
{
    // Variable name changed to save my eyes from the horror.
    return lines.Contains(text.ToLower());
}

If you store the lines in a HashSet<string> instead of an array or a list, it will be blazingly fast.

With the Substring part, I'd create a new HashSet<string> with just those substrings, assuming you want to look up more than once:

HashSet<string> lineStarts;

...

lineStarts = new HashSet<string>(lines.Select(x => x.Substring(0, x.IndexOf('|'));

Then use lineStarts within FileContains.

Note that using ToLower is a fairly crude approach to case sensitivity. You should think carefully about what this needs to do in terms of culture-sensitivity etc. See the relevant MSDN page for lots more information. (Ain't culture fun?)

\$\endgroup\$
0
6
\$\begingroup\$

I can see a few issues here:

  1. You are searching the array of lines for a string, then searching it again for a substring, on each iteration. This is unnecessary. Search for lines beginning with the substring, and for each, see if they match match either the longer or shorter string completely.

  2. Since c# strings are immutable, any string transformation such as ToLower() or SubString() necessarily makes a transformed copy of the string. In tight inner loops this can kill performance, so don't do this.

  3. Consider using a HashSet<string> for your array of text. You will need to add the substring before the last '|' for each line. However, it's possible that building the HashSet will take longer than the searches, since you don't actually run many searches.

  4. Consider parallelizing the searches -- though this is a little tricky, since a match on a preceding loop changes the logic on later loops (by writing the "&&&" characters into the word array).

  5. Choose the right StringComparison for your needs.

  6. line.Substring(0, line.LastIndexOf('|')) will throw an exception if the '|' character is not found. Do you want this?

  7. You lowercase the search strings but not the file lines, so if the file lines have any uppercase letters the searches will fail. Is this desired?

Here's a prototype fix for 1, 2, 5 and 6. It's a little gross but should be faster. I also renamed your string array to FileLines:

    string[] words = { "Hello", "world", "!", "My", "name", "is", "Something", "." };
    List<MyClass> result = new List<MyClass>();
    StringComparison comparisonType = StringComparison.CurrentCulture; // Choose wisely.

    for (int i = 0; i < words.Length; i++)
    {
        if (words[i] == null)
            continue;
        var words3 = (words.Length > i + 2 ? words[i] + " " + words[i + 1] + " " + words[i + 2] : null);
        var words2 = (words.Length > i + 1 ? words[i] + " " + words[i + 1] : null);
        var lower3 = (words3 != null ? words3.ToLower() : null);
        var lower2 = (words2 != null ? words2.ToLower() : null);
        bool add3 = false;
        bool add2 = false;

        if (lower2 != null)
        {
            var length2 = lower2.Length;
            var length3 = (lower3 != null ? lower3.Length : -1);
            foreach (var possibleMatch in FileLines.Where(s => s.StartsWith(lower2, comparisonType)))
            {
                if (lower3 != null && possibleMatch.StartsWith(lower3, comparisonType) && MatchesLength(possibleMatch, length3))
                {
                    add3 = true;
                    break;
                }
                else if (!add2 && MatchesLength(possibleMatch, length2))
                {
                    add2 = true;
                    if (lower3 == null)
                        break;
                    // Otherwise do not break!  Continue to look for 3-word matches
                }
            }
        }

        if (add3)
        {
            result.Add(new MyClass(words3));
            words[i + 1] = null;
            words[i + 2] = null;
        }
        else if (add2)
        {
            result.Add(new MyClass(words2));
            words[i + 1] = null;
        }
        else
        {
            result.Add(new MyClass(words[i], Parent));
        }
    }

And then

    static bool MatchesLength(string possibleMatch, int length) 
    {
        // Ignore anything after the last '|', if present
        return (possibleMatch.Length == length || possibleMatch.LastIndexOf('|') == length);
    }
\$\endgroup\$
3
\$\begingroup\$

Use parallel LINQ, with string equals:

public static bool FileContains(string Text)
{
    return ARRAYOFWORDS
    .AsParallel()
    .Any(x => string.Equals(x.Substring(0, line.LastIndexOf('|')), Text,
    StringComparison.InvariantCultureIgnoreCase));
}
\$\endgroup\$
3
\$\begingroup\$

A few optimisation that can be done:

  1. This:

    if (words[i] == "&&&"){ }
    

    Since you ignore &&& and otherwise set the value to &&& in the else part, you can increment i to i+2/1 (as per the else if block) to skip some iterations.

    Just incrementing would save you the effort of setting the word to &&& i.e

    if (words[i] == "&&&")
        { }
        else if (words.Length > i + 2 && FileContains(words[i] + " " + words[i + 1] + " " + words[i + 2]))
        {
            result.Add(new MyClass(words[i] + " " + words[i + 1] + " " + words[i + 2]));
            i=i+2;
        }
    
  2. Extracting Text.ToLower() before the loop.

  3. Compare the string length before string comparison.
\$\endgroup\$
0
2
\$\begingroup\$

You could build a hashset of your word collection and take advantage of a constant time look up to see if the word exists.

//this is built somewhere else
//if a separate hashing function is needed due to heavy collisions
//initialize with an implementation of IEqualityComparer that performs the hash and comparison
static HashSet<string> words = new HashSet<string>();

public static void AddWord(string word)
{
    words.Add(word);
}

public static bool FileContains(string word)
{
    return words.Contains(word);
}

Update: modified to use hashset over hashtable (pointed out by Brian), cleaned up comments.

\$\endgroup\$
3
  • \$\begingroup\$ Using a HashSet<int> is a bad idea, since duplicate hashes will trigger false positives in FileContains. Instead, use HashSet<string> words = new HashSet<string>(MyStringIEnumerable,StringComparer.InvariantCultureIgnoreCase);. If you want a custom hash function, pass a custom IEqualityComparer<string> (but StringComparer.InvariantCultureIgnoreCase is fine IMO). This also negates the need for ToLower. \$\endgroup\$
    – Brian
    Commented Aug 6, 2014 at 16:13
  • \$\begingroup\$ @Brian Thank you for your input. I cleaned it up too quickly without looking over it too much. Originally, I was trying to point out the fundamental idea of an average O(1) look-up instead of an O(n) scan he originally carried out. \$\endgroup\$
    – wbennett
    Commented Aug 6, 2014 at 16:22
  • \$\begingroup\$ I removed the contains check from AddWord. See the documentation: Add can safely be called on pre-existing items (in the case that the item already exists, nothing is Added and Add returns false). \$\endgroup\$
    – Brian
    Commented Aug 6, 2014 at 18:04
0
\$\begingroup\$

Divide your array into smaller arrays and run the method against each array on different threads.

Also add speed stripes at the start of your methods to make them go faster

public static bool FileContains(string Text)
{
    /// <--Speed stripes
    foreach (string line in ARRAYOFWORDS)  // Approximately 3 million strings
    {
        if (Text.Equals(line, StringComparison.CurrentCultureIgnoreCase))
        {
            return true;
        }
    }
    return false;
}
\$\endgroup\$
0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.