Looping through an array of strings containing approximately 3 million elements

Question

I have a function that loops through an array of strings containing approximately 3 million lines of strings. My problem is this loop is running too slowly and I want to know how to make it faster.

string[] words = { "Hello", "world", "!", "My", "name", "is", "Something", "."};
    List<MyClass> result = new List<MyClass>();

    for (int i = 0; i < words.Length; i++)
    {
        if (words[i] == "&&&")
        { }
        else if (words.Length > i + 2 && FileContains(words[i] + " " + words[i + 1] + " " + words[i + 2]))
        {
            result.Add(new MyClass(words[i] + " " + words[i + 1] + " " + words[i + 2]));
            words[i + 1] = "&&&";
            words[i + 2] = "&&&";
        }
        else if (words.Length > i + 1 && FileContains(words[i] + " " + words[i + 1]))
        {
            result.Add(new MyClass(words[i] + " " + words[i + 1]));
            words[i + 1] = "&&&";
        }
        else
            result.Add(new MyClass(words[i], Parent));

    }

And my FileContains function which loops through the array of strings:

public static bool FileContains(string Text)
{
    foreach (string line in ARRAYOFWORDS)  // Approximately 3 million strings
    {
        if (Text.ToLower() == line.Substring(0, line.LastIndexOf('|')))
        {
            return true;
        }
    }
    return false;
}

Ryan Mann · Accepted Answer · 2014-10-07 18:36:28Z

What is this for? It looks like a string search function. And you are storing the found results in an output.

As such, I recommend not using an array at all. You should make use of string.IndexOf, which uses Boyer Moyer Search Algorithm under the hood, very fast.

Instead just make one BIG string with line breaks, for example...

string stringToSearch = "some line here\nanother line here\nand another line here\netc etc etc";

int matchIndex = 0;
while (matchIndex > -1)
{
    matchIndex = stringToSearch.IndexOf("string to search for");
    if (matchIndex > -1)
    {
        //use string functions to get matched word
    }
}

Something like that for search a string, would be astronomically faster.

You can load a file into a string in one move. The TextReader object in the System.IO name space has a method called ReadToEnd which returns the entire file as a string with lines separated by new line characters "\n".

And, if you are dealing with a binary file, you can still search it string.IndexOf. Just convert the binary file into a string with ASCIIEncoding.GetString(someByteArray), then convert your search for bytes to a string the same way. Then search for the search bytes string in the converted file string.

Jon Skeet · Accepted Answer · 2014-08-06 15:29:29Z

Well you can optimize FileContains very simply - stop calling Text.ToLower() on every iteration!

In fact, at that point it's really simple to rewrite. Here's the code for the original question, which didn't have the Substring part:

public static bool FileContains(string text)
{
    // Variable name changed to save my eyes from the horror.
    return lines.Contains(text.ToLower());
}

If you store the lines in a HashSet<string> instead of an array or a list, it will be blazingly fast.

With the Substring part, I'd create a new HashSet<string> with just those substrings, assuming you want to look up more than once:

HashSet<string> lineStarts;

...

lineStarts = new HashSet<string>(lines.Select(x => x.Substring(0, x.IndexOf('|'));

Then use lineStarts within FileContains.

Note that using ToLower is a fairly crude approach to case sensitivity. You should think carefully about what this needs to do in terms of culture-sensitivity etc. See the relevant MSDN page for lots more information. (Ain't culture fun?)

dbc · Accepted Answer · 2014-08-07 10:17:29Z

I can see a few issues here:

You are searching the array of lines for a string, then searching it again for a substring, on each iteration. This is unnecessary. Search for lines beginning with the substring, and for each, see if they match match either the longer or shorter string completely.
Since c# strings are immutable, any string transformation such as ToLower() or SubString() necessarily makes a transformed copy of the string. In tight inner loops this can kill performance, so don't do this.
Consider using a HashSet<string> for your array of text. You will need to add the substring before the last '|' for each line. However, it's possible that building the HashSet will take longer than the searches, since you don't actually run many searches.
Consider parallelizing the searches -- though this is a little tricky, since a match on a preceding loop changes the logic on later loops (by writing the "&&&" characters into the word array).
Choose the right StringComparison for your needs.
line.Substring(0, line.LastIndexOf('|')) will throw an exception if the '|' character is not found. Do you want this?
You lowercase the search strings but not the file lines, so if the file lines have any uppercase letters the searches will fail. Is this desired?

Here's a prototype fix for 1, 2, 5 and 6. It's a little gross but should be faster. I also renamed your string array to FileLines:

    string[] words = { "Hello", "world", "!", "My", "name", "is", "Something", "." };
    List<MyClass> result = new List<MyClass>();
    StringComparison comparisonType = StringComparison.CurrentCulture; // Choose wisely.

    for (int i = 0; i < words.Length; i++)
    {
        if (words[i] == null)
            continue;
        var words3 = (words.Length > i + 2 ? words[i] + " " + words[i + 1] + " " + words[i + 2] : null);
        var words2 = (words.Length > i + 1 ? words[i] + " " + words[i + 1] : null);
        var lower3 = (words3 != null ? words3.ToLower() : null);
        var lower2 = (words2 != null ? words2.ToLower() : null);
        bool add3 = false;
        bool add2 = false;

        if (lower2 != null)
        {
            var length2 = lower2.Length;
            var length3 = (lower3 != null ? lower3.Length : -1);
            foreach (var possibleMatch in FileLines.Where(s => s.StartsWith(lower2, comparisonType)))
            {
                if (lower3 != null && possibleMatch.StartsWith(lower3, comparisonType) && MatchesLength(possibleMatch, length3))
                {
                    add3 = true;
                    break;
                }
                else if (!add2 && MatchesLength(possibleMatch, length2))
                {
                    add2 = true;
                    if (lower3 == null)
                        break;
                    // Otherwise do not break!  Continue to look for 3-word matches
                }
            }
        }

        if (add3)
        {
            result.Add(new MyClass(words3));
            words[i + 1] = null;
            words[i + 2] = null;
        }
        else if (add2)
        {
            result.Add(new MyClass(words2));
            words[i + 1] = null;
        }
        else
        {
            result.Add(new MyClass(words[i], Parent));
        }
    }

And then

    static bool MatchesLength(string possibleMatch, int length) 
    {
        // Ignore anything after the last '|', if present
        return (possibleMatch.Length == length || possibleMatch.LastIndexOf('|') == length);
    }

Major · Accepted Answer · 2014-08-06 15:41:16Z

3

Use parallel LINQ, with string equals:

public static bool FileContains(string Text)
{
    return ARRAYOFWORDS
    .AsParallel()
    .Any(x => string.Equals(x.Substring(0, line.LastIndexOf('|')), Text,
    StringComparison.InvariantCultureIgnoreCase));
}

answered Aug 6, 2014 at 15:41

Major

1313 bronze badges

Add a comment |

Jamal · Accepted Answer · 2014-11-28 15:42:21Z

A few optimisation that can be done:

This:

if (words[i] == "&&&"){ }

Since you ignore &&& and otherwise set the value to &&& in the else part, you can increment i to i+2/1 (as per the else if block) to skip some iterations.

Just incrementing would save you the effort of setting the word to &&& i.e

if (words[i] == "&&&")
    { }
    else if (words.Length > i + 2 && FileContains(words[i] + " " + words[i + 1] + " " + words[i + 2]))
    {
        result.Add(new MyClass(words[i] + " " + words[i + 1] + " " + words[i + 2]));
        i=i+2;
    }

Extracting Text.ToLower() before the loop.
Compare the string length before string comparison.

wbennett · Accepted Answer · 2014-08-06 15:36:34Z

2

You could build a hashset of your word collection and take advantage of a constant time look up to see if the word exists.

//this is built somewhere else
//if a separate hashing function is needed due to heavy collisions
//initialize with an implementation of IEqualityComparer that performs the hash and comparison
static HashSet<string> words = new HashSet<string>();

public static void AddWord(string word)
{
    words.Add(word);
}

public static bool FileContains(string word)
{
    return words.Contains(word);
}

Update: modified to use hashset over hashtable (pointed out by Brian), cleaned up comments.

answered Aug 6, 2014 at 15:36

wbennett

1212 bronze badges

\$\begingroup\$ Using a HashSet<int> is a bad idea, since duplicate hashes will trigger false positives in FileContains. Instead, use HashSet<string> words = new HashSet<string>(MyStringIEnumerable,StringComparer.InvariantCultureIgnoreCase);. If you want a custom hash function, pass a custom IEqualityComparer<string> (but StringComparer.InvariantCultureIgnoreCase is fine IMO). This also negates the need for ToLower. \$\endgroup\$
– Brian
Commented Aug 6, 2014 at 16:13
\$\begingroup\$ @Brian Thank you for your input. I cleaned it up too quickly without looking over it too much. Originally, I was trying to point out the fundamental idea of an average O(1) look-up instead of an O(n) scan he originally carried out. \$\endgroup\$
– wbennett
Commented Aug 6, 2014 at 16:22
\$\begingroup\$ I removed the contains check from AddWord. See the documentation: Add can safely be called on pre-existing items (in the case that the item already exists, nothing is Added and Add returns false). \$\endgroup\$
– Brian
Commented Aug 6, 2014 at 18:04

Add a comment |

Weyland YutaniWeyland Yutani · Accepted Answer · 2014-08-06 15:34:09Z

0

Divide your array into smaller arrays and run the method against each array on different threads.

Also add speed stripes at the start of your methods to make them go faster

public static bool FileContains(string Text)
{
    /// <--Speed stripes
    foreach (string line in ARRAYOFWORDS)  // Approximately 3 million strings
    {
        if (Text.Equals(line, StringComparison.CurrentCultureIgnoreCase))
        {
            return true;
        }
    }
    return false;
}

answered Aug 6, 2014 at 15:34

Weyland Yutani

Add a comment |

Stack Exchange Network

Looping through an array of strings containing approximately 3 million elements

7 Answers 7

Your Answer

Hot Network Questions

Looping through an array of strings containing approximately 3 million elements

7 Answers 7

Your Answer

Sign up or log in

Post as a guest

Related

Hot Network Questions