
I have a text file with 100,000 pairs: a word and its frequency.

aba 20
baba 30
...

I parse this file and put the words in

Dictionary<string,double> dictionary;

And I want to execute some search-and-ordering logic in the following code:

Parallel.For(0, 15000, body =>
{
    tempInputWord = Console.ReadLine();//from file source

    var adviceWords = dictionary
                .Where(p => p.Key.StartsWith(tempInputWord, StringComparison.Ordinal))
                .OrderByDescending(ks => ks.Value)
                .ThenBy(ks => ks.Key)
                .Select(s => s.Key)
                .Take(10).ToList();

    //some output
});

The problem: this code must run in less than 10 seconds.

On my computer (Core i5 2400, 8 GB RAM) it takes about 37 seconds with Parallel.For() and about 126 seconds without it.

Can you give me some advice on how to increase performance?

UPDATE 1: I changed the code following @CodesInChaos's answer and some other advice:

for (int i = 0; i < userWordQuantity; i++)
{
    var searchWord =//take data from file(or other sources)

    int startIndex = kvList.BinarySearch(new KeyValuePair<string, int>(searchWord, 0), comparer);
    if (startIndex < 0)
        startIndex = ~startIndex;
    //32sec/19sec(parallel)
    var adviceWords = kvList
        //.AsParallel()
                .Skip(startIndex)
                .TakeWhile(p => p.Key.StartsWith(searchWord, StringComparison.Ordinal))
                .OrderByDescending(ks => ks.Value)
                .ThenBy(ks => ks.Key, StringComparer.Ordinal)//add ordinal
                .Select(s => s.Key)
                .Take(10)
                .ToList();

    //91sec
    //var adviceWords = dictionary
    //            .Where(p => p.Key.StartsWith(searchWord, StringComparison.Ordinal))
    //            .OrderByDescending(ks => ks.Value)
    //            .ThenBy(ks => ks.Key,StringComparer.Ordinal)//add ordinal
    //            .Take(10)
    //            .ToList();
    ...
}
public class MyComparer : Comparer<KeyValuePair<string, int>>
{
    public override int Compare(KeyValuePair<string, int> x, KeyValuePair<string, int> y)
    {
        return string.CompareOrdinal(x.Key, y.Key); //ordinal, to match the list's sort order
    }
}

Now, in the best case, I have:

32 sec (19 sec with Parallel)

UPDATE 2: I tried using a Trie (library from here: https://trienet.codeplex.com/)

Simplified, with only the search, without the "OrderBy().ThenBy()":

var trie = new Trie<KeyValuePair<string, int>>();

foreach (var item in dictionary)
{
    trie.Add(item.Key, new KeyValuePair<string,int>(item.Key,item.Value));
}

for (int i = 0; i < userWordQuantity; i++)
{
    var searchWord =//take data from file(or other sources)

    var adviceWords = trie
        .Retrieve(searchWord)
        .ToList();
}

60 sec

If I use the previous variant (simplified):

var kvList = dictionary.OrderBy(ks => ks.Key,StringComparer.Ordinal).ToList();
var comparer = new MyComparer();
for (int i = 0; i < userWordQuantity; i++)
{
    var searchWord =//take data from file(or other sources)

    int startIndex = kvList.BinarySearch(new KeyValuePair<string, int>(searchWord, 0), comparer);
    if (startIndex < 0)
        startIndex = ~startIndex;

    var test1 = kvList
        .Skip(startIndex)
        .TakeWhile(p => p.Key.StartsWith(searchWord, StringComparison.Ordinal))
        .ToList();
}

15.5 sec

Comments:

How do you do Parallel "Console.Readline" 15000 times in 37 seconds? I sure can't type that fast. – Ira Baxter 14 hours ago

What about a compiled regular expression? Then just iterate the matches. – Adam Houldsworth 14 hours ago

I can't see what that for (parallel or not) is for. Do you just want to repeat the same search many times? – Adriano Repetti 14 hours ago

... assuming you are really reading from a file instead of getting console input, you may be serializing on access to the file. – Ira Baxter 14 hours ago

@chromigo A complete front-to-back example will help a lot here... – Adam Houldsworth 14 hours ago

6 Answers

You are not using any algorithmic strength of the dictionary at all. Ideally, you'd use a tree structure so that you can perform prefix lookups. On the other hand, you are within 3.7x of your performance goal, so I think you can get there just by optimizing the constant factor of your algorithm.

  1. Don't use LINQ in perf-critical code. Manually loop over all collections and collect results into a List<T>. That turns out to give a major speed-up in practice.
  2. Don't use a dictionary at all. Just use a KeyValuePair<T1, T2>[] and run through it using a foreach loop. This is the fastest possible way to traverse a set of pairs.

Could look like this:

KeyValuePair<T1, T2>[] items;
List<KeyValuePair<T1, T2>> matches = new ...(); //Consider pre-sizing this.

//This could be a parallel loop as well.
//Make sure not to synchronize too much on matches.
//If there tend to be few matches, a lock will be fine.
foreach (var item in items)
{
    if (IsMatch(item))
    {
        matches.Add(item);
    }
}

matches.Sort(...); //Sort in-place

return matches.Take(10); //Maybe matches.RemoveRange(10, matches.Count - 10) is better

That should exceed a 3.7x speedup.

If you need more, try stuffing the items into a dictionary keyed on the first char of Key. That way you can look up all items matching tempInputWord[0]. That should reduce search times by the selectivity of the first char of tempInputWord; for English text, that would be on the order of 26 or 52. This is a primitive form of prefix lookup with a single level of lookup. Not pretty, but maybe it is enough.
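A minimal sketch of that one-level bucketing; the class and method names here are my own, not from the answer, and it assumes a non-empty prefix:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class FirstCharIndex
{
    // One bucket per first character of the key.
    readonly Dictionary<char, List<KeyValuePair<string, int>>> buckets;

    public FirstCharIndex(IEnumerable<KeyValuePair<string, int>> items)
    {
        buckets = items.GroupBy(p => p.Key[0])
                       .ToDictionary(g => g.Key, g => g.ToList());
    }

    // Assumes prefix is non-empty. Only the bucket for prefix[0]
    // is scanned, not the whole word list.
    public List<string> Suggest(string prefix, int count)
    {
        List<KeyValuePair<string, int>> bucket;
        if (!buckets.TryGetValue(prefix[0], out bucket))
            return new List<string>();

        return bucket
            .Where(p => p.Key.StartsWith(prefix, StringComparison.Ordinal))
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal)
            .Select(p => p.Key)
            .Take(count)
            .ToList();
    }
}
```

With 26 to 52 buckets for English text, each query scans roughly 1/26th of the data compared to the flat loop.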

It is even easier. He is RESORTING the keys in the dictionary. On every run. Twice (order by then by). No idea why he even considered choosing a dictionary - totally unsuitable selection. – TomTom 14 hours ago

He's only sorting the matches. Hopefully, there are only a few. – usr 14 hours ago

"Don't use LINQ in perf-critical code. Manually loop over all collections and collect results into a List<T>. That turns out to give a major speed-up in practice." - do you have anything to back this up? That sounds very dubious. – insta 7 hours ago

I think the best way would be to use a Trie data structure instead of a dictionary. A trie stores all the words in a tree, where a node can represent all the words that start with the same letters. So if you look up your search word tempInputWord in a trie, you get a node that represents all the words starting with tempInputWord, and you only have to traverse its child nodes - a single search operation. The Wikipedia article on tries also mentions some other advantages over hash tables (which is essentially what a Dictionary is):

  • Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
  • There are no collisions of different keys in a trie.
  • Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
  • There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
  • A trie can provide an alphabetical ordering of the entries by key.

And here are some ideas for creating a trie in C#.

This should at least speed up the lookup; building the trie, however, might be slower.
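For illustration, a minimal trie with prefix retrieval might look like this. This is my own sketch, not the TrieNet implementation the question later uses; all names are invented:

```csharp
using System;
using System.Collections.Generic;

// Minimal trie: each node maps a character to a child node;
// nodes where a word ends carry that word's frequency.
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public int? Frequency; // set only where a word ends

    public void Add(string word, int frequency)
    {
        var node = this;
        foreach (char c in word)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.Frequency = frequency;
    }

    // Walk down to the node for `prefix`, then enumerate its subtree.
    public IEnumerable<KeyValuePair<string, int>> Retrieve(string prefix)
    {
        var node = this;
        foreach (char c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                yield break; // no word has this prefix
        foreach (var pair in node.Enumerate(prefix))
            yield return pair;
    }

    IEnumerable<KeyValuePair<string, int>> Enumerate(string prefix)
    {
        if (Frequency.HasValue)
            yield return new KeyValuePair<string, int>(prefix, Frequency.Value);
        foreach (var kv in Children)
            foreach (var pair in kv.Value.Enumerate(prefix + kv.Key))
                yield return pair;
    }
}
```

The lookup walks at most prefix.Length nodes, so its cost is independent of the dictionary size; only enumerating the matching subtree depends on how many words share the prefix.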

Update: OK, I tested it myself using a file with frequencies of English words in the same format as yours. This is my code, which uses the Trie class that you also tried:

    static void Main(string[] args)
    {
        Stopwatch sw = new Stopwatch();

        sw.Start();
        var trie = new Trie<KeyValuePair<string,int>>();

        //build trie with your value pairs
        var lines = File.ReadLines("en.txt");
        foreach(var line in lines.Take(100000))
        {
            var split = line.Split(' ');
            trie.Add(split[0], new KeyValuePair<string,int>(split[0], int.Parse(split[1])));
        }

        Console.WriteLine("Time needed to read file and build Trie with 100000 words: " + sw.Elapsed);

        sw.Reset();

        //test with 10000 search words
        sw.Start();
        foreach (string line in lines.Take(10000))
        {
            var searchWord = line.Split(' ')[0];
            var allPairs = trie.Retrieve(searchWord);
            var bestWords = allPairs.OrderByDescending(kv => kv.Value).ThenBy(kv => kv.Key).Select(kv => kv.Key).Take(10);

            var output = bestWords.Aggregate("", (s1, s2) => s1 + ", " + s2);
            Console.WriteLine(output);

        }

        Console.WriteLine("Time to process 10000 different searchWords: " + sw.Elapsed);
    }

My results on a pretty similar machine:

Time needed to read file and build Trie with 100000 words: 00:00:00.7397839
Time to process 10000 different searchWords: 00:00:03.0181700

So I think you are doing something wrong that we cannot see - for example, the way you measure the time or the way you read the file. As my results show, this should be really fast. The 3 seconds are mainly due to the Console output in the loop, which I needed so that the bestWords variable is used; otherwise the variable would have been optimized away.

Nice solution, I will try to implement it and write about the result later. – chromigo 13 hours ago

I tried this variant, but failed. Maybe I did something wrong. I updated my post (update 2, details there). – chromigo 4 hours ago

I need some clarification: which operations do you measure the time of? Is it just the lookup of the best 10 words, or do you also include the creation of the Dictionary/Trie? As I said, the creation of the Trie should be slower than the creation of the dictionary. With a Trie you shouldn't need the Dictionary anymore, but I still see it in your code of Update 2. – T_D 4 hours ago

I updated my post with an example implementation. It works pretty fast for me. Hope it helps. – T_D 2 hours ago

If you want profiling, separate the reading of the file and measure how long that takes on its own. Data calculation, collection, and presentation could be further separate steps.

If you want concurrency AND a dictionary, look at ConcurrentDictionary, maybe even more for reliability than for performance, but probably for both:

http://msdn.microsoft.com/en-us/library/dd287191(v=vs.110).aspx
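A small sketch of what that could look like when loading the word list with parallel writers. The helper name and the inline sample lines are mine, not from the answer:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ConcurrentLoad
{
    // Build the word -> frequency map with concurrent writers.
    // ConcurrentDictionary tolerates this; a plain Dictionary does not.
    public static ConcurrentDictionary<string, int> Load(string[] lines)
    {
        var dict = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(lines, line =>
        {
            var parts = line.Split(' ');
            dict.TryAdd(parts[0], int.Parse(parts[1]));
        });
        return dict;
    }

    static void Main()
    {
        var dict = Load(new[] { "aba 20", "baba 30" });
        int freq;
        Console.WriteLine(dict.TryGetValue("aba", out freq) ? freq : -1); // 20
    }
}
```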

He's only doing reads of the dictionary in parallel. If it isn't implemented stupidly, that's likely to be safe, and he doesn't need to pay the overhead of concurrent dictionary internal locks. –  Ira Baxter 14 hours ago
  • Replace the dictionary with a List<KeyValuePair<string, decimal>>, sorted by key.

    For the search I rely on the fact that, with ordinal comparisons, a prefix sorts directly before the strings that start with it. So I can use a binary search to find the first candidate, and since the candidates are contiguous, I can replace Where with TakeWhile.

    int startIndex = dictionary.BinarySearch(searchWord, comparer);
    if(startIndex < 0)
        startIndex = ~startIndex;
    
    var adviceWords = dictionary
                .TakeWhile(p => p.Key.StartsWith(searchWord, StringComparison.Ordinal))
                .OrderByDescending(ks => ks.Value)
                .ThenBy(ks => ks.Key)
                .Select(s => s.Key)
                .Take(10).ToList();
    
  • Make sure to use ordinal comparison for all operations, including the initial sort, the binary search and the StartsWith check.

  • I would call Console.ReadLine outside the parallel loop. Probably using AsParallel().Select(...) on the collection of search words instead of Parallel.For.
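That last point could be sketched roughly like this. Reading the input up front and the Lookup stand-in are my own assumptions, not part of the answer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ParallelSearch
{
    // Stand-in for the real per-word lookup; only the
    // parallelization pattern matters in this sketch.
    static string Lookup(string word)
    {
        return word.ToUpperInvariant();
    }

    public static List<string> Run(IEnumerable<string> searchWords)
    {
        // All input is read on one thread first; only the
        // CPU-bound lookups run in parallel, in input order.
        return searchWords
            .AsParallel()
            .AsOrdered()
            .Select(Lookup)
            .ToList();
    }

    static void Main()
    {
        // searchWords would come from the file, read up front.
        var results = Run(new[] { "aba", "baba", "ab" });
        Console.WriteLine(string.Join(",", results)); // ABA,BABA,AB
    }
}
```

This avoids mixing the serialized I/O (Console.ReadLine or file reads) into the parallel region, which is one of the problems the comments point out.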
Good advice, but I think you skipped a ".Skip(startIndex)" before the ".TakeWhile(...)". – chromigo 7 hours ago

I tried your solution and wrote the result at the top of the post. – chromigo 6 hours ago

Assuming the 10 is constant, why is everyone storing the entire result set? Memory is not free. The fastest solution is to store the first 10 entries in a list and sort it. Then maintain that sorted 10-element list as you traverse the rest of the data set, removing the 11th element every time you insert one.

The above method works best for small counts. If you had to take the first 5000 entries, consider using a binary heap instead of a list.
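A sketch of that bounded top-k pass. The names are mine, and it assumes the same ordering as the question: higher frequency first, ties broken by key ascending:

```csharp
using System;
using System.Collections.Generic;

static class TopK
{
    // One pass over the data, keeping only the k best pairs at any time.
    public static List<KeyValuePair<string, int>> Take(
        IEnumerable<KeyValuePair<string, int>> source, int k)
    {
        // Higher frequency first, then key ascending (ordinal).
        var comparer = Comparer<KeyValuePair<string, int>>.Create((a, b) =>
        {
            int byValue = b.Value.CompareTo(a.Value);
            return byValue != 0 ? byValue : string.CompareOrdinal(a.Key, b.Key);
        });

        var best = new List<KeyValuePair<string, int>>(k + 1);
        foreach (var item in source)
        {
            int pos = best.BinarySearch(item, comparer);
            if (pos < 0) pos = ~pos;
            if (pos < k) // otherwise it cannot make the top k
            {
                best.Insert(pos, item);
                if (best.Count > k)
                    best.RemoveAt(k); // drop the (k+1)-th element
            }
        }
        return best;
    }
}
```

This never sorts more than k + 1 elements, so the full OrderByDescending().ThenBy() over all matches disappears entirely.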


Assuming all your keys are longer than n characters, you can split the problem into two parts: a super fast dictionary-based path for inputs longer than n characters, and a slow iterative path for the other inputs. Here, the key-length constraint is 3:

var sorted = dictionary.OrderByDescending(x => x.Value).ThenBy(x => x.Key).Select(x => x.Key).ToList();
var dic = sorted.GroupBy(x => x.Substring(0, 3)).ToDictionary(x => x.Key, x => x.Select(y => y));
List<string> adviceWords = null;
foreach (var i in Enumerable.Range(0, 15000))
{
    var str = Console.ReadLine();
    if (str.Length >= 3)
    {
        var l = dic.ContainsKey(str.Substring(0, 3)) ? dic[str.Substring(0, 3)] : null;
        adviceWords = l == null ? null : l.Where(x => x.StartsWith(str)).Take(10).ToList();
    }
    else
    {
        adviceWords = sorted.Where(x => x.StartsWith(str)).Take(10).ToList();
    }
}
