I have a list of seed strings. The list contains about 100 predefined strings. All strings contain only ASCII characters.

std::list<std::wstring> seeds{ L"google", L"yahoo", L"stackoverflow"};

My app constantly receives a lot of strings, which can contain any characters. I need to check each received string and decide whether it contains any of the seeds or not. The comparison must be case-insensitive.

I need the fastest possible algorithm to test each received string.

Right now my app uses this algorithm:

#include <boost/algorithm/string.hpp>
#include <string>

// Returns true if the tested string contains any of the seeds, ignoring case.
bool containsAnySeed(const std::wstring & testedStr)
{
    for (auto & seed : seeds)   // `seeds` is the list declared above
    {
        if (boost::icontains(testedStr, seed))
        {
            return true;
        }
    }
    return false;
}

It works well, but I'm not sure this is the most efficient way. How can the algorithm be implemented to achieve better performance?

P.S. This is a Windows app. The app receives valid std::wstring strings.

  • As a tiny improvement I'd suggest replacing std::list with an array (a plain array, std::array, or std::vector). It may improve performance a bit. Also, why do you have the L prefix on only one of the literals? – HolyBlackCat 5 hours ago
  • Do they contain a seed, or are they a seed? There's a difference. – David Haim 4 hours ago
  • @HolyBlackCat I am sorry for all the typos )) I placed L before the other two literals. – Victor Mezrin 4 hours ago
  • Have you tried std::search and found out it's not fast enough? – David Haim 4 hours ago
  • Hmmm... so simply use tolower on each string, or use a specific comparator; if it's ASCII only it shouldn't take a lot of time anyway. My point is to try to milk the standard library as much as possible; it is very fast. – David Haim 4 hours ago
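To make the std::search suggestion from the comments concrete, here is a minimal sketch of a case-insensitive contains check using the predicate overload of std::search (the helper name icontains and the lambda comparator are my own, not something the comments spell out):

#include <algorithm>
#include <cwctype>
#include <string>

// Case-insensitive "does text contain seed?" built on std::search.
bool icontains(const std::wstring& text, const std::wstring& seed)
{
    auto it = std::search(text.begin(), text.end(), seed.begin(), seed.end(),
                          [](wchar_t a, wchar_t b) {
                              return std::towlower(a) == std::towlower(b);
                          });
    return it != text.end();
}

You would still loop over the ~100 seeds and call this for each one, so it has the same overall shape as the boost::icontains version, just without the Boost dependency.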

If there is a finite number of matching strings, you can construct a tree such that, read from root to leaves, similar strings occupy similar branches.

This is also known as a trie.

For example, we might have the strings cat, coach, con, conch as well as dark, dad, dank, do. Their trie might look like this:

[Figure: a trie built from the words cat, coach, con, conch, dark, dad, dank, and do.]

A search for one of the words starts from the root of the tree. Reaching a leaf corresponds to a match with a seed. At each step, the next character of the string must match one of the current node's children; if it does not, you can terminate the search (e.g. you would not consider any words starting with "g", or any words beginning with "cu").

There are various algorithms for constructing the tree, searching it, and modifying it on the fly, but I thought I would give a conceptual overview of the solution instead of a specific one, since I don't know of the best algorithm for it.

Conceptually, searching the tree is related to the idea behind radix sort: at each position, a character can only take on a fixed set of values, and that set determines which branch to follow.
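The answer is intentionally conceptual, but a minimal sketch might look like the following. The TrieNode layout, the lowercasing, and the try-every-offset search are my assumptions, not something the answer prescribes; a plain trie only matches prefixes, so the search is restarted at every offset of the tested string.

#include <cstddef>
#include <cwctype>
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    std::map<wchar_t, std::unique_ptr<TrieNode>> children;
    bool terminal = false;                       // true if a seed ends here
};

void insert(TrieNode& root, const std::wstring& seed)
{
    TrieNode* node = &root;
    for (wchar_t c : seed) {
        auto& child = node->children[(wchar_t)std::towlower(c)];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->terminal = true;
}

bool containsAnySeed(const TrieNode& root, const std::wstring& text)
{
    for (std::size_t start = 0; start < text.size(); ++start) {
        const TrieNode* node = &root;
        for (std::size_t i = start; i < text.size(); ++i) {
            auto it = node->children.find((wchar_t)std::towlower(text[i]));
            if (it == node->children.end()) break;   // no matching child: give up on this offset
            node = it->second.get();
            if (node->terminal) return true;         // a whole seed was matched
        }
    }
    return false;
}

Restarting at every offset costs O(text length × longest seed) in the worst case; the Aho–Corasick answer below avoids that restart.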

  • Best answer to me :) It shows partially how to implement a regex, which is what the OP needs in the end (though he doesn't need the whole regex capabilities). – DarioOO 3 hours ago

You can use the Aho–Corasick algorithm.

It builds a trie/automaton in which some vertices are marked as terminal; reaching a terminal vertex means the string contains a seed.

It's built in O(sum of dictionary word lengths) and gives the answer in O(test string length).

Advantages:

  • It works specifically with several dictionary words, and the check time doesn't depend on the number of words (if we don't consider cases where the automaton doesn't fit in memory, etc.)
  • The algorithm is not hard to implement (compared to suffix structures, at least)

You can make it case-insensitive by lowercasing each character if it's ASCII (non-ASCII characters don't match anyway).
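For reference, a minimal Aho–Corasick sketch along these lines might look like the code below. The AhoCorasick struct, the restriction of the alphabet to the 26 ASCII letters, and the reset-to-root handling of other characters are assumptions of mine, suitable for letter-only seeds like the ones in the question.

#include <cwctype>
#include <queue>
#include <string>
#include <vector>

struct AhoCorasick {
    static const int ALPHA = 26;                 // 'a'..'z' only
    struct Node {
        int next[ALPHA];
        int fail = 0;
        bool terminal = false;
        Node() { for (int i = 0; i < ALPHA; ++i) next[i] = -1; }
    };
    std::vector<Node> nodes{Node{}};             // node 0 is the root

    static int code(wchar_t c) {
        wchar_t l = (wchar_t)std::towlower(c);
        return (l >= L'a' && l <= L'z') ? (int)(l - L'a') : -1;
    }

    void add(const std::wstring& seed) {         // insert one dictionary word
        int v = 0;
        for (wchar_t c : seed) {
            int k = code(c);
            if (k < 0) continue;                 // seeds are assumed to be ASCII letters
            if (nodes[v].next[k] == -1) {
                nodes[v].next[k] = (int)nodes.size();
                nodes.emplace_back();
            }
            v = nodes[v].next[k];
        }
        nodes[v].terminal = true;
    }

    void build() {                               // BFS that fills the failure links
        std::queue<int> q;
        for (int k = 0; k < ALPHA; ++k) {
            int u = nodes[0].next[k];
            if (u == -1) { nodes[0].next[k] = 0; continue; }
            nodes[u].fail = 0;
            q.push(u);
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            if (nodes[nodes[v].fail].terminal) nodes[v].terminal = true;
            for (int k = 0; k < ALPHA; ++k) {
                int u = nodes[v].next[k];
                if (u == -1) { nodes[v].next[k] = nodes[nodes[v].fail].next[k]; continue; }
                nodes[u].fail = nodes[nodes[v].fail].next[k];
                q.push(u);
            }
        }
    }

    bool containsAnySeed(const std::wstring& text) const {
        int v = 0;
        for (wchar_t c : text) {
            int k = code(c);
            v = (k < 0) ? 0 : nodes[v].next[k];  // non-letter characters reset to the root
            if (nodes[v].terminal) return true;
        }
        return false;
    }
};

Build it once from the seeds (add each seed, then call build()) and reuse containsAnySeed for every incoming string; that is where the O(test string length) check time comes from.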


You should try a pre-existing regex utility. It may be slower than your hand-rolled algorithm, but regex is about matching multiple possibilities, so it is likely to already be several times faster than a hash map or a simple comparison against all the strings. I believe regex implementations may already use the Aho–Corasick algorithm mentioned by RiaD, so you will basically have a well-tested and fast implementation at your disposal.

If you have C++11, you already have a standard regex library:

#include <regex>
#include <string>

int main()
{
    // icase makes the match case-insensitive; wregex/wstring match the OP's wide strings
    std::wregex self_regex(L"google|yahoo|stackoverflow", std::regex_constants::icase);
    std::wstring input_string = L"Visit StackOverflow for answers";
    bool found = std::regex_search(input_string, self_regex);  // true if any seed occurs anywhere
    return found ? 0 : 1;
}

I expect this to generate the best possible minimum match tree, so I expect it to be really fast (and reliable!)


One of the faster ways is to use a suffix tree (https://en.wikipedia.org/wiki/Suffix_tree), but this approach has a huge disadvantage: it is a complicated data structure and difficult to construct. Ukkonen's algorithm builds the tree from a string in linear time: https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm


Because the number of strings is not big (~100), you can use the following algorithm:

  1. Calculate the maximum length of the words you have. Let it be N.
  2. Create an array of checksums: int checks[N];
  3. Let the checksum be the sum of all characters in the search phrase. You can calculate such a checksum for each word from your list (which is known at compile time) and create a std::map<int, std::vector<std::wstring>>, where the int is the checksum of a string and the vector contains all of your strings with that checksum. Create an array of such maps, one for each length (up to N); this can also be done at compile time.
  4. Now move a pointer over the big string. When the pointer points to character X, add the value of that character to every integer in checks, and for each of them (numbers 1 to N) subtract the value of character (X-K), where K is the index of that integer in the checks array. This way you always have the correct checksum for every window length stored in the checks array. After that, look up in the map whether there are strings with that (length, checksum) pair, and if there are, compare them.

It should give false positives (when the checksum and length match but the phrase doesn't) very rarely.

So, let's say R is the length of the big string. Looping over it takes O(R). At each step you perform N additions of a small number (a char value) and N subtractions of a small number (a char value), which is very fast. At each step you also look up a counter in the checks array, which is O(1) because it's a single memory block.

At each step you also have to find the map in the array of maps, which is likewise O(1) because it's a single memory block. Inside the map you search for a string with the correct checksum in O(log F), where F is the size of the map; it will usually contain no more than 2-3 strings, so in general we can pretend it is also O(1).

You can also check whether any strings share a checksum (with just 100 words there is a good chance none do), and if none do, drop the maps entirely and store plain (checksum, string) pairs instead.

So, finally, that should give O(R) with quite a small constant. This way of calculating the checksum can be changed, but it's quite simple and very fast, with really rare false positives.
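A minimal sketch of this rolling-checksum idea might look like the following. The SeedIndex struct, the containsAnySeed helper, and the use of std::map keyed by length and checksum are my own assumptions about how to flesh out the steps above; the compile-time construction mentioned in step 3 is skipped for brevity.

#include <algorithm>
#include <cstddef>
#include <cwctype>
#include <map>
#include <string>
#include <vector>

struct SeedIndex {
    // byLength[len] maps a checksum to the seeds of that length with that checksum
    std::vector<std::map<int, std::vector<std::wstring>>> byLength;
    std::size_t maxLen = 0;

    static int low(wchar_t c) { return (int)std::towlower(c); }

    explicit SeedIndex(const std::vector<std::wstring>& seeds)
    {
        for (const auto& s : seeds) maxLen = std::max(maxLen, s.size());
        byLength.resize(maxLen + 1);
        for (const auto& s : seeds) {
            int sum = 0;
            for (wchar_t c : s) sum += low(c);   // checksum = sum of lowercased chars
            byLength[s.size()][sum].push_back(s);
        }
    }

    bool containsAnySeed(const std::wstring& text) const
    {
        // checks[k] holds the checksum of the last k characters seen so far
        std::vector<int> checks(maxLen + 1, 0);
        for (std::size_t i = 0; i < text.size(); ++i) {
            for (std::size_t k = 1; k <= maxLen; ++k) {
                checks[k] += low(text[i]);                   // character entering the window
                if (i >= k) checks[k] -= low(text[i - k]);   // character leaving the window
                if (i + 1 < k) continue;                     // window not filled yet
                auto hit = byLength[k].find(checks[k]);
                if (hit == byLength[k].end()) continue;
                // checksum and length match: verify with a real case-insensitive compare
                for (const auto& seed : hit->second) {
                    bool equal = true;
                    for (std::size_t j = 0; j < k; ++j)
                        if (low(text[i + 1 - k + j]) != low(seed[j])) { equal = false; break; }
                    if (equal) return true;
                }
            }
        }
        return false;
    }
};

Build a SeedIndex once from the seeds and call containsAnySeed for each incoming string; the per-character cost is N additions/subtractions plus an occasional map lookup, matching the analysis above.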

