
I am looking for help writing an efficient PHP algorithm to find occurrences of a string within another string. Here is the current situation.

I have two arrays. The first array holds the text that needs to be searched (the haystack). The second array holds the terms to find (the needles).

I know that my first array contains at least one of the terms from the needles. So the algorithm needs to ask: is array2[0] found inside array1[0]? If not, loop: is array2[1] found inside array1[0], and so on. If it is found, exit the inner loop, advance to array1[1], and repeat the process.

I want to make sure this is efficient, as I have tens of thousands of entries to process and my needle array has 1,100 individual needles.

You're probably looking for the Boyer-Moore algorithm or one of its variants – they have approximately O(N) complexity. The original lets you cache a preprocessing step which could save you some time if you reuse the same needles a lot. – millimoose Feb 15 '12 at 1:48
(johannburkard.de/software/stringsearch has a bunch of decent implementations of the algorithms you could try and port into PHP, or search for an existing one.) – millimoose Feb 15 '12 at 1:52

2 Answers

Accepted answer

OK, let's start with this algorithm. It might not be the fastest, but it gives the result you want: keep looping until you find the first match.

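<?php
// Build a worst-case test set: 1000 haystack entries and 1000 needles that never match.
$haystack = array();
$needle = array();
for ($i = 0; $i < 1000; $i++) {
    $haystack[] = "Lorem ipsum dolor";
    $needle[] = "no match";
}
// $haystack = array("Lorem ipsum dolor", "Quisque placerat", "Cras quis porttitor orci");
// $needle = array("quis", "Lorem");

// microtime(true) returns the current Unix time as a float, including microseconds.
$timestamp1 = microtime(true);
foreach ($haystack as $word) {
    foreach ($needle as $pattern) {
        // strpos() is case-sensitive and returns false when the needle is absent.
        if (strpos($word, $pattern) !== false) {
            print "'".$pattern."' is in '".$word."'<br />";
            break; // first match found: leave the inner loop and move to the next entry
        }
    }
}

$timestamp2 = microtime(true);
print "It took me ".($timestamp2 - $timestamp1)." seconds to realize there was no match";

?>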

EDIT: I commented out the hard-coded arrays, now build them dynamically, and added a timer. It takes about 1 second at most when there is no match.
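If the nested loop turns out to be too slow, one alternative worth measuring (not part of the answer above) is to join all needles into a single regular expression, so each haystack entry is scanned once. This is only a sketch: the function name `firstMatches` is made up for illustration, it assumes no needle is an empty string, and it reports the leftmost match in the text rather than the first needle in array order. With a very large needle list the combined pattern may run into PCRE limits, so test it against the real data.

```php
<?php
// Hypothetical helper: for each haystack entry, return the matched needle or null.
function firstMatches(array $haystacks, array $needles): array
{
    // Escape every needle so it is matched literally, then join them with "|".
    $escaped = array_map(function ($n) {
        return preg_quote($n, '/');
    }, $needles);
    $pattern = '/' . implode('|', $escaped) . '/';

    $result = array();
    foreach ($haystacks as $i => $text) {
        // preg_match() returns 1 on a match; $m[0] then holds the matched text.
        $result[$i] = preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
    }
    return $result;
}

$haystack = array("Lorem ipsum dolor", "Quisque placerat", "Cras quis porttitor orci");
$needle = array("quis", "Lorem");
print_r(firstMatches($haystack, $needle));
// Matches are case-sensitive: "Lorem", null, "quis"
```

Like strpos, this is case-sensitive; appending the `i` modifier to the pattern would make it case-insensitive.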

Johannes, thanks for that. Your script uses strpos while mine was using stristr. Switching functions made my script perform a LOT better. – user658182 Feb 15 '12 at 2:29

A trie built from the haystack, storing additional information such as word position (page, line, and word number), is more efficient. It uses a divide-and-conquer strategy to avoid useless lookups. With a loop strategy, every item in the haystack has to be searched. A trie sorts the haystack, so you can skip some haystack entries. Here is an example in PHP: http://phpir.com/tries-and-wildcards
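To make the idea concrete, here is a minimal sketch of such a trie. It is not the code from the linked article; the function names `buildTrie` and `entriesWithPrefix` are invented for this example. It assumes ASCII text, lowercases everything (so lookups are case-insensitive), and records only the index of the haystack entry instead of page, line, and word number. Note the trade-off: it finds needles that start a word, not needles buried in the middle of one, so it is not a drop-in replacement for strpos.

```php
<?php
// Hypothetical sketch: build a trie over every word in the haystack entries.
// Each node remembers which entries contain a word passing through that node.
function buildTrie(array $haystacks): array
{
    $root = array('children' => array(), 'entries' => array());
    foreach ($haystacks as $index => $text) {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $node = &$root;
            foreach (str_split($word) as $char) {
                if (!isset($node['children'][$char])) {
                    $node['children'][$char] = array('children' => array(), 'entries' => array());
                }
                $node = &$node['children'][$char];
                $node['entries'][$index] = true; // entry $index has a word with this prefix
            }
            unset($node); // break the reference before handling the next word
        }
    }
    return $root;
}

// Return the indexes of all entries that contain a word starting with $needle.
function entriesWithPrefix(array $trie, string $needle): array
{
    $node = $trie;
    foreach (str_split(strtolower($needle)) as $char) {
        if (!isset($node['children'][$char])) {
            return array(); // no word starts with this prefix
        }
        $node = $node['children'][$char];
    }
    return array_keys($node['entries']);
}

$trie = buildTrie(array("Lorem ipsum dolor", "Quisque placerat", "Cras quis porttitor orci"));
print_r(entriesWithPrefix($trie, "quis"));  // entries 1 and 2
print_r(entriesWithPrefix($trie, "Lorem")); // entry 0
```

The trie is built once, and each needle lookup then costs a number of steps proportional to the needle's length, regardless of how many haystack entries there are.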
