The following script reads links from a text file (one per line), puts them into an array, and then scans each link's source code for a certain line. If that line is found, a corresponding entry is written to a CSV file.

It works fine so far, but it takes ages to finish, since each link is 'opened' and the complete source code for that link is scanned for this specific line.

I'm looking for ideas on how to optimize the code to run faster.

Here is my code:

$filename = "products.txt";
$writecsv = "notavailable.csv";
$products = array();

$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, "/tmp/abCk.txt");
curl_setopt($ch, CURLOPT_URL,"https://www.websitegoeshere.com");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "Login=USERNAME&Password=PASSWORD");

ob_start();      // prevent any output
curl_exec ($ch); // execute the curl command
ob_end_clean();  // stop preventing output

curl_close ($ch);
unset($ch);

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_COOKIEFILE, "/tmp/abCk.txt");

// Open the file
$fp = @fopen($filename, 'r') or die("products.txt not found"); 

// Add each line to an array
if ($fp) {
   $products = explode("\n", fread($fp, filesize($filename)));
}

fclose($fp);

$fpcsv = fopen($writecsv, 'w') or die("could not open notavailable.csv for writing");

foreach ($products as $val) {
    $val = trim($val);  // strip trailing newline/CR so the URL is clean
    if ($val === '') {
        continue;       // skip blank lines
    }
    curl_setopt($ch, CURLOPT_URL, $val);
    $buf2 = curl_exec($ch);
    // searching the raw body is enough; the htmlentities() pass was redundant work
    if (strpos($buf2, "/extension/silver.project/design/sc_base/images/available_yes.gif") !== false) {
        fputcsv($fpcsv, array("available"));      // fputcsv expects an array
    } else {
        fputcsv($fpcsv, array("not available"));
    }
}

fclose($fpcsv);


curl_close ($ch);
echo "csv written successfully.";

Any help is really welcomed. Thanks in advance!

Can you time how long it takes to finish? What you consider "ages" to complete might be perfectly acceptable for this type of task. I have a scraper that grabs booking information for arrests from various county websites, one in particular that first grabs a list of links from one page, and then searches each page those links point to for specific information. It does indeed take ages. I don't use cURL. – jdstankosky Mar 13 at 14:57
It takes about 30 minutes with 700 pages to scan. – hurley Mar 13 at 15:18
It's essentially "browsing" each page (loading the full html)? This sounds about right. It takes my script about 4 minutes or so for around 100 pages. I run mine via CLI. If there ARE any optimizations, I want to hear them too, lol. – jdstankosky Mar 13 at 15:22
Yep, it's browsing the full HTML. Well, in this case I'll accept the time. Thanks for your help! – hurley Mar 13 at 15:24
@hurley instead of trying to sequentially download 700 links, you should try to do it concurrently. stackoverflow.com/questions/2253791/… – abuzittin gillifirca Mar 15 at 15:31
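To illustrate the concurrent approach suggested in the previous comment, here is a rough sketch using PHP's curl_multi API. It is untested against the real site; the batch size, the `fetchBatch()` and `isAvailable()` helpers, and the cookie file path are assumptions for illustration, and the login step from the original script would still have to run first to populate the cookie file.

```php
<?php
// Decide availability from a page's HTML body.
function isAvailable($html) {
    return strpos($html, "/extension/silver.project/design/sc_base/images/available_yes.gif") !== false;
}

// Fetch a batch of URLs in parallel; returns an array of url => body.
function fetchBatch(array $urls, $cookieFile) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Drive all transfers until none are still running.
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh);  // wait for activity instead of busy-looping
        }
    } while ($running > 0);
    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

// Usage sketch: process the product list in batches of 20.
// foreach (array_chunk($products, 20) as $batch) {
//     foreach (fetchBatch($batch, "/tmp/abCk.txt") as $url => $body) {
//         fputcsv($fpcsv, array(isAvailable($body) ? "available" : "not available"));
//     }
// }
```

Downloading 20 pages per round trip instead of one should cut the wall-clock time roughly by the batch size, since the time is dominated by network latency rather than CPU.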
