The following script reads links from a text file (one per line), puts them into an array, and then scans each link's source code for a certain line. If that line is found, a corresponding entry is written to a CSV file.

It works fine so far, but it takes ages to finish, since each link is 'opened' and the complete source code for that link is scanned for this specific line.

I'm looking for ideas on how to optimize the code to run faster.

Here is my code:

$filename = "products.txt";
$writecsv = "notavailable.csv";
$products = array();

$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, "/tmp/abCk.txt");
curl_setopt($ch, CURLOPT_URL,"https://www.websitegoeshere.com");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "Login=USERNAME&Password=PASSWORD");

ob_start();      // prevent any output
curl_exec ($ch); // execute the curl command
ob_end_clean();  // stop preventing output

curl_close ($ch);
unset($ch);

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_COOKIEFILE, "/tmp/abCk.txt");

// Open the file
$fp = @fopen($filename, 'r') or die("products.txt not found"); 

// Add each line to an array
if ($fp) {
   $products = explode("\n", fread($fp, filesize($filename)));
}

fclose($fp);

$fpcsv = fopen($writecsv, 'w') or die("could not open notavailable.csv for writing");

foreach ($products as $val) {
    $val = trim($val);  // strip trailing newline/CR so the URL is clean
    if ($val === '') {
        continue;       // skip blank lines
    }
    curl_setopt($ch, CURLOPT_URL, $val);
    $buf2 = curl_exec($ch);
    // searching the raw body is enough; the htmlentities() pass was redundant work
    if (strpos($buf2, "/extension/silver.project/design/sc_base/images/available_yes.gif") !== false) {
        fputcsv($fpcsv, array("available"));      // fputcsv expects an array
    } else {
        fputcsv($fpcsv, array("not available"));
    }
}

fclose($fpcsv);


curl_close ($ch);
echo "csv written successfully.";

Any help is really welcomed. Thanks in advance!

Can you time how long it takes to finish? What you consider "ages" to complete might be perfectly acceptable for this type of task. I have a scraper that grabs booking information for arrests from various county websites, one in particular that first grabs a list of links from one page, and then searches each page those links point to for specific information. It does indeed take ages. I don't use cURL. – jdstankosky Mar 13 at 14:57
It takes about 30 minutes with 700 pages to scan. – hurley Mar 13 at 15:18
It's essentially "browsing" each page (loading the full html)? This sounds about right. It takes my script about 4 minutes or so for around 100 pages. I run mine via CLI. If there ARE any optimizations, I want to hear them too, lol. – jdstankosky Mar 13 at 15:22
Yep, it's browsing the full HTML. Well, in this case I'll accept the time. Thanks for your help! – hurley Mar 13 at 15:24
@hurley instead of trying to sequentially download 700 links, you should try to do it concurrently. stackoverflow.com/questions/2253791/… – abuzittin gillifirca Mar 15 at 15:31
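To illustrate the concurrent approach suggested in the previous comment, here is a rough sketch using PHP's curl_multi API. It is untested against the real site; the batch size, the `fetchBatch()` and `isAvailable()` helpers, and the cookie file path are assumptions for illustration, and the login step from the original script would still have to run first to populate the cookie file.

```php
<?php
// Decide availability from a page's HTML body.
function isAvailable($html) {
    return strpos($html, "/extension/silver.project/design/sc_base/images/available_yes.gif") !== false;
}

// Fetch a batch of URLs in parallel; returns an array of url => body.
function fetchBatch(array $urls, $cookieFile) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Drive all transfers until none are still running.
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh);  // wait for activity instead of busy-looping
        }
    } while ($running > 0);
    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

// Usage sketch: process the product list in batches of 20.
// foreach (array_chunk($products, 20) as $batch) {
//     foreach (fetchBatch($batch, "/tmp/abCk.txt") as $url => $body) {
//         fputcsv($fpcsv, array(isAvailable($body) ? "available" : "not available"));
//     }
// }
```

Downloading 20 pages per round trip instead of one should cut the wall-clock time roughly by the batch size, since the time is dominated by network latency rather than CPU.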
