I have a web scraper that processes about 2,000 pages, and I've tried to speed it up with a Parallel.ForEach loop. My current code (trimmed for brevity) is:
Parallel.ForEach(dataTable1.AsEnumerable(), row =>
{
    scrape();
});

public void scrape()
{
    HtmlWeb htmlWeb = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = htmlWeb.Load("http://www.website.com");
    doScraping(doc);
}
When this used a regular foreach loop, it worked. Now it processes some number of rows, and then I start getting the following exceptions when trying to retrieve the HtmlDocument:
A first chance exception of type 'System.Net.WebException' occurred in System.dll
A first chance exception of type 'System.Net.WebException' occurred in HtmlAgilityPack.dll
The Operation Has Timed Out
What causes the timeout when running in the Parallel loop? It gets through the first 150-300 rows and then times out on every subsequent row.