
I have a web scraper that processes about 2,000 pages that I've tried to speed up by using a Parallel.ForEach loop. My current code (trimmed for brevity) is:

Parallel.ForEach(dataTable1.AsEnumerable(), row =>
{
    scrape();
});

public void scrape()
{
    HtmlWeb htmlWeb = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = htmlWeb.Load("http://www.website.com");
    doScraping(doc);
}

When this used a regular foreach loop, it worked. Now it processes some number of rows and then starts throwing the following exceptions when retrieving the HtmlDocument:

A first chance exception of type 'System.Net.WebException' occurred in System.dll

A first chance exception of type 'System.Net.WebException' occurred in HtmlAgilityPack.dll

The Operation Has Timed Out

What causes the timeout when running in the Parallel loop? It gets through the first 150-300 rows and then times out on every subsequent row.

Sounds like the site you want to scrape is blocking you because of the large number of requests you make in parallel, which looks to them like a DoS attack. –  shriek May 4 '13 at 17:32
    
@shriek: I don't think the site is blocking me. I can still access it from a browser while the application is continuously getting timeout errors. Also, if I restart the application, it works again temporarily. –  Soma Holiday May 5 '13 at 3:28
    
Adding in new ParallelOptions { MaxDegreeOfParallelism = 4 } seems to reduce my problem to a very occasional timeout. I'm running the app on a 4 core processor. I'm still curious why parallel doesn't handle this better. –  Soma Holiday May 5 '13 at 4:12

1 Answer

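I think it's because there is a limit on the maximum number of simultaneous HttpWebRequest connections to a single host. Once the parallel loop has more requests in flight than that limit allows, the extra requests wait for a free connection and can time out while waiting. Check this .NET setting: ConnectionManagement Element (Network Settings).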

You can also do it programmatically: How can I programmatically remove the 2 connection limit in WebClient
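As a rough sketch of the programmatic route, assuming the .NET Framework, where HtmlWeb.Load goes through HttpWebRequest: raise ServicePointManager.DefaultConnectionLimit once at startup and cap the loop's parallelism at the same number. The ScraperSetup class, the Configure method, and the value passed to it are hypothetical names chosen for illustration; they are not part of the original code.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

public static class ScraperSetup
{
    // Hypothetical helper: raises the per-host connection limit and returns
    // ParallelOptions capped at the same value, so the loop does not request
    // more simultaneous connections than the limit allows.
    public static ParallelOptions Configure(int maxConnections)
    {
        // Call this before the first request to the host; service points
        // that already exist keep the limit they were created with.
        ServicePointManager.DefaultConnectionLimit = maxConnections;

        return new ParallelOptions { MaxDegreeOfParallelism = maxConnections };
    }
}
```

It could then be used as `var options = ScraperSetup.Configure(8);` followed by `Parallel.ForEach(dataTable1.AsEnumerable(), options, row => scrape());`. The value 8 is only a placeholder; a suitable number depends on what the target site tolerates.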

The browser keeps working at the same time because it runs in a separate process, so its connections do not count against your application's limit.
