
I am working on a scraping project that involves fetching web data and parsing it for further use. I use PHP and cURL to write scraping scripts that crawl web data, and I rely on either PHP DOM or the Simple HTML DOM Parser library for these kinds of projects.

On a recent project I encountered some challenges: initially the target website blocked my server's IP, so the server could not make any successful requests to the site. Understanding this to be a common issue, I bought a set of private proxies and tried making requests through them.

Though this got successful responses at first, I noticed the script gets blocked after two or three consecutive requests. Printing and inspecting the response, I could see a pop-up asking for CAPTCHA validation. There were no CAPTCHA characters to enter, and it also showed the error “input error: invalid referrer”. Examining the source, I found some Google reCAPTCHA scripts embedded in the page. I am stuck at this point and cannot execute my script.

My script is used for gathering data, and it needs to go through a large number of pages on the site periodically. In the current scenario, though, I cannot proceed. I can see there are options for overcoming these CAPTCHA issues, and scraping sites like this seems to be common.

I have been monitoring my script's performance and responses over the last two months. During the first month I was able to execute a very large number of requests from a single IP and got results. Then my IP was blocked, and I switched to private proxies, which got me some results. Now I am facing this CAPTCHA trouble.

I would appreciate any help or suggestions in this regard.

(Often with this kind of question, the first comment I get is, ‘Have you asked the target site for prior permission?’ I haven’t, but I know there are many services that scrape details out of sites, and target sites may not readily grant them access. I respect the legality and etiquette of scraping, but I would like to know at what point I am stuck and how I could get past it!)

I could provide any supporting information if needed.


closed as too broad by gnat, Bart van Ingen Schenau, MichaelT, jwenting, Dan Pichelman Aug 20 at 19:01

There are either too many possible answers, or good answers would be too long for this format. Please add details to narrow the answer set or to isolate an issue that can be answered in a few paragraphs. If this question can be reworded to fit the rules in the help center, please edit the question.

Just a guess, but it seems that the site you are targeting is taking countermeasures to block your scraping script. Your best course of action is to contact the site's owner and try to come to an agreement for accessing the data. –  Bart van Ingen Schenau Aug 16 at 6:29
So, you're basically encountering a sign on private property saying "Please stay off the lawn", and you're asking us to help you repeatedly step on the lawn? –  Kilian Foth Aug 16 at 7:29
Trying to get help circumventing security measures is not what we're here for. –  jwenting Aug 20 at 12:22

1 Answer

There's something called netiquette. When you develop web scrapers, you should abide by its rules.

Remember, you're a guest: you're using their server's bandwidth (which they pay for) and their data (which is their intellectual property) for your personal profit. They aren't legally bound to give you anything, and they can take legal action against data misuse and degradation of service caused by your scraping.

Directives are there to be followed, not to be ignored at will.

My recommendations are:

  • Check /robots.txt (if it exists) to see which directories and files you're allowed to access. If you're blocked, desist.
  • If /robots.txt doesn't exist, you're implicitly allowed to access any directory or file that isn't otherwise protected (e.g. by .htaccess).
  • Before issuing a GET request, issue a HEAD request to see whether the file's content has changed since the last time you requested it.
  • Never fire a series of GET requests back to back; use the Crawl-delay directive from /robots.txt to sleep your process between requests, or default to 1 to 5 seconds.
  • Check the resulting HTML (if you're requesting .html files) for meta tags, and follow any rules they specify concerning bots.
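The robots.txt handling above can be sketched with Python's standard library (the asker's project is in PHP, but the idea translates directly; the bot name and the robots.txt contents here are made up for illustration):

```python
import time
from urllib import robotparser

# A hypothetical robots.txt, inlined for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "MyScraper/1.0"  # assumed bot name

rp = robotparser.RobotFileParser()
# Against a live site you would use rp.set_url(".../robots.txt"); rp.read()
rp.parse(ROBOTS_TXT.splitlines())

def allowed(url):
    """Respect Disallow rules: if we're blocked, desist."""
    return rp.can_fetch(USER_AGENT, url)

def polite_delay():
    """Honor Crawl-delay, defaulting to a couple of seconds when absent."""
    return rp.crawl_delay(USER_AGENT) or 2

print(allowed("https://example.com/private/data.html"))  # False
print(allowed("https://example.com/public/page.html"))   # True
print(polite_delay())                                    # 5
# Between consecutive GETs: time.sleep(polite_delay())
```

`urllib.robotparser` ships with Python, so honoring Disallow and Crawl-delay costs nothing beyond a `sleep` call in the request loop.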

That should suffice to make a polite web scraper.
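The HEAD-before-GET advice can also be sketched. Assuming the server sends a Last-Modified header (not all do), a minimal Python sketch of the decision logic, with the URL and dates invented for the example:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import urllib.request

def head_last_modified(url):
    """Issue a HEAD request and return the Last-Modified header, if any."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Last-Modified")

def is_stale(last_modified_header, last_fetched):
    """True when the resource changed since we last fetched it.

    If the server sends no Last-Modified header, assume it changed
    and re-fetch (the conservative choice)."""
    if last_modified_header is None:
        return True
    return parsedate_to_datetime(last_modified_header) > last_fetched

last_fetched = datetime(2015, 10, 20, tzinfo=timezone.utc)
print(is_stale("Wed, 21 Oct 2015 07:28:00 GMT", last_fetched))  # True
print(is_stale("Mon, 19 Oct 2015 07:28:00 GMT", last_fetched))  # False
```

Only when `is_stale(...)` is true does the scraper issue the full GET, which saves the site's bandwidth and your own.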

Now, you say they IP-banned you. Take the hint: they don't want you in. Don't set up proxies to bypass their measures; you're basically breaking in.

Would you like burglars to pole-vault into your house after you set up an electric fence?

And after you broke in, you got a reCAPTCHA challenge, which exists to stop automated bots from doing things they shouldn't. Again, what part of "Hey, we don't want you in; get out!" don't you get?

