Skip to main content

You are not logged in. Your edit will be placed in a queue until it is peer reviewed.

We welcome edits that make the post easier to understand and more valuable for readers. Because community members review edits, please try to make the post substantially better than how you found it, for example, by fixing grammar or adding additional resources and hyperlinks.

Required fields*

Need to Stop Bots from Killing my Webserver

I am having EXTREME bot problems on the some of my websites within my hosting account. The bots utilize over 98% of my CPU resources and 99% of my bandwidth for my entire hosting account. These bots are generating over 1 GB of traffic per hour for my sites. The real human traffic for all of these sites is less than 100 MB / month.

I have done extensive research on both robots.txt and .htaccess file to block these bots but all methods failed.

I have also put code in the robots.txt files to block access to the scripts directories, but these bots (Google, MS Bing, and Yahoo) ignore the rules and run the scripts anyways.

I do not want to completely block the Google, MS Bing, and Yahoo bots, but I want to limit there crawl rate. Also, adding a Crawl-delay statement in the robots.txt file does not slow down the bots. My current robots.txt and .htacces code for all sites are stated below.

I have setup both Microsoft and Google webmaster tools to slow down the crawl rate to the absolute minimum, but they are still hitting these sites at a rate of 10 hits / second.

In addition, every time I upload a file that causes an error, the entire VPS webserver goes down within seconds such that I cannot even access the site correct the issue due to the onslaught of hits by these bots.

What can I do to stop the on-slot of traffic to my websites?

I tried asking my web hosting company (site5.com) many times about this issue in the past months and they cannot help me with this problem.

What I really need is to prevent the Bots from running the rss2html.php script. I tried both sessions and cookies and both failed.

robots.txt

User-agent: Mediapartners-Google
Disallow: 
User-agent: Googlebot
Disallow: 
User-agent: Adsbot-Google
Disallow: 
User-agent: Googlebot-Image
Disallow: 
User-agent: Googlebot-Mobile
Disallow: 
User-agent: MSNBot
Disallow: 
User-agent: bingbot
Disallow: 
User-agent: Slurp
Disallow: 
User-Agent: Yahoo! Slurp
Disallow: 
# Directories
User-agent: *
Disallow: /
Disallow: /cgi-bin/
Disallow: /ads/
Disallow: /assets/
Disallow: /cgi-bin/
Disallow: /phone/
Disallow: /scripts/
# Files
Disallow: /ads/random_ads.php
Disallow: /scripts/rss2html.php
Disallow: /scripts/search_terms.php
Disallow: /scripts/template.html
Disallow: /scripts/template_mobile.html

.htaccess

ErrorDocument 400 http://english-1329329990.spampoison.com
ErrorDocument 401 http://english-1329329990.spampoison.com
ErrorDocument 403 http://english-1329329990.spampoison.com
ErrorDocument 404 /index.php
SetEnvIfNoCase User-Agent "^Yandex*" bad_bot
SetEnvIfNoCase User-Agent "^baidu*" bad_bot
Order Deny,Allow
Deny from env=bad_bot
RewriteEngine on
RewriteCond %{HTTP_user_agent} bot\* [OR]
RewriteCond %{HTTP_user_agent} \*bot
RewriteRule ^.*$ http://english-1329329990.spampoison.com [R,L]
RewriteCond %{QUERY_STRING} mosConfig_[a-zA-Z_]{1,21}(=|\%3D) [OR]
# Block out any script trying to base64_encode crap to send via URL
RewriteCond %{QUERY_STRING} base64_encode.*\(.*\) [OR]
# Block out any script that includes a <script> tag in URL
RewriteCond %{QUERY_STRING} (\<|%3C).*script.*(\>|%3E) [NC,OR]
# Block out any script trying to set a PHP GLOBALS variable via URL
RewriteCond %{QUERY_STRING} GLOBALS(=|\[|\%[0-9A-Z]{0,2}) [OR]
# Block out any script trying to modify a _REQUEST variable via URL
RewriteCond %{QUERY_STRING} _REQUEST(=|\[|\%[0-9A-Z]{0,2})
# Send all blocked request to homepage with 403 Forbidden error!
RewriteRule ^(.*)$ index.php [F,L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !^/index.php
RewriteCond %{REQUEST_URI} (/|\.php|\.html|\.htm|\.feed|\.pdf|\.raw|/[^.]*)$  [NC]
RewriteRule (.*) index.php
RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
# Don't show directory listings for directories that do not contain an index file (index.php, default.asp etc.)
Options -Indexes
<Files http://english-1329329990.spampoison.com>
order allow,deny
allow from all
</Files>
deny from 108.
deny from 123.
deny from 180.
deny from 100.43.83.132

UPDATE TO SHOW ADDED USER AGENT BOT CHECK CODE

<?php
function botcheck(){
 $spiders = array(
   array('AdsBot-Google','google.com'),
   array('Googlebot','google.com'),
   array('Googlebot-Image','google.com'),
   array('Googlebot-Mobile','google.com'),
   array('Mediapartners','google.com'),
   array('Mediapartners-Google','google.com'),
   array('msnbot','search.msn.com'),
   array('bingbot','bing.com'),
   array('Slurp','help.yahoo.com'),
   array('Yahoo! Slurp','help.yahoo.com')
 );
 $useragent = strtolower($_SERVER['HTTP_USER_AGENT']);
 foreach($spiders as $bot) {
   if(preg_match("/$bot[0]/i",$useragent)){
     $ipaddress = $_SERVER['REMOTE_ADDR']; 
     $hostname = gethostbyaddr($ipaddress);
     $iphostname = gethostbyname($hostname);
     if (preg_match("/$bot[1]/i",$hostname) && $ipaddress == $iphostname){return true;}
   }
 }
}
if(botcheck() == false) {
  // User Login - Read Cookie values
     $username = $_COOKIE['username'];
     $password = $_COOKIE['password'];
     $radio_1 = $_COOKIE['radio_1'];
     $radio_2 = $_COOKIE['radio_2'];
     if (($username == 'm3s36G6S9v' && $password == 'S4er5h8QN2') || ($radio_1 == '2' && $radio_2 == '5')) {
     } else {
       $selected_username = $_POST['username'];
       $selected_password = $_POST['password'];
       $selected_radio_1 = $_POST['group1'];
       $selected_radio_2 = $_POST['group2'];
       if (($selected_username == 'm3s36G6S9v' && $selected_password == 'S4er5h8QN2') || ($selected_radio_1 == '2' && $selected_radio_2 == '5')) {
         setcookie("username", $selected_username, time()+3600, "/");
         setcookie("password", $selected_password, time()+3600, "/");
         setcookie("radio_1", $selected_radio_1, time()+3600, "/");
         setcookie("radio_2", $selected_radio_2, time()+3600, "/");
       } else {
        header("Location: login.html");
       }
     }
}
?>

I also added the following to the top tof the rss2html.php script

// Checks to see if this script was called by the main site pages, (i.e. index.php or mobile.php) and if not, then sends to main page
   session_start();  
   if(isset($_SESSION['views'])){$_SESSION['views'] = $_SESSION['views']+ 1;} else {$_SESSION['views'] = 1;}
   if($_SESSION['views'] > 1) {header("Location: http://website.com/index.php");}

Answer*

Cancel
4
  • thanks, but I do not want to completely block the Google, MS Bing, and Yahoo bots, but I want to limit there direct hits on the rss2html.php script file. I just need to add something to the beginning of the rss2html.php script that will prevent it from running if it was not accessed via the index.php script. The bots are currently running the rss2html.php script bypassing the index.php file.
    – Sammy
    Commented May 18, 2012 at 21:48
  • This doesnt block them.. you simply serve a cached version of your php.. this is very easy for a server todo, it's one less php instance/one less apache child process. => Cost(static file) < Cost(php instance).
    – smassey
    Commented May 18, 2012 at 21:56
  • how would I cache the pages? Since the pages are RSS, will the cached pages be refreshed often enough to provide fresh data?
    – Sammy
    Commented May 18, 2012 at 22:05
  • Of course... Write a cronjob which does it for you. If you say they hit the server 10req/s if you cache the pages for 1minute you've saved your server 599 extra php instances (which certainly include db connections/queries).. And once a minute is much more than what I would vote for: 10/15min.
    – smassey
    Commented May 18, 2012 at 22:41