seo blog

Assorted posts about website promotion, SEO and other subjects from the editor of the directory of SEO links

Archive for the ‘security’ Category

Automatic defending script against bad robots

  • Filed under: security
Friday
Dec 28,2007

Last week a Hungarian guy asked me if I could develop an effective solution against bad robots. A few days ago I ran a quick survey on the topic and found a lot of solutions, but most of them were based on rejecting certain hosts via .htaccess rules and none of them was automatic, so the challenge was set. The most useful site I found was this resource, which taught me the basic attitude of bad robots towards robots.txt files: they ignore the specified restrictions.

1. Open your existing robots.txt file, or create and upload one, and add the following lines to it:

User-agent: *
Disallow: /core

The name of the restricted folder is not important, but it would be great if human attackers found it attractive enough as well, since this folder will serve as live bait.

2. Create the folder specified in the robots.txt file on your hosting space (in my example it is called core) and upload an index.php file into it with the following content:

<?php
// Trap page: only clients that ignore robots.txt should ever request it.
$ip = $_SERVER['REMOTE_ADDR'];
$logfile = 'bannolnilog.txt';
// Collect the IP addresses into the logfile, one address per line.
$fp = fopen($logfile, 'a');
if ($fp) {
    fputs($fp, $ip . "\n");
    fclose($fp);
}
echo 'Your IP was logged for security reasons and your visit is now over.';
?>

3. As you can see in the code, I defined a $logfile where the IP addresses are collected and stored, so we need to upload a blank text file called bannolnilog.txt (chmod 644) to the same (core) folder.

4. We need to upload one more PHP file, which checks whether the visitor is banned whenever a page is requested. I named this file validator.php and its content is the following:

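<?php
$ip = $_SERVER['REMOTE_ADDR'];
$logfile = dirname(__FILE__) . '/core/bannolnilog.txt';
// Read the logged IP addresses, one per line (empty list if unreadable).
$target = is_readable($logfile) ? file($logfile) : array();
foreach ($target as $item) {
    $item = trim($item);
    // Skip blank lines and compare whole addresses, so that a logged
    // 1.2.3.4 does not also block 11.2.3.4.
    if ($item !== '' && $item === $ip) {
        header('HTTP/1.0 403 Forbidden');
        exit;
    }
}
?>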

5. As a final step, insert this line at the very top of your script's header or index file. The point is that the validator must run first whenever a page is requested:

<?php require '/you/need/to/insert/the/path/here/validator.php'; ?>
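
The matching rule used in step 4 decides who gets a 403, so it can be worth trying in isolation. The following is only a minimal sketch and not part of the original script: the function name is_banned_ip and the sample addresses are made up for illustration. It assumes whole-address comparison, because a substring test such as stristr would treat 11.2.3.4 as banned once 1.2.3.4 is in the logfile.

```php
<?php
// Hypothetical helper (not in the original post): returns true when $ip
// appears as a complete line in the list of logged addresses.
function is_banned_ip($ip, array $lines)
{
    foreach ($lines as $line) {
        $line = trim($line);
        // Ignore blank lines and compare complete addresses only.
        if ($line !== '' && $line === $ip) {
            return true;
        }
    }
    return false;
}

// Made-up sample data, one address per line like the logfile.
$logged = array("1.2.3.4\n", "\n", "203.0.113.9\n");

var_dump(is_banned_ip('1.2.3.4', $logged));  // listed address
var_dump(is_banned_ip('11.2.3.4', $logged)); // only contains 1.2.3.4 as a substring
```

In validator.php the list would come from file() on the logfile instead of the sample array.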

Note: You may truncate the logfile to delete the collected IPs. Please also keep in mind that WordPress can display quotation marks a bit oddly, so you may want to double-check the syntax of the code.
I give no warranty, but it works very well on one of my sites.
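
Since the trap page appends an address on every hit, the same IP can end up in the logfile many times. As a rough sketch only, a variant that skips addresses already listed might look like this. The function name log_ip_once is hypothetical, and a temporary file stands in for bannolnilog.txt so the example can be run safely.

```php
<?php
// Hypothetical helper (not in the original post): appends $ip to $logfile
// unless it is already listed. Returns true only when a line was written.
function log_ip_once($logfile, $ip)
{
    $known = is_readable($logfile)
        ? file($logfile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : array();
    if (in_array($ip, $known, true)) {
        return false; // already logged
    }
    // LOCK_EX holds an exclusive lock on the file while the line is written.
    return file_put_contents($logfile, $ip . "\n", FILE_APPEND | LOCK_EX) !== false;
}

// Try it against a temporary file instead of the real logfile.
$tmp = tempnam(sys_get_temp_dir(), 'ban');
$first = log_ip_once($tmp, '203.0.113.9');  // new address
$second = log_ip_once($tmp, '203.0.113.9'); // same address again
$contents = file_get_contents($tmp);
unlink($tmp);
echo $contents;
```

The read and the write are separate steps here, so two simultaneous hits from the same address could still produce a duplicate line; for a small trap log that is usually harmless.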

Have a nice day!

Monday
Sep 17,2007

Short story

The problem was given. At the end of the linked posts there is a link to a recommended “reverse cloaking” solution, but all I achieved by implementing it was an “Unreachable network” error in Google Webmaster Tools.

Days went by, but finally I found this thread at WebmasterWorld.com. IncrediBILL suggested a solution based on reverse-forward DNS robot validation, using the following PHP script:

<?php
// Get the user agent.
$ua = $_SERVER['HTTP_USER_AGENT'];
// Check the user agent to see if it's identifying itself as a search engine bot.
if (stristr($ua, 'msnbot') || stristr($ua, 'Googlebot') || stristr($ua, 'Yahoo! Slurp')) {
    // The user agent is purporting to be MSN's bot, Google's bot or Yahoo! Slurp.
    // If the user agent string is spoofed, we won't find the crawler's domain in the host name.
    // Get the IP address requesting the page.
    $ip = $_SERVER['REMOTE_ADDR'];
    // Reverse DNS lookup the IP address to get a hostname.
    $hostname = gethostbyaddr($ip);
    // Check for '.googlebot.com', 'search.live.com' and 'crawl.yahoo.net' in hostname.
    if (!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname) && !preg_match("/crawl\.yahoo\.net$/", $hostname)) {
        // The host name does not belong to googlebot.com, live.com or yahoo.net.
        // Remember the UA already said it is MSNBot, Googlebot or Yahoo! Slurp.
        $block = TRUE;
        header("HTTP/1.0 403 Forbidden");
        exit;
    } else {
        // Now we have a hit that half-passes the check. One last go:
        // Forward DNS lookup the hostname to get an IP address.
        $real_ip = gethostbyname($hostname);
        if ($ip != $real_ip) {
            $block = TRUE;
            header("HTTP/1.0 403 Forbidden");
            exit;
        } else {
            // Real bot.
            $block = FALSE;
        }
    }
}
?>

The original script didn’t validate the Yahoo Slurp bot, so I extended the script to cover it as well.
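
To try the reverse-forward DNS idea without real DNS traffic, the same three checks can be wrapped in a function that takes the two lookups as parameters. This is only a sketch under stated assumptions, not IncrediBILL's code: the function name is_verified_crawler, the lookup tables and all addresses and host names below are made up for illustration. In real use one would pass 'gethostbyaddr' and 'gethostbyname' as the two callables.

```php
<?php
// Hypothetical helper (not from the original thread): verifies a claimed
// crawler by reverse DNS, a host name suffix check, then forward DNS.
function is_verified_crawler($ip, $reverse, $forward)
{
    $hostname = call_user_func($reverse, $ip);
    // gethostbyaddr() returns the unmodified IP when the lookup fails
    // and false on malformed input; treat both as "not verified".
    if (!is_string($hostname) || $hostname === $ip) {
        return false;
    }
    // Same host name suffixes as in the script above.
    if (!preg_match('/(\.googlebot\.com|search\.live\.com|crawl\.yahoo\.net)$/', $hostname)) {
        return false;
    }
    // The forward lookup must lead back to the requesting address.
    return call_user_func($forward, $hostname) === $ip;
}

// Made-up lookup tables standing in for DNS (documentation address ranges).
$ptr = array(
    '192.0.2.10'   => 'crawl-192-0-2-10.googlebot.com',
    '198.51.100.7' => 'host7.example.net',
    '198.51.100.8' => 'fake.googlebot.com',
);
$a = array(
    'crawl-192-0-2-10.googlebot.com' => '192.0.2.10',
    'fake.googlebot.com'             => '203.0.113.50',
);
// Like the real functions, the stand-ins echo their input back on failure.
$reverse = function ($ip) use ($ptr) { return isset($ptr[$ip]) ? $ptr[$ip] : $ip; };
$forward = function ($host) use ($a) { return isset($a[$host]) ? $a[$host] : $host; };

var_dump(is_verified_crawler('192.0.2.10', $reverse, $forward));   // both lookups agree
var_dump(is_verified_crawler('198.51.100.8', $reverse, $forward)); // forward lookup points elsewhere
```

The second case is the one the forward lookup exists for: anyone who controls the reverse DNS of their own address can make it claim a googlebot.com name, but they cannot make that name resolve back to their address.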

So all you need to do is download the installation package and implement it according to the attached guides. If you are a WordPress user, I have really good news for you: thanks to mosquito, a WordPress plugin is also available, with a guide as well.

And don’t forget about testing. Open your Firefox browser, type “about:config” (without quotes) into the address bar and press Enter. Right-click, select New, then String, and add “general.useragent.override” as the name and “Googlebot/2.1 (+http://www.googlebot.com/bot.html)” as the value. Refresh your site after implementing the defending script or plugin and you will see exactly what a robot sees when it comes through a proxy site.

Downloads

Installation pack
google proxy defending

WordPress plugin
google proxy defending