One standard form of information discovery and reconnaissance used by malicious attackers is to scan a target website for its robots.txt file. The robots.txt file is designed to provide instructions to spiders or web crawlers about a site's structure and, more importantly, to specify which pages and directories the spider should not crawl. Often these files are used to keep a spider from crawling sensitive areas of a website, such as administrative interfaces, so that search engines don't cache the existence of such pages and functionality. It is precisely for this reason that a malicious attacker will look in a robots.txt file: it often provides a roadmap to sensitive data and administrative interfaces.
Knowing that malicious attackers might look into your robots.txt file and explore the listings there allows you to employ a few defensive techniques, or at least provide some early warning measures. One possibility is to simply waste an attacker's time. For instance, if your site has an administrative interface at /admin you might want to list a couple hundred non-existent sub-directories and slip /admin into the list near the middle or end. This would prove frustrating for an attacker looking through the robots.txt entries by hand. An attacker using an automated tool, however, is unlikely to be slowed down by false entries in the robots.txt file.
The system I'm describing can be implemented in a number of ways, but the basic idea is the same. You fill your robots.txt file with numerous false entries. Each of these false entries leads to a server response that triggers a blacklisting of the offending IP address. This means that real subdirectories and files can still safely be embedded in the robots.txt, but the time needed to try every entry becomes prohibitive for an attacker.
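Writing out a long decoy list by hand is tedious, so the file can be generated. The following is a minimal sketch, not part of the original setup: the function name make_robots, its arguments, and the decoy names passed to it are all hypothetical. It prints a robots.txt in which the one real path is placed at a chosen position among the decoys.

```shell
# Sketch: build a robots.txt whose real entry is hidden among decoys.
# make_robots and its arguments are hypothetical names for illustration.
make_robots() {
    real_path=$1      # the one path that really exists, e.g. /admin
    position=$2       # 1-based slot among the Disallow lines for the real path
    shift 2           # the remaining arguments are decoy paths

    echo "User-agent: *"
    n=0
    for decoy in "$@"; do
        n=$((n + 1))
        # Emit the real path just before the decoy that would take its slot.
        if [ "$n" -eq "$position" ]; then
            echo "Disallow: $real_path"
        fi
        echo "Disallow: $decoy"
    done
    # If the slot lies past the last decoy, append the real path at the end.
    if [ "$position" -gt "$n" ]; then
        echo "Disallow: $real_path"
    fi
}

# Example: /admin becomes the third Disallow line.
make_robots /admin 3 /administration /login /restricted /files
```

Redirecting the output to the site's robots.txt would install the list. The decoy names themselves still have to look plausible, which this sketch leaves to whoever supplies them.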
In principle the system functions in a fairly straightforward manner. Assume we have an administrative login page at /admin that we want to hide from attackers. We create a robots.txt file that contains the following entries:
User-agent: *
Disallow: /administration
Disallow: /login
Disallow: /restricted
Disallow: /files
Disallow: /customers
Disallow: /control_panel
Disallow: /administer
Disallow: /admin
Disallow: /cms
Disallow: /backend
If someone requests any of these pages (except /admin), their IP address is added to the firewall deny listing for a set amount of time. We don't want to permanently ban IP addresses because this creates a denial of service opportunity for an attacker who could spoof legitimate IP addresses in requests for bad pages. A ten-minute ban is usually sufficient.
Using this method and the robots.txt file above, an attacker with a static IP address who works through the entries in order trips seven decoys before reaching /admin, and at ten minutes per ban must spend at least 70 minutes to discover that part of the site. This is a resource exhaustion defense. Of course, an able attacker will use some mechanism to shift their IP address, so it helps to have quite a few listings in this robots.txt honeypot (a dozen or so is clearly insufficient to slow down a determined attacker).
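The 70-minute figure comes from counting the decoys that precede /admin and multiplying by the ban length. As a rough sketch of that arithmetic (the function name estimate_delay and its arguments are hypothetical, and it assumes the attacker tries entries in file order from a single IP address):

```shell
# Sketch: estimate how long an attacker working through robots.txt in order
# is held up before reaching the real entry. estimate_delay is hypothetical.
estimate_delay() {
    robots_file=$1    # path to the robots.txt to inspect
    real_path=$2      # the entry that does not trigger a ban
    ban_minutes=$3    # length of each ban in minutes

    # Count the Disallow lines that come before the real entry.
    decoys_before=$(awk -v real="$real_path" '
        $1 == "Disallow:" && $2 == real { exit }
        $1 == "Disallow:" { count++ }
        END { print count + 0 }
    ' "$robots_file")

    echo $((decoys_before * ban_minutes))
}

# Example using the ten entries listed above, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'User-agent: *' \
    'Disallow: /administration' 'Disallow: /login' 'Disallow: /restricted' \
    'Disallow: /files' 'Disallow: /customers' 'Disallow: /control_panel' \
    'Disallow: /administer' 'Disallow: /admin' 'Disallow: /cms' \
    'Disallow: /backend' > "$tmp"
estimate_delay "$tmp" /admin 10   # prints 70
rm -f "$tmp"
```

Moving the real entry further down the list, or adding more decoys ahead of it, raises the estimate linearly.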
Now, to implement this functionality we need to do two things. The first is to create a custom script that adds offenders to our blacklist. For the purposes of this example we'll assume a LAMP-based site, so we need a script that adds offenders to our iptables rules. We'll put all of the scripts for this task in /var/www/robots_honeypot. This directory and the scripts should be writable and executable by the web server. In its simplest version, the IP blocking script looks like this:
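#!/bin/bash
#
# blacklist_ip.sh
#
# Usage: blacklist_ip.sh <ip-address>
IP=$1
if [ -z "${IP}" ]; then
    echo "usage: $0 <ip-address>" >&2
    exit 1
fi
IPTABLES="/sbin/iptables"
# drop traffic from the address on both the INPUT and FORWARD chains
sudo ${IPTABLES} -I INPUT -s "${IP}" -j DROP
sudo ${IPTABLES} -I FORWARD -s "${IP}" -j DROP
# log the entry as "<epoch seconds> <ip>"
echo "$(date +%s) ${IP}" >> /var/www/robots_honeypot/robots_honeypot.log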
When this script is called with an IP address as an argument, the firewall drops all traffic from that address. The script has no capability to de-list an IP address, so you'll have to implement another mechanism to do that. The easiest way is to log when IP addresses are added to the firewall and then run another script via cron that cleans them out. The cleanup script is fairly simple and reads and manipulates log files in /var/www/robots_honeypot.
#!/bin/bash
#
# remove_black.sh
#
# keep robots_honeypot.log.expired as a permanent record
DIR=/var/www/robots_honeypot
# nothing to do if no addresses have been logged
[ -f ${DIR}/robots_honeypot.log ] || exit 0
mv ${DIR}/robots_honeypot.log ${DIR}/robots_honeypot.log.working
current_time=$(date +%s)
while read -r time ip
do
    time_diff=$((current_time - time))
    # timeout set to 600 seconds (10 minutes)
    if [ "$time_diff" -gt 600 ]
    then
        # blacklist time expired, drop firewall rule
        ${DIR}/unblacklist_ip.sh "${ip}"
        echo "${time} ${ip}" >> ${DIR}/robots_honeypot.log.expired
    else
        # time not up, return record to the log
        echo "${time} ${ip}" >> ${DIR}/robots_honeypot.log
    fi
done < ${DIR}/robots_honeypot.log.working
rm -f ${DIR}/robots_honeypot.log.working
This script calls the following script, which lives in the same directory and removes IP addresses from the firewall rules:
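#!/bin/bash
#
# unblacklist_ip.sh
#
# Usage: unblacklist_ip.sh <ip-address>
IP=$1
if [ -z "${IP}" ]; then
    echo "usage: $0 <ip-address>" >&2
    exit 1
fi
IPTABLES="/sbin/iptables"
# delete the DROP rules that blacklist_ip.sh inserted
sudo ${IPTABLES} -D INPUT -s "${IP}" -j DROP
sudo ${IPTABLES} -D FORWARD -s "${IP}" -j DROP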
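Both blacklist_ip.sh and unblacklist_ip.sh hand their argument to iptables under sudo, and that argument ultimately originates from a web request, so it seems prudent to check that it really is an IPv4 address first. The helper below is a sketch of one way to do that. The name is_ipv4 is hypothetical and the check covers dotted-quad IPv4 only.

```shell
# Sketch: accept only dotted-quad IPv4 addresses.
# is_ipv4 is a hypothetical helper, not part of the original scripts.
is_ipv4() {
    # Reject empty input and anything other than digits and dots.
    case $1 in
        ''|*[!0-9.]*) return 1 ;;
    esac
    # Reject leading, trailing, or doubled dots (these mean an empty octet).
    case $1 in
        .*|*.|*..*) return 1 ;;
    esac
    # Split on dots into positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    [ "$#" -eq 4 ] || return 1
    # Each octet must be at most three digits and no larger than 255.
    for octet in "$@"; do
        [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}

is_ipv4 192.168.1.111 && echo "accepted"
is_ipv4 "1.2.3.4; reboot" || echo "rejected"
```

Either script could call such a check on its argument and exit early when it fails, before anything reaches iptables.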
You can test the entire setup by first adding an IP to your firewall block list by running blacklist_ip.sh like so:
$ /var/www/robots_honeypot/blacklist_ip.sh 192.168.1.111
and you can check the result by looking at your iptables rules with:
$ iptables --list
Next you can test the remove_black.sh script that will be running via cron by putting an entry into your /var/www/robots_honeypot/robots_honeypot.log with:
# echo "`date +%s` 192.168.1.111" > /var/www/robots_honeypot/robots_honeypot.log
Now try running the remove_black.sh script with:
# /var/www/robots_honeypot/remove_black.sh
and check your firewall rules with:
# iptables --list
If you run these commands quickly you should see the entry remains in the firewall rules. You should also note that the entry still exists in /var/www/robots_honeypot/robots_honeypot.log:
# cat /var/www/robots_honeypot/robots_honeypot.log
1227620641 192.168.1.111
The script removes an entry only once more than 600 seconds (10 minutes) have passed since it was logged. On the first run after that point it deletes the firewall rule, removes the listing from /var/www/robots_honeypot/robots_honeypot.log and places it in /var/www/robots_honeypot/robots_honeypot.log.expired as a permanent record. There are, however, a number of other ways you could keep a permanent record of who has tripped over the robots.txt honeypot.
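To exercise the expiry path without waiting ten minutes, one option is to write a log entry whose timestamp is already in the past. This is only a sketch; the function name backdated_entry is hypothetical.

```shell
# Sketch: produce a log line that is already older than the timeout, so the
# cleanup script treats it as expired on its next run.
# backdated_entry is a hypothetical helper for testing.
backdated_entry() {
    ip=$1             # address to record
    age_seconds=$2    # how far in the past to stamp the entry
    now=$(date +%s)
    echo "$((now - age_seconds)) $ip"
}

# 601 seconds is just past the 600-second timeout used by the cleanup script.
backdated_entry 192.168.1.111 601
```

Redirecting this output into /var/www/robots_honeypot/robots_honeypot.log, after blacklist_ip.sh has added the rule, and then running the cleanup script should delete the firewall rule and move the entry to robots_honeypot.log.expired.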
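Assuming all of these scripts are in /var/www/robots_honeypot we can schedule a cron job to run the cleanup. The job runs every ten minutes and only removes entries more than ten minutes old, so in practice a ban lasts between ten and twenty minutes. Just add the following to the appropriate crontab:
0,10,20,30,40,50 * * * * /var/www/robots_honeypot/remove_black.sh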
The final step in this entire equation is a trigger on the robots.txt file content. This is a little trickier and will rely on some PHP as well as a custom 404 page that we can use to catch any and all bad requests. You will have to consult the documentation for your version of Apache in order to find the particular modifications you must make in order to call a custom 404 error page. For the purposes of our example let's say we change our 404 error page so that it points to /var/www/error/custom_404.php. We then implement the following script as custom_404.php:
This script will add the offender to the correct log file and email the administrator of the offense.
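A 404 handler sees every missing URL, not only the decoys, so it presumably needs to check the requested path against robots.txt before blacklisting anyone. The shell sketch below illustrates that decision only. It is not the custom_404.php script, and the function name is_decoy, its arguments, and the robots.txt location in the comments are assumptions.

```shell
# Sketch: decide whether a requested path is one of the decoys in robots.txt.
# is_decoy and its arguments are hypothetical names for illustration.
is_decoy() {
    requested=$1      # path that produced the 404, e.g. /cms or /cms/index.php
    robots_file=$2    # path to robots.txt on disk
    real_path=$3      # the listed path that must never trigger a ban

    # Pass values through the environment so awk does not reinterpret
    # backslashes in them.
    REQ=$requested REAL=$real_path awk '
        BEGIN { req = ENVIRON["REQ"]; real = ENVIRON["REAL"] }
        $1 == "Disallow:" && $2 != "" && $2 != real {
            # Match the decoy itself or anything beneath it.
            if (req == $2 || index(req, $2 "/") == 1) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$robots_file"
}

# In a handler, the path and client address would come from the web server.
# For an Apache ErrorDocument handler these are REDIRECT_URL and REMOTE_ADDR;
# the robots.txt location below is an assumption.
# if is_decoy "$REDIRECT_URL" /var/www/html/robots.txt /admin; then
#     /var/www/robots_honeypot/blacklist_ip.sh "$REMOTE_ADDR"
# fi
```

Checking against the file rather than hard-coding the decoys keeps the handler in step with whatever robots.txt currently lists, and ordinary broken links that are not in the list do not get a visitor banned.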
Once we're done with this we're nearly set. The last hassle is to make sure the security permissions are in place for this script to run. In order to do this we'll need to utilize sudo, so that the webserver (in this case the apache account) can make changes to iptables. To do this we have to add the following line to our /etc/sudoers file:
apache localhost = NOPASSWD:/sbin/iptables
This will allow the apache account to issue iptables rules without a password. Be aware that doing this could introduce certain security concerns that you should carefully examine before implementing this solution. For instance, if you use this method and your web server is compromised, then an attacker could manipulate your firewall!
Once all of this is in place your blacklist should be up and running. Now when attackers attempt to scan the contents of files listed in your robots.txt they face the possibility of being blacklisted.