As mentioned in previous entries, I have done quite a bit of work to block hackers from my WordPress sites and reduce the load on the server, but looking over the logs I still saw spikes of activity that I wanted to smooth out. A bit of IP sleuthing revealed the culprits to be various web crawlers. Since I have already screened out most of the malicious bots, some of these are gathering information for the big search engines, while others are aggregating information for the hundreds of business directories and minor search engines. Some bad actors might be in there too, trolling for e-mail addresses in a not-too-overtly evil manner.
Clearly we don’t want to discourage Googlebot and Bingbot (or even the minor aggregators) from crawling the site to index new content, but we do want to keep the crawling to a minimum. We can do that by telling them which files to look at and which to ignore, using sitemaps and robots.txt.
The sitemap is a computer-readable list of all the pages, posts and media on your site. Robots.txt is a file which all crawlers are supposed to read and which lists which directories should be indexed and which should be ignored. Unfortunately, compliance is voluntary.
As part of my search engine optimization I have created XML sitemaps of all the sites and submitted those to Google and Bing webmaster tools. The plugin I like to use is WordPress SEO by Yoast. WPSEO is great for managing the metadata for individual pages, and it automatically creates sitemaps and pings the search engines when you add to a site. In addition to listing the URLs of your site, a sitemap also assigns each one a priority (from 0 to 1.0; the home page is usually 1.0) and a change frequency (daily, weekly, monthly, yearly or never). Unfortunately, you can’t edit these attributes of the sitemaps with Yoast’s plugin.
Yoast sets all posts and pages to be re-crawled weekly. That is too frequent for some content, in my opinion. So I disable the sitemap component in Yoast’s SEO and install the Google XML Sitemaps plugin, which allows me to edit the sitemap.
The main page is set to daily, but static pages and old posts can be set to monthly or even yearly.
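To give you an idea of what the plugin is writing for you, a sitemap entry is just a small block of XML. The sketch below uses placeholder URLs, and the changefreq and priority values are only examples of the kind of settings described above:

&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"&gt;
  &lt;!-- home page: crawled often, highest priority --&gt;
  &lt;url&gt;
    &lt;loc&gt;http://example.com/&lt;/loc&gt;
    &lt;changefreq&gt;daily&lt;/changefreq&gt;
    &lt;priority&gt;1.0&lt;/priority&gt;
  &lt;/url&gt;
  &lt;!-- old static page: rarely changes, lower priority --&gt;
  &lt;url&gt;
    &lt;loc&gt;http://example.com/about/&lt;/loc&gt;
    &lt;changefreq&gt;yearly&lt;/changefreq&gt;
    &lt;priority&gt;0.5&lt;/priority&gt;
  &lt;/url&gt;
&lt;/urlset&gt;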
We can submit our sitemap URL to the major search engines, but what about all those pesky little search engines? That is what robots.txt is for! The robots.txt file goes in the root directory of your site. Perishable Press offers us a good starting point for building the file:
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-
Allow: /wp-content/uploads/
Sitemap: http://example.com/sitemap.xml
This file suggests a list of WordPress directories which should not be crawled and one that should. At the end you place the absolute URL of your sitemap. Good bots will comply with these guidelines, and you should see a marked decrease in unnecessary crawling.
But if it’s voluntary, what about the bad guys? Can’t they look at this as a map of places you don’t want them to go and just go there anyway? Yes, they can, but the White Hats at Perishable Press came up with a little trick they call the Black Hole for Bad Bots. I have installed this on a couple of sites.
As it turns out, the PHP is outdated and the script doesn’t run properly. Sadly, the author no longer seems to be supporting it. It does capture the IPs of offending crawlers, but it does not bar them from crawling the site. For now I will leave it installed and use it as a source of bad IPs to add manually to my .htaccess file using Bulletproof Security.
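Blocking an offending crawler by IP only takes a few lines of .htaccess. This is a rough sketch using the older Apache 2.2 Order/Allow/Deny syntax and placeholder addresses; swap in the offenders from your own logs, and note that Bulletproof Security may manage this file for you:

# Block crawler IPs harvested from the Black Hole log
# (placeholder addresses; replace with the offenders from your own logs)
Order Allow,Deny
Allow from all
Deny from 192.0.2.15
Deny from 198.51.100.23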
After a couple of weeks running all my sites using the above protocols, I have been able to lower the RAM allocation on the VPS by one third with no resets.