I don’t see why anyone would use anything but Google to begin with, but if you use MSN Live as your search engine, you might want to rethink it. Here’s why:
Search engines use programs called “spiders” or “bots” to index web pages. These spiders crawl the Internet, downloading the content of web pages to include in the search database. A spider fetches the home page of a site, finds all the links in the page, and downloads the content behind those links. This is all fine and dandy…if they follow the rules.
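To make the “find all the links in the page” step concrete, here is a minimal sketch in Python using only the standard library. It is not how any real search engine's spider is written; the names LinkCollector and extract_links are made up for illustration, and it only looks at plain <a href="..."> links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the target of every <a href> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs linked from the given HTML, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


page = '<a href="/about/">About</a> <a href="http://example.org/x">Elsewhere</a>'
links = extract_links("http://example.com/", page)
```

A well-behaved spider would check each of those URLs against the site's rules before downloading it.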
Since some people don’t want certain parts of their website to be searchable, they can give a spider visiting their site a list of places it is and is not allowed to go. These rules are put into a file called robots.txt (to view my rules, take a look at http://andyonline.org/robots.txt). All major search engines claim to follow these rules, including MSN. However, I have caught MSN breaking the rules on more than one occasion.
Search engines are not the only ones who use spiders. The baddies of the Internet commonly use them too, to harvest email addresses for spam or to find sites that are vulnerable to hacking. These bots rarely follow your instructions.
There are various techniques that I utilize to prevent bad spiders from accessing this site while still allowing legitimate visitors. One of those is a “bad bot” trap. There is a link on this site that is only visible to spiders, which I instruct them not to visit. If that link is visited, it bans that computer from visiting my site in the future. This link points to http://andyonline.org/bot-trap/ (for the love of God, don’t go there or you won’t be able to access this site again).
My bot trap has caught the MSN Live spider…twice.
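For the curious, the idea behind a trap like this can be sketched in a few lines of Python. This is not the code running on this site; the BotTrap class, the in-memory ban list, and the 403 response are all assumptions made for illustration. The one detail taken from real life is that the trap page itself still loads, and the ban only kicks in on later requests.

```python
class BotTrap:
    """Hypothetical sketch: ban any address that requests the trap path."""

    def __init__(self, trap_prefix="/bot-trap/"):
        self.trap_prefix = trap_prefix
        self.banned = set()  # assumption: bans are kept in memory, keyed by IP

    def handle(self, ip, path):
        """Return the HTTP status this sketch would answer with."""
        if ip in self.banned:
            return 403  # already banned: refuse everything
        if path.startswith(self.trap_prefix):
            # The trap page is served once, then the visitor is banned.
            self.banned.add(ip)
        return 200


trap = BotTrap()
first = trap.handle("203.0.113.5", "/")                     # normal visit
sprung = trap.handle("203.0.113.5", "/bot-trap/index.php")  # walks into the trap
after = trap.handle("203.0.113.5", "/")                     # now locked out
```

A real setup would need to persist the ban list and enforce it at the web server level, but the logic is the same.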
Here is the current content of my robots.txt file:
User-agent: *
Disallow: /wp-admin/
Disallow: /podcast/
Disallow: /ua/
Disallow: /bot-trap/
“User-agent: *” is a blanket statement meaning “any spider visiting this site”. The next four lines detail the folders that I don’t want crawled.
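If you want to check what those rules mean for a given URL, Python ships a robots.txt parser in its standard library. The sketch below feeds it the rules listed above and asks whether a spider may fetch the trap page. The user-agent string is just an example; any name would get the same answer here because the rules apply to every spider.

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as listed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /podcast/
Disallow: /ua/
Disallow: /bot-trap/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A rule-following spider must get "no" for anything under /bot-trap/ ...
trap_allowed = parser.can_fetch("msnbot/2.0b", "http://andyonline.org/bot-trap/index.php")
# ... and "yes" for the rest of the site, such as the home page.
home_allowed = parser.can_fetch("msnbot/2.0b", "http://andyonline.org/")
```

With these rules, trap_allowed comes out False and home_allowed comes out True.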
Here is an excerpt from my site access log from early this morning:
65.55.106.112 - - [14/May/2009:02:31:33 -0600] "GET /robots.txt HTTP/1.1" 200 91 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.112 - - [14/May/2009:02:32:29 -0600] "GET /bot-trap/index.php HTTP/1.1" 200 1892 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
The funny thing is that the MSN spider grabbed my robots.txt file right before it got itself banned.
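Spotting this pattern by eye gets old, so here is a rough Python sketch that scans log lines for it. It assumes the log is in the usual Apache-style combined format, which is what the excerpt above looks like, with plain ASCII quotes and dashes. The regular expression and the find_violations function are my own illustration, not part of any logging tool.

```python
import re

# Assumed layout: ip, two ident fields, [time], "request", status, size, "referer", "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The folders disallowed in robots.txt.
DISALLOWED = ("/wp-admin/", "/podcast/", "/ua/", "/bot-trap/")

SAMPLE_LOG = [
    '65.55.106.112 - - [14/May/2009:02:31:33 -0600] "GET /robots.txt HTTP/1.1" '
    '200 91 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '65.55.106.112 - - [14/May/2009:02:32:29 -0600] "GET /bot-trap/index.php HTTP/1.1" '
    '200 1892 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
]


def find_violations(lines):
    """Return (ip, path, had_read_robots) for each request to a disallowed folder."""
    fetched_robots = set()
    violations = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        ip, path = match.group("ip"), match.group("path")
        if path == "/robots.txt":
            fetched_robots.add(ip)
        elif path.startswith(DISALLOWED):
            violations.append((ip, path, ip in fetched_robots))
    return violations


violations = find_violations(SAMPLE_LOG)
```

The third element of each result is the damning part: it is True when the same address had already downloaded robots.txt earlier in the log.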
This exact scenario played itself out last October. I unbanned their spider and tried to contact MSN tech support to tell them to stop being jerks, but their support system is a tangle of help pages and canned responses. Their basic response was “Just deal with it”. They said that I must have recently changed my robots.txt file and that their spider hadn’t caught up with the changes. I call shenanigans. I haven’t modified my robots.txt file since August 2008. At the time of the first incident, it hadn’t changed in 2 months; at the time of the second, it hadn’t changed in 9.
The MSN Live spiders blatantly go where they don’t belong.
I’m done cleaning up the mess from the MSN spider plowing through like a bulldozer in a sandbox. I am no longer unbanning the MSN spider when it goes where it doesn’t belong. It should know better. Will it affect this site showing up in MSN Live Search? Possibly. I’m not too worried about that though. MSN has plenty of other unbanned spiders still happily crawling away at this site. Plus, most of my search engine traffic comes from Google anyway. Over the last 30 days, traffic to this site originating from Google outnumbered traffic from MSN by 8:1.
I’m not the only one who uses a bot trap. How many other web sites have blacklisted the MSN spider? How much information is unavailable through MSN Live Search because of their behavior?
MSN uses unfriendly tactics when building their search database. You shouldn’t support their behavior.
Don’t use MSN Live Search.