WhiteSites Blog

Protecting your website from Rogue Spiders

Posted on Apr 13, 2009 by Paul White

Here is the easiest way to tell the difference between spiders and real visitors (a quick ASP.NET illustration follows the list):
1. Spiders create a new session on every request.
2. Real visitors can maintain session state.
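In ASP.NET terms (the platform used for the example later in this post), that difference surfaces through the session cookie. Here is a small sketch for global.asax; the "SeenBefore" key is just a placeholder of mine:

// Session_Start only fires when a request arrives without a usable
// ASP.NET_SessionId cookie. A real browser triggers it once and then reuses
// the same session; a spider that ignores cookies triggers it on every request.
protected void Session_Start(object sender, EventArgs e)
{
    // Mark the session so later requests from this client prove they kept state.
    Session["SeenBefore"] = true;
}

// Elsewhere, in a page or handler, a client that kept its session looks like:
//   bool keptSession = !Session.IsNewSession;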

So the question becomes: how should you manage spiders? Some of them you want crawling your website, such as Google, Yahoo, MSN, Baidu, etc. Others are nothing more than hackers and spammers attempting to scrape your website for photos, emails, and articles. It's bad enough that these bad spiders are trying to steal content, but even worse, they usually don't crawl your website like a traditional search engine spider. They will go to the first page on your site, find all the links on that page, and begin tearing through each page at a very fast rate. That would be fine if you had a static website that could handle the load, but if you are running any kind of dynamic website, it causes requests to pile up in your request queue. It can also open too many database connections to SQL, bringing your website to a halt.

How do I stop bad spiders and bots?

First, you have to take things into your own hands. The example I will show works for ASP.NET. The basic logic should also work for PHP and other server platforms, but don't ask me for code examples.

First, you need to maintain a table in SQL Server or MySQL that stores every session that gets created. The table should hold the session ID, the date and time it was created, and the IP address. You can also store the user agent if you want a little more info about the visitor.

Next, tie into the Session_Start event in global.asax. Every time a new session is created, log it into the table you created earlier. Then add some code to your Application_BeginRequest event. That code should check the table to see how many entries exist for the requesting IP within the last 24 hours. If it finds more than, let's say, 20, blacklist the IP address. But don't just blindly blacklist it: have your server email you an alert with the details of the visitor, so you can research the IP and decide whether it is a false positive. This check will also catch the good IPs, so you need to maintain a whitelist of IPs that get unrestricted access to the site. Never grant access based on user agent; user agents can be changed and cloaked. IPs, not so much. A bad spider could use a proxy to get around this, but of course after 20 requests that proxy IP would get caught by your trap too. A sketch of the whole flow follows below.
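Here is a rough global.asax.cs sketch of that flow, assuming SQL Server through System.Data.SqlClient. The table name (SpiderSessions), its columns, the connection string key (MainDB), the email addresses, and the hard-coded whitelist are all placeholders of mine, and the SMTP settings are assumed to already be configured in web.config:

// global.asax.cs -- a sketch of the logging and throttling described above.
//
// Assumed table (adjust names and types to taste):
//   CREATE TABLE SpiderSessions (
//       SessionId  VARCHAR(50),
//       IpAddress  VARCHAR(45),
//       UserAgent  VARCHAR(255),
//       CreatedOn  DATETIME
//   );

using System;
using System.Collections.Generic;
using System.Configuration;
using System.Data.SqlClient;
using System.Net.Mail;
using System.Web;

public class Global : HttpApplication
{
    static readonly string ConnStr =
        ConfigurationManager.ConnectionStrings["MainDB"].ConnectionString;

    // IPs that always get through (your own office, known good crawlers, ...).
    // In practice, load this from SQL or web.config instead of hard-coding it.
    static readonly HashSet<string> Whitelist = new HashSet<string> { "127.0.0.1" };

    // Log every new session: session ID, IP, user agent, and creation time.
    protected void Session_Start(object sender, EventArgs e)
    {
        using (var con = new SqlConnection(ConnStr))
        using (var cmd = new SqlCommand(
            "INSERT INTO SpiderSessions (SessionId, IpAddress, UserAgent, CreatedOn) " +
            "VALUES (@sid, @ip, @ua, GETUTCDATE())", con))
        {
            cmd.Parameters.AddWithValue("@sid", Session.SessionID);
            cmd.Parameters.AddWithValue("@ip", Request.UserHostAddress);
            cmd.Parameters.AddWithValue("@ua", Request.UserAgent ?? "");
            con.Open();
            cmd.ExecuteNonQuery();
        }
    }

    // On every request, count the sessions this IP has created in the last
    // 24 hours. Too many (and not whitelisted) means block and send an alert.
    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        string ip = Request.UserHostAddress;
        if (Whitelist.Contains(ip)) return;

        int sessionCount;
        using (var con = new SqlConnection(ConnStr))
        using (var cmd = new SqlCommand(
            "SELECT COUNT(*) FROM SpiderSessions " +
            "WHERE IpAddress = @ip AND CreatedOn > DATEADD(HOUR, -24, GETUTCDATE())", con))
        {
            cmd.Parameters.AddWithValue("@ip", ip);
            con.Open();
            sessionCount = (int)cmd.ExecuteScalar();
        }

        if (sessionCount > 20)
        {
            // Alert yourself so false positives can be moved to the whitelist.
            // (A real implementation would throttle this so it doesn't email
            // on every blocked request.)
            new SmtpClient().Send("alerts@example.com", "you@example.com",
                "Possible rogue spider: " + ip,
                "Sessions in last 24 hours: " + sessionCount +
                "\nUser agent: " + Request.UserAgent);

            Response.StatusCode = 403;
            Response.End();
        }
    }
}

Counting sessions per IP, rather than raw requests, is what keeps normal visitors out of the trap: a browser that returns its session cookie only adds a handful of rows no matter how many pages it loads, while a cookie-ignoring spider adds one row per request.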
