I consider myself to be pretty good at scraping web sites. I've been able to get into sites others thought were impossible. I've even made it through some pretty tricky login verification. My tool of choice to accomplish this is VB.Net and a simple class I've written that re-implements much of System.Net.WebClient and extends it to support a few additional functions. Unfortunately, I'm not ready to release that class here yet; there are still some issues with it I want to work out first.
The real key to scraping a web site isn't the technology, anyway. Scraping a given page, once you have it, is and always will be rather trivial. Things get tricky when you have to make several requests in sequence to get the server to create the page you want. With that in mind, the key to successfully scraping a web site is simply to study the client source for the site until you can accurately reproduce HTTP requests that are the same as, or sufficiently similar to, those issued by a web browser under the command of a normal user. This may mean parsing some very nasty JavaScript now and then, but that's the way it works. Of course, there are tools that can help with this, but when it comes down to it you usually just need to be able to read the code.
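To make the "several requests in sequence" idea concrete, here is a minimal sketch in Python (not the VB.Net class mentioned above, which isn't published). It keeps cookies between requests and builds a form POST the way a browser would. The header values, the function names make_session and form_post, and the example.com URLs are all made up for illustration; a real site will need whatever its own pages actually send.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical header values; copy the real ones from the browser you are imitating.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; ExampleScraper/0.1)",
    "Accept": "text/html,application/xhtml+xml",
}


def make_session():
    """Return an opener that remembers cookies between requests, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replace urllib's default User-agent with the browser-like headers above.
    opener.addheaders = list(BROWSER_HEADERS.items())
    return opener, jar


def form_post(url, fields, referer=None):
    """Build a POST request shaped like a browser form submission."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(url, data=body)
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    if referer:
        # Some servers check where the form was supposedly submitted from.
        request.add_header("Referer", referer)
    return request


# Typical sequence (not run here, the URLs are placeholders):
#   opener, jar = make_session()
#   opener.open("https://example.com/login")            # picks up session cookies
#   opener.open(form_post("https://example.com/login",
#                         {"user": "alice", "password": "secret"},
#                         referer="https://example.com/login"))
#   page = opener.open("https://example.com/report").read()
```

The point is only that the cookie jar carries state from one request to the next; working out which fields and headers each step needs still comes from reading the site's own source.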
Today I was helping someone scrape an ASP.Net site. This was my first time scraping ASP.Net, which surprised me considering it's my web platform of choice. I was also shocked to discover that ASP.Net can be unusually difficult to scrape. Perhaps in hindsight I should have known this, but it caught me unaware this morning.
You see, ASP.Net pages include a few extra things by default that must go with every request, such as the hidden __VIEWSTATE field. The server normally does some basic validation on that state, so just sending an empty view state may not cut it. Also, most server controls send requests using very cryptic IDs via a __doPostBack() JavaScript function, which is actually quite difficult to follow. More than that, since it's so easy to push the work for simple controls to the server, it's very easy to obfuscate what a particular link is really doing. So easy that you may even end up hiding things by accident.
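As a rough illustration of what that means in practice, here is a small Python sketch that collects the hidden fields from a page and fills in the two fields that __doPostBack() normally sets before submitting the form, __EVENTTARGET and __EVENTARGUMENT. The class and function names, the sample HTML, and the control ID are invented for the example, and it assumes the link uses the plain __doPostBack('target','argument') form; real pages may wrap the call in other script.

```python
import re
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if (attrs.get("type") or "").lower() == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


# Matches the simple form: __doPostBack('target','argument')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def postback_args(href):
    """Return (event_target, event_argument) from a link, or None if it is not a postback."""
    match = POSTBACK_RE.search(href)
    return match.groups() if match else None


def build_postback_fields(page_html, href):
    """Return the form fields a browser would submit when the given link is clicked."""
    args = postback_args(href)
    if args is None:
        raise ValueError("link does not call __doPostBack")
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)  # __VIEWSTATE and friends, echoed back unchanged
    fields["__EVENTTARGET"], fields["__EVENTARGUMENT"] = args
    return fields


# Invented sample page, only for demonstration.
SAMPLE_HTML = (
    '<form><input type="hidden" name="__VIEWSTATE" value="abc123" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="xyz" />'
    "<a href=\"javascript:__doPostBack('ctl00$Main$lnkNext','')\">Next</a></form>"
)
SAMPLE_LINK = "javascript:__doPostBack('ctl00$Main$lnkNext','')"
```

Calling build_postback_fields(SAMPLE_HTML, SAMPLE_LINK) gives a dictionary you could URL-encode and POST back to the same page. Any visible form fields the page expects would have to be added by hand.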
I figure after I get a little more experience scraping these pages I'll discover there's a trick to it, and once you know the trick they may even turn out to be easier. In fact, I would expect that knowing a site uses ASP.Net would allow you to make certain assumptions about what fields you need to submit and how to submit them.
So if you have a web site and you want to protect it from scrapers, well... there's really not much you can do. Once a page is sent to a web browser a smart programmer will always be able to decipher it. But you could do worse than to choose ASP.Net.
Get out your tinfoil hats. It turns out that a small part of Windows was written by a machine. You have to read through most of a rather boring post to see what I'm talking about, and if you blink you might even miss it, but it's there. This isn't an official Microsoft statement, but it is an officially sanctioned blog of a senior Microsoft engineer. I'm sensationalizing this more than a little, but I'm sure there are those who will see this as a sort of slippery slope and wonder where it ends.
Sorry for missing my Friday update. I decided to put off the Vista review for another week. I should finally be ready to write it this time, though it may take long enough to compose that I don't post it until Monday.
So what happens if you build a custom 404 page, and something happens so the web server can't find it?