
IWMW, Amazon Web Services and hKit

Today I attended the Institutional Web Management Workshop at the University of York to present on microformats. I was delivering a revised version of the Can Your Website Be Your API presentation I’ve given a couple of times over the last year, and had failed to appreciate that each time I add new material it takes a bit longer to get through to the end. Who knew? Anyway, I managed to squeeze it into 45 minutes, and it seemed to be well received.

Presenting before me was Jeff Barr, Web Services Evangelist from Amazon. I first met Jeff at d.Construct last year where, unsurprisingly, he was presenting on the same topic of Amazon’s web services. Last time I was podcasting the session and so could only give half my attention to the content of Jeff’s presentation, so I was pleased to get the opportunity to soak it up properly this time.

My conclusion? Amazon have some really excellent, low-cost services. S3 (the online storage service) I already knew about and understood to a degree. The concept is simple – you fling some files up into the cloud, and there they stay, stored redundantly on Amazon’s servers. What I didn’t fully appreciate was that access can be finely controlled through an ACL – meaning that not only can backups be safely kept private, but resources such as web assets or ‘downloads’ (software or podcasts or whatever) can be made fully public and therefore take advantage of Amazon’s high availability infrastructure. Of course, S3 charges on the basis of both storage space and data transfer (so you may want to think twice about using it to publish a freely available podcast, for example), but for things that really matter those costs seem very reasonable.
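To make the ACL point concrete, here’s a minimal sketch of the two cases – a private backup and a public download – in Python using Amazon’s boto3 SDK (my choice of tooling, not anything from Jeff’s talk; the bucket and key names are made up):

```python
import boto3  # Amazon's Python SDK (an assumption of mine)

s3 = boto3.client("s3")

# A backup that should stay private: the 'private' canned ACL
# restricts access to the bucket owner.
s3.put_object(
    Bucket="my-backups",  # hypothetical bucket name
    Key="db-dump.sql.gz",
    Body=open("db-dump.sql.gz", "rb"),
    ACL="private",
)

# A podcast episode meant for everyone: 'public-read' lets anyone
# fetch it, served straight from Amazon's infrastructure.
s3.put_object(
    Bucket="my-podcast-assets",  # hypothetical bucket name
    Key="episode-42.mp3",
    Body=open("episode-42.mp3", "rb"),
    ACL="public-read",
)
```

The only difference between a locked-down backup and a world-readable asset is that one flag on the upload.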

What I really missed last time (or perhaps the service wasn’t available or ready at that point) was the potential of their EC2 (Elastic Compute Cloud) service. This is basically a service where you can rent virtual servers by the ‘compute hour’. I’m not sure of the finer details of how that works, but the concept is that you can programmatically bring servers online to perform whatever task you like, as you need it, just in time. That task can be almost anything – from performing a big batch job like processing a bunch of images, through to just providing an additional web server to help cope with load. The virtual servers have a good spec (something like a 1.7GHz CPU, I think 1.7GB RAM and 160GB disc), and data transfer between them and the S3 storage system is free. If you have a bunch of data on S3 you could bring up an EC2 server to grab it, process it and put it back, and you only pay for the compute time, not the transfer in or out of either EC2 or S3.
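As a rough sketch of that grab-process-put-back pattern (Python again, with boto3 and the Pillow imaging library – both my assumptions; the bucket name and thumbnail size are placeholders), the batch job running on the EC2 instance might look something like this:

```python
import boto3
from PIL import Image  # Pillow, assumed installed on the instance

s3 = boto3.client("s3")
BUCKET = "my-image-store"  # hypothetical bucket

# List the raw uploads, process each one, and write the result
# back to S3. Transfer to and from S3 is free from inside EC2,
# so only the compute hours are billed.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/")
for obj in resp.get("Contents", []):
    key = obj["Key"]
    if key.endswith("/"):
        continue  # skip folder placeholder objects

    local = "/tmp/work.jpg"
    s3.download_file(BUCKET, key, local)

    img = Image.open(local)
    img.thumbnail((640, 480))  # resize in place, preserving aspect ratio
    img.save(local, "JPEG")

    s3.upload_file(local, BUCKET, key.replace("raw/", "thumbs/"))
```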

For most standard web applications, it’s not often all that useful to be able to bring up an additional database server out in the cloud to help you with load. That sort of thing needs to be designed for from the start, and for a lot of applications just wouldn’t work architecturally anyway. Another option would be to host your entire application out on the cloud on a bunch of EC2 servers, full time. Depending on your needs, that could be cost effective compared with renting from a conventional hosting company. You do need to have quite a bit of trust in Amazon at that point, but I suspect many would consider Amazon more trustworthy than a lot of fly-by-night hosting companies anyway. The big advantage of hosting entirely on EC2, of course, would be that if you experience a spike in traffic you can just bring more servers online, right in the same data centre as your primary servers, and you only pay for what you use. Once traffic subsides, you can drop back down to normal, as sketched below. (It’s worth noting at this point that EC2 has also accounted for the need for multiple servers to share a secure local networking environment.)
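Here’s a hedged sketch of that elasticity, with boto3 once more. The AMI ID is a placeholder for an image of your configured web server, the instance type is the small class of the era, and a real setup would also need the new machines registered with whatever is balancing the load:

```python
import boto3

ec2 = boto3.client("ec2")

def scale_up(count):
    """Bring extra web servers online to ride out a traffic spike."""
    resp = ec2.run_instances(
        ImageId="ami-xxxxxxxx",   # placeholder: an image of your web server
        InstanceType="m1.small",  # the ~1.7GB RAM class described above
        MinCount=count,
        MaxCount=count,
    )
    return [i["InstanceId"] for i in resp["Instances"]]

def scale_down(instance_ids):
    """Traffic has subsided: stop paying for the extra capacity."""
    ec2.terminate_instances(InstanceIds=instance_ids)

# e.g. weather a spike with three extra servers, then release them
extras = scale_up(3)
# ... later, once load returns to normal ...
scale_down(extras)
```

The appeal is that scaling up and down are just API calls, so the decision to add capacity could even be made by a script watching your load average.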

This got me to thinking. For the last year or so, I’ve been hosting a service at tools.microformatic.com for people wishing to make casual use of the hKit microformat parser to extract microformatted data from a page. Pass in a URI and an output format (either plain text, serialised PHP or JSON) and the service fetches the page, parses it and returns the result. It’s very similar conceptually to how Technorati’s hosted version of X2V works.
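For illustration, here’s how little there is to a service like that. This isn’t the actual hKit code (which is PHP); it’s a hypothetical Python/Flask stand-in with made-up parameter names and a stubbed-out parser, but it shows the essential shape – fetch, parse, serialise, return:

```python
import json
import urllib.request

from flask import Flask, Response, request

app = Flask(__name__)

def parse_microformats(html):
    """Stub standing in for a real parser such as hKit: extract
    microformatted data from the page markup."""
    return {"hcard": [], "bytes_scanned": len(html)}  # placeholder result

@app.route("/parse")
def parse():
    uri = request.args.get("uri")             # page to fetch (parameter name invented)
    fmt = request.args.get("format", "json")  # output format, defaulting to JSON

    html = urllib.request.urlopen(uri).read().decode("utf-8", "replace")
    data = parse_microformats(html)

    if fmt == "json":
        return Response(json.dumps(data), mimetype="application/json")
    return Response(repr(data), mimetype="text/plain")

if __name__ == "__main__":
    app.run()
```

Note that nothing is stored between requests, which is exactly what makes the scaling argument below work.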

Now, this is all well and good: it works just fine for people running tests to validate that they’ve implemented a particular microformat in an understandable way, and it copes with reasonable traffic, as we saw recently with the Last.fm shoutbox thing, which is another service on the same box. However, it’s not a redundant, scalable and utterly reliable system that you could start building applications on top of. So what if I were to reimplement this service on top of EC2? There are no databases involved – in fact, the service holds no data at all – so architecturally, dealing with extra load should literally be a case of bringing another server online. Amazon claims to have 99.99% uptime on these things, which sounds pretty astonishing for such a low cost.

It certainly sounds like something that would be more reliable and dependable than my little server on its own, and possibly something that people would feel comfortable enough to build services on top of. With the cost from Amazon being as low as it is, it’s certainly in the realms of something that could be paid for by running a bit of advertising or perhaps seeking the odd bit of micropatronage.

EC2 is still in beta, but if I can manage to get access it might be something worth giving a go.