Your Epic Fail: Fast or Slow?

In the load balancing world, many vendors have the concept of “sorry servers”, or “backup server farms/pools”.  Essentially, if most or all of your primary servers are down, traffic is redirected to one or more backup servers that hold either reinforcements of the same web application or a “sorry” page.
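To make the redirect decision concrete, here is a minimal sketch in Python of how that choice might look. It is not any vendor's actual algorithm: the Server class, the choose_pool function, and the min_healthy threshold are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool


def choose_pool(primary, backup, min_healthy=1):
    """Return the servers that should receive traffic.

    min_healthy is a hypothetical threshold, not a setting taken from
    any specific load balancer.
    """
    alive = [s for s in primary if s.healthy]
    if len(alive) >= min_healthy:
        return alive
    # Too few primaries left: fall back to the healthy backup ("sorry") servers.
    return [s for s in backup if s.healthy]


primary = [Server("web1", False), Server("web2", False)]
backup = [Server("sorry1", True)]
print([s.name for s in choose_pool(primary, backup)])  # ['sorry1']
```

Raising min_healthy models the “most of your primary servers are down” case: the backup pool takes over before the last primary dies.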

The idea is that if everything goes terribly wrong, at least your visitors will see something, instead of nothing.

Which raises the question: How do you like to fail?  Fail fast or fail slow? Would it be better to fail slow, where your site becomes slower and slower, or possibly just unresponsive, or would it be better to put up a quick-serving sorry page if the infrastructure melts?

A wildly successful website can easily become a victim of its own success.  Take the case of two sites that experienced exponential growth in a relatively short period of time:  Twitter.com and Myspace.com.

They took two different paths in the realm of failure.  One failed fast, and one failed slow.

Although Myspace has lost most of its lead to Facebook, it’s still a wildly popular social media site.  They had exponential growth from their start in 2003, and there were many periods of time when Myspace.com was just… slow.  Really really slow. You can’t really blame them.  It’s tough when users come faster than you can install servers and provision bandwidth.  It’s a happy problem to have usually, but it’s still a logistical challenge.


Twitter.com came around a bit later, but it also had exponential growth and problems coping.  But for the most part, they failed in a different way:  Fail Whale. When something went terribly awry, instead of a slow site, you’d get a very quick fail whale image.

Perhaps this is a matter of personal opinion, but I think if you’re going to fail, it’s better to fail quickly than to fail slowly.  That is, have a sorry page or sorry site that comes up quickly, rather than a site that is too slow for anyone to use.

The quick sorry page can be done with many of the load balancing/ADC vendors by using the backup/sorry serverfarm feature.  Keeping a group of reserve servers that serve up only an “oops, sorry about that” type of page, your own fail whale, can be better than having a really slow or unresponsive web site.
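What runs on those reserve servers can be very small. The sketch below uses Python's standard http.server module to answer every GET with a static page. The 503 status, the Retry-After value, the page text, and the names SorryHandler and make_server are my own assumptions for illustration, not something a particular vendor requires.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Static body kept in memory so every response is cheap to serve.
SORRY_BODY = (
    b"<html><body><h1>Oops, sorry about that.</h1>"
    b"<p>We'll be back shortly.</p></body></html>"
)


class SorryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 503 tells clients and crawlers the outage is temporary (assumed choice).
        self.send_response(503)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(SORRY_BODY)))
        self.send_header("Retry-After", "120")  # hypothetical hint, in seconds
        self.end_headers()
        self.wfile.write(SORRY_BODY)

    def log_message(self, format, *args):
        # Skip per-request logging; the sorry server should do as little as possible.
        pass


def make_server(port=8080):
    """Build a sorry-page server bound to localhost on the given port."""
    return HTTPServer(("127.0.0.1", port), SorryHandler)
```

Calling make_server().serve_forever() starts it. Because the page is static and touches no database or application code, it stays fast even when everything behind the primary pool has melted.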

Of course, you won’t always be able to choose the method of your failure.  If your upstream ISP goes dark, there’s not much you can do (unless you have an offsite fail site).  But I personally think having a fail site is a more “professional” way to fail than having a slow or unresponsive site when things go belly up (and we all know they will).
