As a co-admin of a forum and a WordPress blog, I found this post useful and wanted to repost it here. There are many books and inspirational stories that tell us the right way to do things, but life isn’t easy, and learning from failures is often much more effective than “traditional” inspiration.
You can find the original post here:
A high availability website: It’s all about expensive redundant hardware with top of the line load balancers and an enterprise class SAN, right? Well, not necessarily.
There are several cheap or free steps you can take to ensure uptime before you splash your cash on lots of kit. If you’re not taking these steps you are wasting your money on your redundant hardware. Here are my top five things that bring sites down, in rough order of likeliness …
- Change
- Unexpected load
- Slow death
- Time related issues
- Hardware failure
Starting from the top, here is my explanation of each of these categories, together with ways to minimise the chance of it happening to your site.
#1: Change
This is the biggest category here. If your website was just fine yesterday, and today it’s bust, you probably changed something.
Maybe you just released a new version of your software? Upgraded a third party product? Or changed the network configuration? Even minor, apparently safe changes can have unforeseen side effects. So the number one measure you can take to maximise your site uptime is the unsexy, dull and constraining world of change control.
There are some very basic things to get right here: test your software before you release it. Test it properly even if you think it’s obvious it’s going to work. In fact you also need to test releasing it, which means you need development and test instances of your application, as well as your live one.
Develop your changes in the development environment, and work out the release plan – including how you would reverse the change if you needed to. Follow that release plan in the test environment and test the result. Test it really thoroughly because it’s much easier (not to mention much less stressful) to fix it here than once it’s live – you are not just testing the release, but also the release process itself, so the test environment should be as like the live one as possible.
This is pretty basic stuff, and there’s no excuse for not doing it, however small your company is, especially with virtualisation making it easy to run multiple environments on one physical server.
Change control is about three simple things:
- Giving some thought to changes you make to your live system and their potential risks
- Making changes one at a time so if something breaks you know what the last change was
- Writing down what you changed and why so that everyone can see.
Get this under control and you will have avoided the vast majority of your potential outages.
#2: Unexpected Load
This is the Digg/TechCrunch effect. Someone writes something nice about your web app, the buzz spreads, and before you know it you have an order of magnitude more traffic to your website than you ever planned for and the whole thing melts.
So what can you do about it? Well of course you could buy racks of hardware just in case, but that’s not really practical. Here are some more realistic suggestions:
Know your capacity. This involves performance testing your application in advance. Set a level for what you consider to be an acceptable response time, and ramp up simulated users carrying out typical tasks until you exceed that threshold. From here you can establish performance bottlenecks and tune your application to increase that capacity (how to do this is a massive subject in its own right), but even without this tuning, you will at least know what kind of spike in load you can support.
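As a rough sketch of the last step, here is how you might read your capacity off a set of load test results. The function name `max_supported_users` and the sample figures are made up for illustration; real numbers would come from whatever load testing tool you use.

```python
def max_supported_users(results, max_response_sec):
    """Return the largest simulated user count that stayed within the threshold.

    results: list of (simulated_users, avg_response_sec) tuples, in the order
    the load was ramped up. Stops at the first step that exceeds the threshold.
    """
    supported = 0
    for users, response_sec in results:
        if response_sec > max_response_sec:
            break
        supported = users
    return supported


# Hypothetical ramp-up results: (simulated users, average response time in seconds)
ramp_results = [(50, 0.4), (100, 0.6), (200, 1.1), (400, 3.5)]
capacity = max_supported_users(ramp_results, max_response_sec=1.5)
```

With these invented figures the threshold of 1.5 seconds is first exceeded at 400 users, so the known-good capacity is 200.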
It is also worth investigating on-demand services such as Amazon EC2 to increase capacity at short notice either in anticipation of or in response to a big traffic spike.
Spread your PR. A big bang PR launch of your new site could leave your reputation in tatters just as everyone’s eyes are on you. But is the big bang really necessary?
If you have developed a major upgrade to your web application, then it may well be very newsworthy and you may want to tell your entire userbase about it as soon as possible. But the last thing you want is every single registered user hitting your site in the hour following your newsletter, and if it is exciting news for your users today, it will still be exciting news for them tomorrow or the day after.
Segment your user base, tell them the exciting news a segment at a time, and use the early data to determine whether you need to speed up or slow down subsequent mailings. The same goes for initial launches of your site: look at options to make announcements in one geography at a time or launch to a limited initial user base.
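Splitting a mailing list into segments can be very simple. The sketch below assumes a plain list of user IDs and a hypothetical helper called `segment_users`; how you actually send each batch, and how long you wait between batches, depends on your mailing system.

```python
def segment_users(user_ids, segment_count):
    """Split user_ids into segment_count roughly equal groups (round-robin)."""
    return [user_ids[i::segment_count] for i in range(segment_count)]


# Hypothetical user IDs; send to one segment, check the load, then decide on the next.
segments = segment_users([1, 2, 3, 4, 5], segment_count=2)
```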
Degrade gracefully. It is better to give full service to a percentage of your users, and a helpful static message to the rest, than for your entire site to be entirely unresponsive. You’ll need to specifically code for this.
In general a production web application will be constrained on the number of requests that can be effectively processed at any one time, and you should have a good idea of what the upper limit is from following the “know your capacity” advice above.
For a fixed number of incoming requests per second, the number of requests in progress at any one time depends on how long they take to process: 10 requests a second that take 1 second each to process means that you will have 10 requests in progress at any one time. If it starts to take 2 seconds to process each, then you will have 20 requests in progress at any one time, which will probably make your requests take slightly longer still. This slowdown continues until you reach a tipping point where your service grinds to a halt, and any responses that do get returned take much longer than the browser timeout.
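The arithmetic in that example is an instance of what queueing theory calls Little's law: average requests in progress equals arrival rate multiplied by average processing time. A minimal sketch, with a function name chosen purely for illustration:

```python
def requests_in_progress(arrival_rate_per_sec, service_time_sec):
    """Little's law: average concurrency = arrival rate x average time per request."""
    return arrival_rate_per_sec * service_time_sec


print(requests_in_progress(10, 1))  # 10 requests a second at 1 second each -> 10
print(requests_in_progress(10, 2))  # same rate, 2 seconds each -> 20
```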
There are a few solutions to this, ranging from a simple restriction on simultaneous connections (pick the largest number you know you can handle) to queuing requests for asynchronous processing rather than tying up a thread for each (detail of these is beyond the scope of this article). Do investigate and implement a suitable solution for your technology stack.
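To make the simplest option concrete, here is a sketch of a cap on simultaneous requests using a semaphore from Python's standard library. The class name `ConcurrencyLimiter`, the `(status, body)` return shape and the wording of the busy message are all assumptions for illustration; most web servers and frameworks have their own setting for this, which is usually the better place to start.

```python
import threading


class ConcurrencyLimiter:
    """Serve at most max_in_progress requests at once; turn the rest away politely."""

    BUSY_MESSAGE = "Sorry, we are very busy right now. Please try again in a few minutes."

    def __init__(self, max_in_progress):
        self._slots = threading.BoundedSemaphore(max_in_progress)

    def handle(self, process_request):
        # Non-blocking acquire: if every slot is taken, return the static
        # message immediately instead of queuing behind slow requests.
        if not self._slots.acquire(blocking=False):
            return 503, self.BUSY_MESSAGE
        try:
            return 200, process_request()
        finally:
            # Always free the slot, even if process_request raised.
            self._slots.release()


limiter = ConcurrencyLimiter(max_in_progress=100)
status, body = limiter.handle(lambda: "full page content")
```

The important design choice is that rejected requests cost almost nothing, so the requests you do accept keep their normal response time.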
#3: Slow Death
This is the slow incremental use of a resource over time that goes unnoticed until one day you hit a critical limit. The obvious contenders here are disk space and memory (being eaten up over time by a slow memory leak).
This category of outage is easy to avoid with a little forward planning. You need to monitor, set alerts and watch trends. Have your system SMS you when available disk space gets below 20% for example.
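A disk space check along those lines needs very little code. This sketch uses `shutil.disk_usage` from the Python standard library; the `notify` callback is a placeholder for whatever SMS or email gateway you use, and the 20% threshold is just the example figure from above.

```python
import shutil


def free_space_fraction(path="/"):
    """Fraction of the disk holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_space_alert(path="/", threshold=0.20, notify=print):
    """Call notify(message) and return True if free space is below the threshold."""
    fraction = free_space_fraction(path)
    if fraction < threshold:
        notify(f"Low disk space on {path}: only {fraction:.0%} free")
        return True
    return False
```

Run it from a scheduled job every few minutes, and record the fraction each time so you can watch the trend as well as the alert.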
In theory modern garbage collected languages like Java and C# make memory leaks much less likely than the direct memory allocation of C and C++, but they are still possible: watch for memory allocated by static classes, caches that are not cleared, or more traditional memory leaks in third party middleware.
Track memory usage and watch for trends. If you see issues then the short term solution is to restart the offending component, but in the long term you need to track down, isolate and fix the problem.
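Watching for a trend can be as simple as fitting a straight line to your memory samples and estimating when it will hit the limit. The sketch below is one way to do that; the function names and the sample figures are invented, and it assumes you already collect (hours since start, memory in MB) readings from your monitoring.

```python
def growth_per_hour(samples):
    """Least-squares slope of memory (MB) against time (hours).

    samples: list of (hours_since_start, memory_mb) with at least two
    different timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    return numerator / denominator


def hours_until_limit(samples, limit_mb):
    """Estimated hours until memory reaches limit_mb, or None if not growing."""
    slope = growth_per_hour(samples)
    if slope <= 0:
        return None
    _, latest_mb = samples[-1]
    return (limit_mb - latest_mb) / slope


# Hypothetical hourly readings that grow by 10 MB an hour
readings = [(0, 500), (1, 510), (2, 520), (3, 530)]
remaining = hours_until_limit(readings, limit_mb=1030)
```

With these made-up readings the estimate is about 50 hours, which tells you how urgently the restart (and the real fix) is needed.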
#4: Time Related Problems
If your website was working fine yesterday but is broken today, one thing that has definitely changed, and is most definitely outside your control to stop, is the date and time.
The biggest threat is daylight saving. You code your new feature in the winter, test it thoroughly and put it live, only to find problems when the clocks change in the spring. Most of the time the bugs this creates are not ‘site-down’ problems (maybe time data in your app is out by an hour throughout), but I do remember one problem at lastminute.com where some badly written date calculation code went into an endless loop every night between midnight and 1 am once daylight saving started, bringing the hotels search down.
The number one rule here is to make sure that your code never gets the current system date and time directly, but calls a mockable alternative. The production implementation gets the date and time from the system clock, while your unit tests can supply a sensible set of dates, times and time zones and check the behaviour for each.
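Here is a minimal sketch of that pattern. The class names `SystemClock` and `FixedClock` and the example function `is_offer_expired` are hypothetical; the point is only that the business logic asks a clock object for the time instead of calling the system clock itself.

```python
from datetime import datetime, timedelta, timezone


class SystemClock:
    """Production clock: reads the real system time (in UTC)."""

    def now(self):
        return datetime.now(timezone.utc)


class FixedClock:
    """Test clock: always returns the moment it was given."""

    def __init__(self, fixed):
        self._fixed = fixed

    def now(self):
        return self._fixed


def is_offer_expired(expires_at, clock):
    """Business logic takes the clock as a parameter, so tests control 'now'."""
    return clock.now() >= expires_at


# In a unit test, pin 'now' to an awkward moment, such as just after midnight
# on a day when the clocks change in many European time zones.
test_clock = FixedClock(datetime(2024, 3, 31, 0, 30, tzinfo=timezone.utc))
expired = is_offer_expired(test_clock.now() - timedelta(minutes=1), test_clock)
```

In production you would pass `SystemClock()`; in tests you would loop over a list of `FixedClock` instances covering the dates and times you care about.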
The other gotcha is licence expiry. Make sure you know when your crucial licences expire, and create a reminder in your task management system of choice to make sure you renew in plenty of time.
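If you prefer an automated check to a calendar reminder, something like the sketch below would do. The licence names, dates and the 30 day warning window are invented for illustration, and `today` is passed in rather than read from the system clock, in keeping with the rule above.

```python
from datetime import date


def licences_needing_renewal(licences, today, warn_days=30):
    """Names of licences that expire within warn_days of today (or already have)."""
    return sorted(
        name
        for name, expires_on in licences.items()
        if (expires_on - today).days <= warn_days
    )


# Hypothetical licence register
licences = {
    "ssl-certificate": date(2024, 6, 10),
    "database-licence": date(2025, 1, 1),
}
due = licences_needing_renewal(licences, today=date(2024, 6, 1))
```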
#5: Hardware Failure
Things with moving parts break, and in the server world that means primarily disks and fans. Disks hold your data so are kind of crucial, over and above mere uptime, so make sure you have RAID redundancy (and appropriate monitoring so you can replace a broken disk before a second one breaks), and backups at least daily with copies offsite.
If you’ve done everything else on this list, and have some money to spend to reduce the risk of hardware failure, then go for it in this sequence:
- Add a load balancer and scale out the web tier, to give you both increased capacity and redundancy.
- Mirror or cluster your database on to a second DB server. Likewise for any critical files on the filesystem: use a SAN or replicate between servers.
- Set up in a second data centre – either as a DR fallback or operate in active-active.
Of course spending cash to ensure site availability is wasted if the failover doesn’t work when you need it, so plan carefully and test it. Once redundant hardware is in place, it protects you from more than just hardware failure. It allows you to direct traffic away from an instance that seems to have locked up while you restart it, or to perform operating system upgrades that require a restart without site downtime.
In Conclusion …
This isn’t an exhaustive list of things that might break your site, but it represents a good summary of outages I have come across. (The big category I have not included here is failure of some service provided to you, like power in your data centre.) However, dig into what caused a failure and it’s probably one of the items on this list, most likely someone changing something.
As I mentioned at the very beginning, hardware failure is last on my list, so do not jump straight into spending money there until you have dealt with the other categories. The single most important thing you can do is control change.
In my days running online development at lastminute.com, I was talking to the head of technical operations over the Christmas period, and I happened to comment that the site had been remarkably stable. He quickly replied, “Yes, that’s because most of your guys are on holiday, so no one’s been meddling. If I sent my team on holiday too we’d have 100% uptime”.
It’s true: IT systems that no one touches don’t break very often. Not never, but not very often. That said, change is clearly a major part of a successful website, so make sure you are confident of the changes you make and how to undo them, and you will be on your way to having a stable site.