This post comes in light of recent events in New Jersey and New York, which were hit by Hurricane Sandy. As with Katrina, it has been a very difficult time, and it is nice to see people helping each other. Businesses were affected by Sandy too: they suffered power loss or lost hardware to flooding. Individuals and businesses alike will be changed forever.
While working for General Motors, I was given the opportunity to learn about and work on disaster recovery and business resumption plans. This meant a tremendous amount of research into something I knew little about. To my surprise, a lot of horror stories came out of Katrina, with many businesses effectively shutting down and liquidating. Some of those business owners wrote about their losses, hoping that others would learn from their mistakes. GM, as you can imagine, has a significant number of employees, business apps, and data required to run day-to-day operations. If headquarters is hit by a tornado or blockaded by disgruntled union workers, how do we ensure continuity as if nothing happened? Working on the Disaster Recovery Plan (DRP) and Business Resumption Plan (BRP) was an eye-opening experience for me.
Just to make sure I am not confusing anyone: a DRP is the plan used to recover data and ensure that the tools the business relies on are restored. A BRP is the plan executed when the physical business location is no longer operable and remote locations must be set up to resume business as normal. Each business will have different requirements for resuming operations, including timelines and the services that are crucial to operations.
I operate under the assumption that anything that can go wrong will go wrong, and that edge cases, while rare, will happen when you least expect them. For instance, who knew that of all things, a CAW blockade would force GM to execute its BRP? Looking at Amazon over the past few months, they have had numerous large-scale failures. Sandy caused major disruptions and forced multiple websites and services to shut down as backup generators ran out of fuel.
I’ve asked many small and medium-sized business owners to describe their disaster recovery process. To my disbelief, most are unprepared or do not understand the severity of potential events. I live in a world filled with paranoia, so I ask them, “What if your hosting provider disappears tomorrow?”, which is often met with a puzzled look. Amazon could never crash, right? What about a push to live that accidentally purges live data? Or an intern who runs a query that deletes data? Companies and developers assume that edge cases never happen because they pay attention and can fix problems as they arise. They need plans for when things go terribly wrong, even if they never do. I won’t claim that I haven’t made mistakes or that I have everything implemented, but I have the plans. If I had the money to execute them, I’d perhaps be in a better position to convince everyone to follow my lead.
Regardless of your situation, you should plan. I won’t get into business resumption too much. Unless you have a decently sized company or a corporation, you won’t necessarily need it; your developers could likely work from home and be as productive as they are in the office. If you operate over a VPN and run a variety of services in house, then you will more than likely need a BRP. I may get into that in another blog post if there are requests. Plan the implementation of your DRP now, and scale its deployment as cash allows.
A product or business can suffer a wide variety of failures that have the potential to disrupt services. These include:
- Degradation of response times
- Server crash
- DNS issues
- Attack on services
- Major bug
- Provider / infrastructure crash
- Provider / infrastructure unavailability (natural disaster, fire, etc.)
Degradation of response times
This is a pretty standard event that can happen to any application that starts to grow. While it is a nice problem to have, it comes with a high level of stress for developers and often ends with frustrated users. I’ve lived through this plenty of times. Often the latency comes from the database, but it can also come from bottlenecks in web nodes that have trouble keeping up with the number of requests.
It’s often easier to debug and resolve issues with the web nodes. The first place to look is CPU and memory levels, to see whether you need to add new web nodes. Hopefully you are already running your web nodes behind a traffic manager or load balancer, such as Riverbed Stingray, which we use. Adding a new node should take you no more than 5 minutes, or seconds if you have scripts. Be sure to look at your server logs and error-handling logs if you have any.
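As a rough sketch of that first check, you can compare the load average against the core count and decide whether it is time to scale out. The 0.8 threshold here is an assumption; tune it to your own workload.

```python
# Sketch: decide whether a web node is saturated, based on the 1-minute
# load average relative to core count. The 0.8 threshold is illustrative.

import os

def should_add_node(load_1min: float, cores: int, threshold: float = 0.8) -> bool:
    """Return True when the node is running hot enough to justify scaling out."""
    return load_1min / cores >= threshold

if __name__ == "__main__":
    load_1min, _, _ = os.getloadavg()  # Unix-only
    cores = os.cpu_count() or 1
    print(f"load/core = {load_1min / cores:.2f}, "
          f"add node: {should_add_node(load_1min, cores)}")
```

In practice you would feed this from your monitoring system rather than a one-off script, but the decision rule is the same.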
The database layer is fun to debug, and I do mean that sarcastically. Debugging differs for each database technology you use; MySQL, for instance, is far different from Couchbase. For MySQL, turn on the slow query log and find the latencies that way. If your database is taking too many hits, you should grow your master node. The alternative is to have your master handle writes only and the slaves handle reads only. Eventually, your master won’t be able to keep up and you will need to shard your MySQL database. We use VoltDB, which retains the qualities and schema many of us love from MySQL, but with the capability to scale. VoltDB has a statistics engine built in that can help you identify issues, such as slow or otherwise problematic procedures and queries.
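For reference, enabling the slow query log in MySQL is a one-minute job; the log file path below is illustrative, and settings made this way last only until restart, so mirror them in my.cnf to make them permanent.

```sql
-- Enable the MySQL slow query log at runtime.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;   -- log anything slower than 1 second
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
```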
We also use a tool called New Relic extensively to help us discover the major latencies. This works in many coding environments and is extremely useful for monitoring your application and all the latencies within it.
Server crash

Servers crash for unknown reasons, unless of course the crash is related to degradation of the service under heavy traffic. The machine could have a memory leak, hit a fatal error, or suffer a temporary glitch that makes it unavailable. Let’s face it, you don’t want to be at your computer staring at monitors all day long to ensure your servers are always healthy. There are plenty of monitoring apps you can install that will send you notices or even call you while you are sleeping.
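The core of any such monitor is just a health check plus an alert path. Here is a minimal sketch; the fetcher is injectable so it can be exercised without a live server, and `alert` just prints where a real setup would page you by email, SMS, or a paging service.

```python
# Minimal monitoring sketch: poll a health endpoint and report failures.
# URL, retry count, and the alert mechanism are all illustrative.

from urllib.request import urlopen

def fetch_status(url: str, timeout: float = 5.0) -> int:
    with urlopen(url, timeout=timeout) as resp:
        return resp.status

def check_health(url: str, fetch=fetch_status, retries: int = 3) -> bool:
    """Return True if the service answers 200 within the allowed retries."""
    for _ in range(retries):
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            pass  # connection refused, timeout, DNS failure, ...
    return False

def alert(url: str) -> None:
    print(f"ALERT: {url} is down")  # replace with your paging mechanism
```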
Any downtime should be unacceptable. Even if you wake up to the alerts or watch the monitors and react accordingly, you will still have downtime. How do you avoid downtime entirely? Build redundancy! Most applications or services have three stacks that can fail, and each should be entirely redundant: the load balancer / traffic manager, the web cluster, and the database cluster. Ensure that if any server fails, other machines handle the requests while the failed machine is restored. We’ve had plenty of server failures over the past months, but no downtime at all. The most extreme crash we’ve suffered was a database node with a faulty CPU. VoltDB has a K-safety feature: the value you set for K-safety is the number of nodes that can fail without taking the cluster down, and we had it set to 1.
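The K-safety arithmetic is simple enough to sketch: with K-safety set to K, each partition lives on K + 1 nodes, so the cluster keeps serving as long as no more than K nodes are down at once.

```python
# Sketch of the K-safety arithmetic: a cluster with K-safety = K keeps
# K + 1 copies of each partition and tolerates up to K node failures.

def copies_per_partition(k_safety: int) -> int:
    return k_safety + 1

def survives(k_safety: int, failed_nodes: int) -> bool:
    """True if the cluster can keep serving with this many failed nodes."""
    return failed_nodes <= k_safety
```

With our setting of K = 1, losing the node with the faulty CPU left one copy of every partition still live, which is why we saw no downtime.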
We each have our technology preferences, but if it does not support redundancy, build it in. Your users and your stress levels will thank you.
DNS issues

DNS issues are often harder to diagnose, and chances are that by the time you’ve figured one out, it has resolved itself. Sometimes the issue is close to your servers; other times it affects your users. But there are cases where DNS issues linger and take a while to resolve. Working around them can cost you money and is usually not worth it unless you are considerably large.
DNS issues will often affect your availability to the world, or to certain zones, while your servers are still running and communicating with each other over the VLAN without issue. If your cloud or hosting provider allows you to deploy machines in other zones, such as the west coast instead of the east coast, it can be relatively quick and easy to bring your services back online. Some companies also run in multiple zones and may opt to turn off traffic from one; even if that increases latency for some users, it ensures availability.
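The zone-disable idea boils down to keeping a preference-ordered list of zones and routing to the best one that is still healthy. A sketch, with made-up zone names:

```python
# Illustrative zone failover: prefer zones in order, skip unhealthy ones.

def pick_zone(zones_in_order, healthy):
    """Return the most-preferred healthy zone, or None if all are down."""
    for zone in zones_in_order:
        if healthy.get(zone, False):
            return zone
    return None
```

Your traffic manager or DNS-based global load balancer is what actually implements this; the point is that the health map must be something you can flip manually when a provider zone goes dark.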
Attack on services
There are numerous types of attacks, including DDoS, defacing, and attacks that destroy or steal sensitive information. DDoS attacks are tough to avoid unless you have the money to invest in firewalls or services like the ones Akamai offers. There have been several large-scale defacings over the past few months, including the NBC site a few days ago. SQL injection has been a very common method of destroying and stealing information. Your code and data should be backed up regularly; you’ll likely need backups to restore services. If you are not restoring from a backup, ensure you can audit your data with tremendous accuracy.
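Most SQL injection comes from string-built queries, and the fix is the same in every language: parameterized queries, which keep hostile input as data rather than SQL. A small demo with the Python standard library’s sqlite3 module; the table and inputs are made up.

```python
# Parameterized queries keep injection payloads inert. The ? placeholder
# lets the driver handle escaping -- never splice user input into SQL
# with string formatting.

import sqlite3

def find_user(conn, username):
    cur = conn.execute("SELECT id FROM users WHERE name = ?", (username,))
    return cur.fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

hostile = "alice' OR '1'='1"          # classic injection payload
assert find_user(conn, hostile) is None   # treated as a literal string
assert find_user(conn, "alice") == (1,)
```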
These days, most developers use SVN or Git, so restoring code should be straightforward. Data, on the other hand, may be more difficult to restore unless you have a good backup procedure. You are more than likely going to lose some data, but not all attacks result in data being purged. In fact, you can ensure that the accounts used by the web app don’t have permission to delete rows, create tables, drop tables, and so on. That doesn’t necessarily stop your data from being overwritten, though. Plenty of database technologies offer backup solutions; use them, because you will need those backups some day, and perhaps not only after an attack. Some offer automated, scheduled backups. Disk space is cheap, so daily backups should be the bare minimum.
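In MySQL terms, restricting the web app account looks something like this; the account name, host pattern, and schema are illustrative.

```sql
-- Give the web application account only what it needs: no DELETE,
-- DROP, or ALTER. Names here are placeholders.
CREATE USER 'webapp'@'10.0.0.%' IDENTIFIED BY 'change-me';
GRANT SELECT, INSERT, UPDATE ON app.* TO 'webapp'@'10.0.0.%';
```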
With your services restored, failing to solve the exploit your attackers used will put you at risk of a repeat and potentially worse attack. Investigate the entry point or get help. Don’t assume that once you’ve fixed one exploit there aren’t others lingering. Update your OS, update your software, and review your server logs carefully.
Major bug

As much as we like to claim we test, something tends to slip through the cracks. Some bugs are minor and can be fixed in future pushes. But then you get the nasty one that makes it to the live environment: code that does not scale or that corrupts data. Fixing these is very similar to recovering from an attack: revert the code from SVN or Git, or restore a backup.
These issues are perhaps the most common and likely to happen. Before major releases, it is always advisable to test on a staging environment. Such an environment should match the live one but have an independent database and codebase. Prior to making the push, taking a backup is perhaps the most important step in case something goes wrong. If nothing does, you can delete the backup and celebrate.
Provider / Infrastructure crash or unavailability
Lately, we’ve been hearing about some rather big crashes from companies like Amazon, which in the past year has suffered major downtime on a few occasions, knocking out websites like Reddit for over 24 hours. Many choose to play it by ear with their provider. As frustrating as it may be to rely on a third party for infrastructure, continue to evaluate their services and have an exit strategy.
Your product and business should not be at the mercy of another’s failure. If you have the ability to be fault tolerant across zones, ensure it is working and disable the zone that is currently down. These are fixes similar to those for DNS issues, except that here you have no access to your servers. You are relying solely on your backups to bring yourself back up. If your backup solution is good, you can launch in another zone or with another provider.
The first suggestion I’d make to any developer or business is to invest in backups. There is a cost associated with this, but if you lose data or code, your business may not be able to continue after a major fault. Invest in the things you don’t expect to ever happen, no matter how confident you are that they never will. Back up your data as often as you can; once a day should be the minimum, though I consider once an hour more appropriate. Keep one backup on site and another at least 200 km away. Be sure you have a plan and that you test restoring services from a backup. I’ve been in a scenario where the backup procedure was solid, but the data couldn’t be restored from the tapes.
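Backups you never test are backups you don’t have. A full restore drill is the real test, but a cheap first check is to record each backup’s SHA-256 when it is taken and verify it again after the copy to the off-site location. A sketch:

```python
# Verify that a backup file still matches the digest recorded when it
# was created -- a cheap integrity check before a full restore drill.

import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def backup_intact(path: str, recorded_digest: str) -> bool:
    """True if the file still matches the digest recorded at backup time."""
    return sha256_of(path) == recorded_digest
```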
Be sure to choose your infrastructure provider carefully. None come without their faults, but some do work on minimizing the impact of a failure. Joyent, for instance, has fewer points of failure: no resilvering, no random mirroring to keep storage blocks available, none of the issues witnessed at Amazon. I can’t get into the benefits of every service; I don’t know them all. Rackspace is another that has done quite well in the cloud space. While I am happy with my provider, I still would not trust them with my data in the case of a major crash. After every backup, we copy it to other servers some 1,400 km away. This alternative backup costs about $1 per 10 GB; it’s cheap and can grow indefinitely.
Be sure to build in redundancy. Put all your traffic through redundant load balancers or traffic managers; this includes your web tier at the bare minimum. If your database technology supports full redundancy, put it through your traffic manager as well. Traffic managers detect failures, and when you need to scale, adding new nodes takes seconds. Database reads can fairly easily be made redundant, but writes are more difficult; make sure the technology you use will let you achieve near-100% uptime.
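The read/write split mentioned above can be sketched as a tiny router: writes always go to the master, reads round-robin across the replicas. The string "connections" here are stand-ins; a real version would hold connection or pool objects.

```python
# Sketch of master/replica routing: writes to the master, reads spread
# round-robin over replicas (falling back to the master if none exist).

import itertools

class ReadWriteRouter:
    def __init__(self, master, replicas):
        self.master = master
        self._reads = itertools.cycle(replicas or [master])

    def route(self, is_write: bool):
        """Pick a backend: master for writes, next replica for reads."""
        return self.master if is_write else next(self._reads)
```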