Although vendor-written, this contributed piece does not advocate a position that is particular to the author's employer and has been edited and approved by Network World editors.
Webscale pioneers like Netflix, Google, Amazon and Etsy have made a science of breaking their own applications and infrastructure so that they can determine if their application and operations architecture is complete and robust. While few IT shops run their apps and infrastructure in the same way as these behemoths, valuable lessons for CIOs and CTOs of all stripes can be found in the innovative practices these companies have created.
In a theoretical sense, webscale companies have a simpler problem to solve than most organizations. These players run one or a few massive services, and while there are lots of components to these services, they are generally better understood and built to work together, unlike what you find in traditional enterprise IT. A typical shop has dozens of interacting components with dependencies that are often neither documented nor widely understood.
The only way that webscale companies manage tens of thousands of servers is by automating absolutely everything. Webscale companies are also usually very disciplined about making their development and test environments identical to their production environments in as many ways as possible. Because, for the most part, webscale companies practice DevOps, they are developing operational procedures, searching for vulnerabilities, and creating automated responses all through the development and release management cycle.
It is important to remember, however, that any webscale development or test environment is really just a shadow of the production environment, which contains complexities and scale that cannot be replicated. When you deploy in such an environment, it is always risky.
The ravages of Chaos Monkey
Chaos Monkey is the original tool Netflix used to reduce risks in its production environment.
Chaos Monkey randomly shuts down servers, services and other components to make sure that failure does not lead to any disruption to users. In practice, Chaos Monkey tests two things:
- The ability of the application and operational architecture to conceal failure
- The quality of automated responses for recovering from failure
Ideally, Chaos Monkey should be able to wreak havoc all day long and users should never notice, simulating a process of continuous disaster recovery. When Chaos Monkey does cause a problem, either the application and operational architecture or the automation of disaster recovery must be fixed.
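To make those two tests concrete, here is a minimal toy simulation in Python. It is not Netflix's tool and does not touch real infrastructure; the names `Server`, `Pool`, `chaos_monkey` and `run_experiment` are hypothetical, invented for illustration. The sketch assumes a pool of interchangeable replicas, a client path that fails over to any healthy replica (concealing failure), and a recovery step that restarts whatever is down (automated response).

```python
import random


class Server:
    """A hypothetical replica that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Pool:
    """A hypothetical pool of interchangeable replicas."""

    def __init__(self, size):
        self.servers = [Server(f"srv-{i}") for i in range(size)]

    def request(self, payload):
        # Conceal failure: try each replica until one answers.
        for server in self.servers:
            try:
                return server.handle(payload)
            except ConnectionError:
                continue
        raise RuntimeError("user-visible outage: no healthy server")

    def recover(self):
        # Automated response: restart every replica that is down.
        restarted = [s.name for s in self.servers if not s.alive]
        for server in self.servers:
            server.alive = True
        return restarted


def chaos_monkey(pool, rng):
    """Shut down one randomly chosen replica and return its name."""
    victim = rng.choice(pool.servers)
    victim.alive = False
    return victim.name


def run_experiment(rounds=100, size=3, seed=0):
    """Count how many requests fail from the user's point of view."""
    rng = random.Random(seed)
    pool = Pool(size)
    user_visible_failures = 0
    for i in range(rounds):
        chaos_monkey(pool, rng)
        try:
            pool.request(f"req-{i}")
        except RuntimeError:
            user_visible_failures += 1
        pool.recover()
    return user_visible_failures
```

With three replicas and recovery after every round, `run_experiment()` reports no user-visible failures. With `size=1` every injected failure reaches the user, which is the signal that the architecture, not the monkey, needs fixing.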
Netflix also now has its own simian army, including Chaos Gorilla, which takes out an entire availability zone. (To read more about havoc that can be wreaked by other members of the simian army, such as Latency Monkey, Conformity Monkey, Doctor Monkey and Security Monkey, see the Netflix blog on that topic.)
Each of the webscale players has its own bag of tricks for being its own worst enemy. When I was at Google, we used firewall rules to simulate network outages. Etsy has developed a huge arsenal of automated tests so it can deploy changes many times a day, confident that any problems will be quickly found. Loudcloud encouraged discipline by offering a 100% SLA with financial remedies. Amazon doesn't talk much about what it has learned. I wish it did.
What IT can learn
The typical IT shop could usually benefit from acting more aggressively as its own worst enemy. Most of the time, IT operations are based on manual checklists. When a server goes down, there is a procedure written down somewhere for bringing it back up. As anyone who has any significant experience knows, it is one thing to have a checklist. It is another thing to have a well-tested and consistently used checklist, in the same way as it is one thing to have a backup and another thing altogether to have used that backup to successfully restore key applications.
As a first step, IT can be more aggressive about its manual procedures in the following ways:
- When you cause an outage, do it violently. Don't shut down a server step by step. Pull out the network cable or shut off the power. At Google we used the firewall to inflict such violence.
- Don't have someone who owns the service cause the outage. Often, they will take it easy on their baby.
- Have someone who is not involved in the service run the restore procedure using the checklist. This is how you can increase quality.
- The more your development and test environment matches your production infrastructure, the more of this work you can do in dev/test.
Creating effective, tested manual procedures is the prerequisite for automation, which is a much stronger foundation for disaster recovery than manual methods.
Practice being your own worst enemy
Google has an annual companywide ritual called DIRT (Disaster Recovery Testing), dedicated to finding vulnerabilities and improving manual and automated responses. Every year, the program grows in scope and quality. In essence, even at a place like Google, where everything is highly automated, the company has to practice being its own worst enemy to be good at it.
Focus automation of disaster recovery on the most mission-critical and fragile parts of your apps and infrastructure. Automation not only allows for a faster response to outages but it also helps you spin up new servers faster in response to scaling events.
Once you have automation in place for all the types of outages you can foresee, then it's time to unleash your own simian army. Start out in the test or dev environments until it is hard for you to destroy your app with an outage. Next, hold your breath and try it in production during a maintenance window or at some other low-risk time. When doing this, make sure you include all perspectives. Test from the end-user perspective, not just the interaction between servers.
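The staged rollout described above can be sketched as a simple gate. This is only an illustration under stated assumptions: the function names, the required number of clean test runs, and the 02:00 to 04:00 maintenance window are all hypothetical and should be replaced with whatever your own shop considers low risk.

```python
from datetime import datetime, time


def in_maintenance_window(now, start=time(2, 0), end=time(4, 0)):
    """Return True if `now` falls inside the (assumed) low-risk window."""
    return start <= now.time() < end


def next_environment(clean_test_runs, required_clean_runs, now):
    """Decide where the next failure-injection run should happen.

    Stay in test until outages stop breaking the app there, then move
    to production only during the maintenance window.
    """
    if clean_test_runs < required_clean_runs:
        return "test"
    if not in_maintenance_window(now):
        return "wait"
    return "production"
```

For example, `next_environment(3, 10, datetime(2024, 1, 1, 3, 0))` keeps the exercise in test because the app has not yet survived enough injected outages there, regardless of the time of day.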
The most important thing is to take the time to be your own worst enemy. Doing so can make the difference between an outage that lasts a few minutes and one that lasts hours or even days.
Puppet Labs' IT automation software enables system administrators to deliver the operational agility and efficiency of cloud computing at enterprise-class service levels, scaling from handfuls of nodes on-premise to tens of thousands in the cloud. You can learn about Puppet Labs here: https://puppetlabs.com.