On July 30, 2012, Netflix open sourced the Simian Army, which includes the now world-famous Chaos Monkey. That was 3 years ago. For those not familiar, Chaos Monkey is a little service that runs in the background and goes around killing EC2 instances. Yes, that's right:
Chaos Monkey terminates live servers with real users in production.
The question is...
Why is this scary?
Why would I, running a mission-critical system, not want to be continuously testing the resiliency of my service? Why would I want to be woken up in the middle of the night by PagerDuty just because one machine died?
A wise man once said:
Hardware eventually fails. Software eventually works. -- Michael Hartung
This is true whether the hardware is in the public cloud, private cloud, or your own data center. At some point, it will stop working. When it does, it should be business as usual - a non-event.
Investing in high-priced hardware with redundant everything gives one a false sense of security. We have all heard the "perfect storm" stories where multiple supposedly impossible situations all occurred simultaneously. No matter what precautions are taken, something will go wrong. And when you have bet everything on hardware that is never supposed to fail, hardware failure is catastrophic.
Architecting for failure doesn't need to be hard. Just follow these simple rules:
- Be stateless
Actually, that's it - there is only one rule. Make your web and application tiers stateless.
- Got session data? Put it in Redis (ElastiCache) or DynamoDB.
- Got relational data? Hello RDS.
- Got a queue? Look for an AWS service with the letter "Q" in it. Don't like that? Iron.io is pretty nice too.
Notice that all of these are managed services. To build a truly cloud-native application, push the hard stuff (state) to someone else. Let an entire team or company dedicate their full-time jobs to your state.
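To make "be stateless" concrete, here is a minimal sketch of two web instances sharing an external session store. Everything in it is hypothetical: `SessionStore`, `InMemoryStore` and `WebInstance` are names invented for illustration, and the in-memory store only stands in for Redis or DynamoDB so the example runs without a network.

```python
from __future__ import annotations

from typing import Protocol


class SessionStore(Protocol):
    """Anything that can persist session data outside the web process."""

    def get(self, session_id: str) -> dict: ...

    def put(self, session_id: str, data: dict) -> None: ...


class InMemoryStore:
    """Hypothetical stand-in for Redis or DynamoDB."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._data.get(session_id, {}))

    def put(self, session_id: str, data: dict) -> None:
        self._data[session_id] = dict(data)


class WebInstance:
    """One web server. It holds a reference to the store, never the state itself."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> list[str]:
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = InMemoryStore()
server_a = WebInstance("a", store)
server_b = WebInstance("b", store)

server_a.add_to_cart("user-42", "book")
del server_a  # the instance is terminated mid-session
cart = server_b.add_to_cart("user-42", "pen")  # a different instance carries on
```

Because the cart lives in the store rather than in `server_a`, losing that instance costs the user nothing. In a real deployment the store would be one of the managed services above.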
Once you have a stateless service, you are ready to put it into an Auto Scaling Group, and start terminating instances.
I'm not suggesting running Chaos Monkey 24/7/365 - even Netflix has it configured to only run during business hours. This type of resiliency testing is not meant to be painful. It is designed to be helpful - to catch issues while everyone is at their desk, before they become big issues at 3am.
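The business-hours idea is easy to sketch. The snippet below is not Chaos Monkey's actual code or configuration. It is a toy illustration, and the function names, the 9-to-5 weekday window and the instance IDs are all assumptions made up for the example.

```python
import random
from datetime import datetime, time
from typing import Optional


def in_business_hours(
    now: datetime,
    start: time = time(9, 0),   # assumed window start
    end: time = time(17, 0),    # assumed window end
) -> bool:
    """True on weekdays between start (inclusive) and end (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Saturday is 5
    return is_weekday and start <= now.time() < end


def pick_victim(
    instance_ids: list[str], now: datetime, rng: random.Random
) -> Optional[str]:
    """Choose one instance to terminate, or None outside the window."""
    if not instance_ids or not in_business_hours(now):
        return None
    return rng.choice(instance_ids)


fleet = ["i-aaa", "i-bbb", "i-ccc"]  # hypothetical instance IDs
rng = random.Random(0)

weekday_morning = pick_victim(fleet, datetime(2015, 8, 4, 10, 30), rng)
weekday_3am = pick_victim(fleet, datetime(2015, 8, 4, 3, 0), rng)
saturday = pick_victim(fleet, datetime(2015, 8, 8, 10, 30), rng)
```

Only the weekday-morning call returns an instance ID. The other two return `None`, so nobody gets paged at 3am or on a weekend because of a test.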
Important to note: Auto Scaling != Elastic Scalability.
Start simple. If your service normally runs on 3 servers, run your service in an Auto Scaling Group with the minimum and maximum group size both set to 3. When an instance dies, the group launches a replacement to get back to 3 servers. This is all you need to be able to survive Chaos Monkey. If you want to scale out and in, great, but it's not a requirement. Min=Max is good enough.
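Here is roughly what Min=Max looks like as a request to the AWS Auto Scaling API. Treat it as a sketch: `fixed_size_asg_params` is a hypothetical helper, and the group name, launch template name and subnet IDs are placeholders. The dictionary keys are the keyword arguments that boto3's `create_auto_scaling_group` accepts.

```python
def fixed_size_asg_params(
    name: str, size: int, launch_template: str, subnet_ids: list[str]
) -> dict:
    """Build keyword arguments for a fixed-size (Min=Max) Auto Scaling Group."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return {
        "AutoScalingGroupName": name,
        "MinSize": size,          # never fewer than `size` instances
        "MaxSize": size,          # never more than `size` instances
        "DesiredCapacity": size,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template,
            "Version": "$Latest",
        },
        # The API expects subnet IDs as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }


# All names below are placeholders, not real resources.
params = fixed_size_asg_params(
    "web-tier", 3, "web-tier-template", ["subnet-aaa", "subnet-bbb", "subnet-ccc"]
)

# With boto3 installed and AWS credentials configured, this could be sent as:
#   import boto3
#   boto3.client("autoscaling").create_auto_scaling_group(**params)
```

Spreading the subnets across availability zones is a sensible default, but the Min=Max setting is the part that matters here.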
Both AWS and GCE support Auto Scaling, and if you deploy your application using Spinnaker or its enterprise version Armory, you will be using ASGs by default.
Chaos Monkey was never meant to be scary. It will help you build a reliable application. The ones who fear it have bigger problems, and will literally lose sleep over them.
Chaos is good.