Many businesses have woken up today to the reality that cloud providers can fail too. Not many had adequate backup plans. How do you recover?
Of the many analyses published since the AWS outage in late April, we found the most practical tips in Wired:
To ensure that each system can stand on its own, Netflix uses something it calls the Chaos Monkey (no relation). The Chaos Monkey is a set of scripts that run through Netflix’s AWS process and randomly shuts them down to ensure that the rest of the system is able to keep running. Think of it as a system where the parts are greater than the whole.
The photo sharing site SmugMug has also detailed its approach to designing for failure and why SmugMug was largely unaffected by the recent AWS outage. SmugMug’s Co-Founder and CEO, Don MacAskill, echoes Netflix’s redundancy mantra, writing, “each component (EC2 instance, etc) should be able to die without affecting the whole system as much as possible. Your product or design may make that hard or impossible to do 100% — but I promise large portions of your system can be designed that way.”
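To make the idea concrete, here is a minimal Python sketch of the "kill something at random and check that the rest keeps running" approach described above. It is a toy, in-memory simulation and not Netflix's actual scripts: it makes no AWS calls, and the names `Instance`, `Service`, `chaos_round` and `survives` are hypothetical ones chosen for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    """Stand-in for one server (for example, an EC2 instance)."""
    name: str
    alive: bool = True


@dataclass
class Service:
    """A hypothetical service backed by several redundant instances."""
    name: str
    instances: List[Instance] = field(default_factory=list)

    def handle_request(self) -> str:
        # Route to the first instance that is still alive;
        # fail only when no instance is left.
        for inst in self.instances:
            if inst.alive:
                return f"{self.name} served by {inst.name}"
        raise RuntimeError(f"{self.name} has no healthy instances")


def chaos_round(services: List[Service], rng: random.Random) -> Optional[Instance]:
    """Kill one randomly chosen live instance; return it, or None if none are alive."""
    live = [inst for svc in services for inst in svc.instances if inst.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def survives(services: List[Service]) -> bool:
    """True if every service can still answer a request."""
    try:
        for svc in services:
            svc.handle_request()
    except RuntimeError:
        return False
    return True


if __name__ == "__main__":
    rng = random.Random()
    web = Service("web", [Instance("web-1"), Instance("web-2")])
    db = Service("db", [Instance("db-1"), Instance("db-2")])
    victim = chaos_round([web, db], rng)
    print("killed:", victim.name, "| system still up:", survives([web, db]))
```

With two replicas per service, a single random kill never takes the system down. A service with only one instance fails the same check as soon as that instance is chosen, which is exactly the kind of weak spot this exercise is meant to expose.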
This is a significant shift in approach for most architects and designers. We have added this topic for discussion at the next monthly roundtable, and look forward to learning more about transitions in the real world!
Image credit: Fall Colors by Bahman Farzad