Outages at cloud providers appear regularly in the news, and the reporting frequently reveals a relatively immature understanding of this IT delivery model. In June it was widely reported how Amazon's cloud offering, Amazon Web Services (AWS), and its customers were affected by an outage. Some of the affected businesses appeared, on the face of it, to be cloud-based services in their own right, but had in fact outsourced the provision to AWS. The IT press focused on the circumstances that led up to the outage, and there appeared to be an ample supply of disaffected customers ready to speak up. Maybe the cloud industry itself is perpetuating some of the mystery that suggests that cloud is somehow different. But an IT infrastructure can be made highly redundant and provide sophisticated failover whether it is run in-house or delegated to the cloud.
The following aspects stand out because they seem to repeat themselves regardless of cloud provider:
SLA: The SLA is the contract that governs the relationship between customer and supplier. Some customers didn't realise that they were on the wrong SLA until things went wrong.
Backup: There were reports of customers who had not initiated a backup of their assets. There is a parallel in the offsite/online backup market, where customers also share backup responsibilities. Similarly, in the cloud provider model the SLA should spell out who is responsible for what, and customer representatives must exercise due diligence to ensure that nothing is left out.
Duplication of infrastructure: The AWS model is designed around redundancy at numerous levels, such as across AWS data centres and in different geographies. This is where the worldwide Amazon setup can offer failover that would be very expensive to build from the ground up. However, Amazon has priced these optional premium features accordingly, and some customers choose not to adopt them because of the cost.
Response times: As the impact of the outage became clear, the AWS engineers began to rebuild alternatives to the stricken setup, and that clearly took time. During this phase customers complained that their websites suffered very bad response times. Being nominally online is not good enough, and the fall-back position on the ability to transact must be well understood and feature in the SLA.
Geographical load balancing: Amazon not only offers redundancy across its worldwide estate but also offers load balancing as a premium feature. Once the incident affected performance, it became hard to tell load balancing as an optional performance extra apart from load balancing as an essential contingency feature.
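The point about response times can be made a little more concrete. The following is only a minimal sketch, assuming a customer already collects probe results for their own site: it treats a site that answers, but too slowly, as degraded rather than healthy. The function name, the 2-second threshold and the 5% tolerance are hypothetical illustrations, not values taken from any provider's SLA.

```python
from typing import Optional, Sequence


def classify_service(
    samples_ms: Sequence[Optional[float]],
    slow_threshold_ms: float = 2000.0,  # assumed limit for a usable response
    max_bad_fraction: float = 0.05,     # assumed tolerance for slow or failed probes
) -> str:
    """Classify a service as "ok", "degraded" or "down" from recent probes.

    Each sample is a response time in milliseconds, or None if the probe
    failed outright. Thresholds are illustrative assumptions only.
    """
    if not samples_ms:
        raise ValueError("at least one probe sample is required")

    total = len(samples_ms)
    failed = sum(1 for s in samples_ms if s is None)
    slow = sum(1 for s in samples_ms if s is not None and s > slow_threshold_ms)

    if failed == total:
        return "down"
    # Answering slowly counts against the service just like not answering.
    if (failed + slow) / total > max_bad_fraction:
        return "degraded"
    return "ok"


if __name__ == "__main__":
    # A site that mostly answers, but too slowly for customers to transact.
    print(classify_service([120.0, 150.0, 9000.0, 8000.0, None]))
```

Whatever thresholds are chosen, the useful exercise is agreeing on them in advance, so that "online but unusable" has a defined meaning when the SLA is negotiated.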
The incident in question involved AWS, but the same could be said of a number of outages at other cloud providers, some of which have received more publicity than others. A common trait in these incidents is that a significant proportion of customers are only confronted with the choices they have made once a live outage occurs. The mantra in DR planning is to test frequently, yet in the world of owner/user IT sites this testing is often not performed. Cloud-based IT should also test contingencies, but here it is even less likely to take place. Both customers and suppliers have responsibilities to fulfil in order to improve IT provision in the cloud. It would be a mistake to view the planning phase of IT supplied in the cloud as being less onerous than building your own; quite the opposite is true.
Image credit: tree-nursery / boomkwekerij by friedkampes