The real meaning of outages; coming full circle on resilient cloud

By: Carl Brooks - 31/08/2012

Carl Brooks covers cloud computing and the next generation of IT infrastructure for 451 Research.


Colt is always keen to support discussion on key issues affecting customers and service providers. In a recent research piece on www.451research.com, Carl Brooks from 451 Research addressed the issue of service outages and its impact on this new IT-as-a-service world. We felt it was so well argued and balanced that it needed a wider audience. Thanks to 451 Research for allowing us to repost this as a guest blog.


In the first part of this series, T1R touched on several recent outages from cloud providers, websites and even a good old-fashioned mainframe outage. Hot on the heels of those are yet more reported outages – salesforce.com suffered a major outage on July 10, following a less serious outage on June 28. Murphy's law strikes again, but what does it portend for cloud users?

The promise of cloud computing was that it made all the hard work invisible and delivered the useful end product – servers and storage – as though it were magic. IT operators can appreciate the toil and expense that go into maintaining a highly reliable environment; cloud users were never supposed to have to; at least, that was the premise. In delivering what are essentially dedicated virtual servers, cloud providers take on all of the work that IT shops consider expensive and headache-producing: replicating data, providing failover if a server or a connection goes down, and maintaining all the underlying equipment, facilities and relationships with transit and peering providers.

In the enterprise, that kind of reliability carries a stiff price tag, and the usual answer to the question 'what's our uptime capability?' is 'How much are you paying?' Cloud computing has been seen as a panacea for that unpleasant reality. Software as a service has aided the illusion, since even when salesforce.com or the like went down, it rarely took out business-critical IT functions, and the assumption was that IaaS providers would take their infrastructure that much more seriously – and they do. IaaS providers generally maintain a minimum of 99.95% uptime across their services, considerably better than the average IT shop. Of course, that's still lower than what more traditional hosting providers can offer; it's one of the tradeoffs of building on cheap commodity gear and designing around resilient automation, rather than premium gear and lots of staff.
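
To put that 99.95% figure in perspective, the arithmetic is simple: the downtime budget is the complement of the uptime percentage multiplied by the period. A quick illustration in Python:

```python
# Downtime budget implied by a given uptime SLA.
def downtime_hours(uptime_pct, period_hours=24 * 365):
    """Hours of outage a given uptime percentage permits over the period."""
    return period_hours * (1 - uptime_pct / 100)

for sla in (99.0, 99.95, 99.99):
    print(f"{sla}% uptime -> {downtime_hours(sla):.1f} h/year, "
          f"{downtime_hours(sla, 24 * 30) * 60:.0f} min/month")
```

At 99.95%, a provider can be dark for roughly 4.4 hours a year – about 22 minutes a month – and still be inside its SLA.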

Of course, the truth is that every service provider will have outages, and the basic tradeoff of the cloud is that you give up control over your infrastructure in return for easy access – not invulnerability to anything going wrong. It appears that the answer to reliability for the enterprise is the same as it has always been (buy more resources), although the nuances and the technology are changing. Cloud enables new and better ways to achieve resiliency, but not with the current state of the enterprise.

Design for failure

Advocates and early adopters of cloud technology are quick to say 'design for failure,' meaning that infrastructure and applications should be built with the expectation that any and all parts of the system can fail. That's not exactly a new design philosophy, but enterprise practice is rudimentary compared with what cloud practitioners mean by it. Netflix, which operates entirely on Amazon Web Services (AWS) at this point, famously wrote a 'chaos monkey' into its systems – a bit of code that semi-randomly disables parts of Netflix's online infrastructure to continually improve its response to problems. The company replicates all important data and systems across multiple geographical locations by default and takes fault tolerance to an extreme.
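
The chaos monkey idea itself fits in a few lines. The sketch below is purely illustrative, not Netflix's actual code; `list_instances` and `terminate_instance` are hypothetical hooks into whatever provisioning layer is in use:

```python
import random

TERMINATION_PROBABILITY = 0.1  # chance per eligible instance, per run

def chaos_run(list_instances, terminate_instance, protected=frozenset()):
    """Semi-randomly terminate non-protected instances so that failure
    handling is exercised continuously, not just during real outages."""
    for instance in list_instances():
        if instance in protected:
            continue
        if random.random() < TERMINATION_PROBABILITY:
            print(f"chaos: terminating {instance}")
            terminate_instance(instance)

# Toy usage against an in-memory 'fleet':
servers = ["web-01", "web-02", "db-01"]
chaos_run(lambda: list(servers),
          lambda i: servers.remove(i),
          protected={"db-01"})
```

Run on a schedule against production, something like this forces teams to treat every individual instance as disposable.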

That's a few evolutionary steps above RAID, tape and spare parts on the shelf, which is the disaster plan for most traditional IT. Only the most important applications get site replication and automatic failover, because it's expensive. Conversely, using an IaaS provider means that replication is both easy and relatively inexpensive; more importantly, it is the only way to get demonstrable improvements in reliability.

Users can't buy more or less reliable cloud servers (stronger SLAs can be purchased, but an SLA is a promise of compensation, not of actual uptime performance).

As enterprise IT continually automates, refines and adopts the techniques that cloud providers and cloud users have pioneered, it will begin to adopt this philosophy, and that will happen in the management layer. All indications point to enterprise IT that eventually functions more like IaaS than not. This kind of capability requires advanced management tooling to extend across an organization, of course, but that too is on the way as enterprise IT departments face pressure to deliver the same kind of flexibility as Amazon does.

Server goes down? Instead of troubleshooting until it can be brought back up, why not bring up a duplicate, refresh it with a snapshot of the failed one, recheck any data against the master copy for consistency, and only then go troubleshoot the failure? Downtime becomes minutes instead of hours, leaving more time for root-cause analysis and less time spent bug hunting. Some IT shops already do this today; it's common in the virtual desktop world and in fluid environments like test and development.
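
Written out as a runbook, that flow looks something like the sketch below; every helper is a stub standing in for site-specific tooling, not a real library call:

```python
# Hypothetical "replace first, diagnose later" recovery flow.
def provision_duplicate(server):      return f"{server}-replacement"
def latest_snapshot(server):          return f"{server}-snap"
def restore_snapshot(new, snap):      print(f"restoring {snap} onto {new}")
def master_copy(server):              return f"{server}-master"
def verify_consistency(new, master):  return True  # would diff data here
def route_traffic_to(server):         print(f"routing traffic to {server}")
def schedule_root_cause_analysis(s):  print(f"RCA queued for {s}")

def recover(failed_server):
    # 1. Bring up a duplicate immediately instead of repairing in place.
    replacement = provision_duplicate(failed_server)
    # 2. Refresh it from the most recent snapshot of the failed machine.
    restore_snapshot(replacement, latest_snapshot(failed_server))
    # 3. Verify restored data against the master copy before cutover.
    if not verify_consistency(replacement, master_copy(failed_server)):
        raise RuntimeError("consistency check failed; do not cut over")
    # 4. Cut traffic over: downtime ends here, in minutes rather than hours.
    route_traffic_to(replacement)
    # 5. Only now investigate the original failure, offline and unhurried.
    schedule_root_cause_analysis(failed_server)

recover("web-04")
```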

Multi-cloud

That's the internal IT story. Protection against outages is also a concern for external IT users – those consuming public cloud services and hosted cloud-style environments. This may eventually involve multiple cloud providers, but again, management will be the key. The natural extension of the traditional IT disaster-recovery strategy into public cloud is to host failover environments with different providers. To some extent this is what AWS offers with its multiple Regions and Availability Zones, and its own best practices say that users should duplicate resources and spread them across several Regions. But the wary IT professional would prefer to use another vendor entirely, for extra assurance, and that seems obvious.
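
As a concrete illustration of the duplicate-and-spread pattern, the sketch below launches the same workload in two AWS Regions. It assumes boto3, the current AWS SDK for Python (a convenience this article predates), and the AMI IDs are placeholders, since image IDs are region-specific:

```python
import boto3  # AWS SDK for Python; an assumption here, not named in the article

REGIONS = {
    "us-east-1": "ami-XXXXXXXX",  # placeholder, region-specific image IDs
    "eu-west-1": "ami-YYYYYYYY",
}

def launch_replicas(instance_type="t2.micro"):
    """Run one copy of the workload in each Region so a regional
    outage leaves at least one replica serving traffic."""
    for region, ami in REGIONS.items():
        ec2 = boto3.client("ec2", region_name=region)
        ec2.run_instances(ImageId=ami, InstanceType=instance_type,
                          MinCount=1, MaxCount=1)
        print(f"launched replica in {region}")
```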

The problem is that IaaS environments are not homogeneous. In fact, they're usually markedly different in quirky ways, to the extent that applications are designed around the performance characteristics of particular providers. AWS is somewhat non-deterministic in routing traffic, so it's used for applications that aren't dependent on consistent I/O. Rackspace Cloud has strong machine-to-machine connectivity, so it attracts applications that depend on exactly that. Microsoft Azure's Blob storage has excellent consistency, speed and availability; some users treat it as a back end for Web applications running on their own IT. What all this means is that moving an infrastructure stack between providers is not at all trivial. In many cases, it's not possible.

However, that, too, is part of the evolution underway. Tier1 Research sees hybrid IT deployments becoming the dominant model over time. In time, there will be enough providers and environments to suit all comers. AWS has already created offerings like Cluster Compute to serve HPC users, and industry- and application-specific cloud offerings are already here. Azure, for instance, is the obvious example of a Microsoft-specific cloud.

There will come a time when external clouds will be both resources and failover pools, and enterprises will actually be able to do many of the things that cloud computing seems to promise, but it will be a variegated landscape.

T1R take

Several trends need to materialize more fully before we see multi-cloud deployments that can protect enterprises from outages 'in the cloud.' One of them is more fully abstracted applications and infrastructure, which will in turn be more portable across different providers. This means a continual re-examination of how an IT department or MIS thinks about its operations, moving more and more toward modular, replaceable components across all of IT, not just the obvious candidates. There's no reason an application server can't be as easily and painlessly replaceable as a hard drive in a RAID array, but that means rethinking a lot of received wisdom about production environments and traditional application stacks. The next trend is the adoption of true multi-cloud, multi-resource management tools that can treat a server in the basement the same as a server at an IaaS provider. Cloud brokering technology needs to mesh with traditional IT management frameworks all the way up and down the stack. Companies like EnStratus and DynamicOps (bought by VMware for just this reason) can fill in some of those gaps, but we're a long way from mainstream adoption.
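
What such a management layer might look like can be sketched as a common interface with per-environment adapters; the classes below are hypothetical, not any vendor's API:

```python
from abc import ABC, abstractmethod

class ManagedResource(ABC):
    """One interface, whether the box sits in the basement or at an IaaS
    provider; environment quirks stay inside each adapter."""

    @abstractmethod
    def provision(self): ...

    @abstractmethod
    def is_healthy(self): ...

    @abstractmethod
    def decommission(self): ...

class OnPremServer(ManagedResource):
    def provision(self):
        print("PXE-boot and configure a physical machine")
    def is_healthy(self):
        return True  # would poll the management network here
    def decommission(self):
        print("power down and reclaim the hardware")

class IaasServer(ManagedResource):
    def provision(self):
        print("call the provider API to launch an instance")
    def is_healthy(self):
        return True  # would query the provider's status API here
    def decommission(self):
        print("terminate the instance")

def replace_if_unhealthy(fleet):
    """Treat every resource identically, wherever it physically runs."""
    for resource in fleet:
        if not resource.is_healthy():
            resource.decommission()
            resource.provision()

replace_if_unhealthy([OnPremServer(), IaasServer()])
```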

The good news is that there is an ongoing opportunity for service providers to pick up enterprise business as these trends move forward, and the basic economic premise of hosted infrastructure only grows stronger as enterprises make these kinds of shifts. Infrastructure providers should be looking to encourage and enable hybrid environments and rudimentary multi-cloud approaches that can bridge that gap.

