Yesterday, Feb 29th, the leap day in a leap year, saw not only the third day in Microsoft’s MVP 2012 Summit, not to mention the fifth iteration of my personal MVP Summit party, #ChezNeward, but also one of the most embarrassing outages in cloud history. Specifically, Microsoft’s Azure cloud service went down, and it went down hard. My understanding (entirely anecdotal descriptions, I have no insider information here) was that the security certificates were the source of the problem: specifically, they were set to expire on Feb 28th, and not to renew until March 1st. (I don’t know more details than that, so please don’t ask me for the details as to how this situation came to be.)
For those of you playing the home game, this means (IIRC) that each of the major cloud providers has had a major outage within the last two years: Azure’s yesterday, Amazon’s of a few months ago, and of course Gmail goes down every half-year or so, to tremendous fanfare and finger-pointing. (You can hear the entire Internet scream when it does.)
Can we please stop with the hype that somehow “The Cloud” is the solution to all your reliability problems?
I’m not even going to argue that the cloud services hosted by “the big boys” (Microsoft, Google, Amazon) aren’t somehow more reliable; in fact, I’ll even be the first to point out that by any statistical measure I’ve seen examined, the cloud providers stay up far more often than what a private data center achieves. Part of this is because of the IT equivalent of economies of scale: if you’re hosting five servers, you’re not going to put as much money and time into keeping the data center running as if you’re hosting five thousand servers. HVAC and multiple Internet connections and all are expensive, and for a lot of companies, remain entirely out of their IT budget’s reach.
What companies need to realize is that moving to the cloud isn’t just moving your software out of the data center and into somebody else’s data center—it’s also a complete loss of control over what happens when an outage occurs.
When a company builds a business that puts technology at the front and center of its operations, and then puts that technology into the hands of a third party for safe-keeping and management, that company loses a degree of control over when and how the emergency response happens. If the data center is inside your building, managed by your people, you (the CEO or CTO) have a say in how things come back online—do you restore email first, or do you restore the web site? Is the directory service the most critical aspect of your system? And so on.
More importantly, your people are on it. They may not be as technically gifted as the people that manage the cloud centers (or so at least the cloud providers would have you believe), but your people are focused on your servers. Inside the cloud centers, their people are focused on their servers—and restoring service to the cloud center as a whole, not taking whatever means are necessary, including potentially some jury-rigging of servers and networking infrastructure, to get your most critical piece of your IT story up and running again.
Readers may think I’m spinning some kind of conspiracy theory here, that somehow Microsoft is looking to sacrifice its Azure customers in favor of its own systems, but the theory is much more basic than that: Microsoft’s Azure technicians are trying to restore service to their customers, but they don’t really have much preference over which customers get service first, whether that’s you or the guy next to you in the rack. Frankly, for a lot of businesses, you’re the same way: one customer isn’t really different from another. Yes, we’d like to treat them all “special”, but when the stress ratchets up through the roof, you’re not going to quibble over which one gets service first—you’re going to break your neck trying to get them all up ASAP, rather than just a few up first.
Some businesses are OK with this kind of triage on the part of their hosting provider. Some, like the now-infamous cardiac monitoring startup that was based on AWS and as a result lost connections to their patients (a potentially life-threatening outage for them) when AWS went down… yeah, some businesses aren’t OK with that.
Cloud will never replace on-premise hosting. It will supplement it in places, it will complement it in others. The sooner the CTOs and CIOs of the world realize that, the better.
Thursday, March 01, 2012 11:00:51 PM (Pacific Standard Time, UTC-08:00)