This week's cloud tempest is the very visible breakdown of Microsoft's Danger storage service for the T-Mobile Sidekick phone. An apologetic email (as reported by TechCrunch) first went out from Microsoft to users noting that all data had been lost with no way to recover it. It now appears that some or most of the data will be recovered, which is, of course, good news. As far as I know, Microsoft has not provided any formal explanation of what went wrong, but most of the speculation I've seen identifies a failed SAN upgrade, with no data backup available, as the cause of the data loss.
People on all sides of the cloud debate have been arguing over this incident and treating it as a proxy for the entire concept of cloud computing.
While one shouldn't conflate this situation with the totality of cloud computing, it does highlight some very important issues that are worth exploring and understanding.
Lessons to be Drawn
It's a cloud: Some of the writing I've seen on this incident downplays it because, in the view of the authors, this service isn't really a cloud offering. They say it's a limited application, or an adjunct service to a hardware device, or that it's really a consumer service and therefore not a "real" cloud application, because those are aimed at business users. That's baloney.
First of all, it is a cloud application. It certainly fits the common SaaS definitions. The "it's really a consumer service" rationale won't wash, either. With the blurring of consumer and commercial use, what's personal to one person might be mission-critical to another. And trying to deflect concern about this incident by defining it away misses the point. Cloud computing is a big tent (if I may mix a metaphor), and one of its strengths is that many different approaches can be considered cloud computing. In any case, clever dissembling is beside the point. If it walks like a duck and quacks like a duck, trying to convince someone that it's not a duck because it's actually a similar-looking, slightly different species is unlikely to succeed.
This attention bespeaks intense interest in the cloud: Let's face it, all the hullabaloo about this incident is good news, because it means people recognize cloud computing is an important development. You don't spend a lot of time worrying about something you don't care about. It's obvious that the concept of cloud computing has garnered attention, which I attribute to the fact that everyone recognizes that the old methods of running IT infrastructure are expensive and don't scale.
This incident represents a breach of best practices: Losing data is the greatest failure an operations group can suffer. A service outage is bad, but losing data is inexcusable. In fact, calling this a breach of best practices gives backups too much credit. The term "best practice" describes a set of processes performed by the leaders in a field, not the mainstream. Backing up data is data management 101; really, it's 01. If this incident is truly the result of failing to do a backup, it contravenes the most basic practice of managing data. No matter what the cause, losing data is inexcusable.
It calls into question one of the tenets of cloud computing: The expertise of cloud providers. My company does not run its own email service; we use Google to manage our mail system. Is this because we don't know how to run a mail server? Of course not. We do it for a very simple reason: using Google allows us to focus on our core mission, serving our clients.
We are very aware of what would happen if we ran our own mail server. Every time there was a problem, we'd treat it like an inconvenient interruption, and do just the minimum to patch the problem and get back to our real work. We would never devote the amount of time that running a mail server deserves. Therefore, our mail service would always be fragile, subject to interruption, and (most likely) vulnerable to security penetration. So we turn to a company that can devote real resources to running our mail server, one that follows best practices, and one that can take the necessary time to do it right.