Rackspace experienced an outage yesterday that took down a number of high-profile sites, including the popular blog site TechCrunch. Outages have been a recurring issue this year for the hosted data center provider. No network is impervious to outages, but a company like Rackspace needs to provide consistent and reliable service.
The Official Rackspace Blog explains, "On December 18, 2009 between 3:37 p.m. and 4:12 p.m. CST, Rackspace experienced network connectivity problems." That timeline doesn't jibe with the timestamp on the TechCrunch report on the Washington Post site, which says 12:17pm. Assuming the TechCrunch timestamp is Pacific time, the report went up at about 2:17pm Central time, which would mean that the outage began more like 2pm Central time, or possibly even earlier.
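The time zone arithmetic behind that discrepancy can be checked with a short sketch. It assumes, as the paragraph above does, that the TechCrunch timestamp is Pacific time and that it refers to December 18, 2009; the variable names are illustrative only, and fixed winter offsets (PST = UTC-8, CST = UTC-6) are used because daylight saving time is not in effect in December.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offsets: no daylight saving time in December.
PST = timezone(timedelta(hours=-8), "PST")
CST = timezone(timedelta(hours=-6), "CST")

# Assumption: the 12:17 p.m. TechCrunch timestamp is Pacific time on Dec 18, 2009.
report_pacific = datetime(2009, 12, 18, 12, 17, tzinfo=PST)
report_central = report_pacific.astimezone(CST)

# Start of the outage window as stated on the Rackspace blog (3:37 p.m. CST).
stated_start = datetime(2009, 12, 18, 15, 37, tzinfo=CST)

# How long before the stated start the report already existed.
gap = stated_start - report_central

print(report_central.strftime("%I:%M %p %Z"))  # 02:17 PM CST
print(gap)                                     # 1:20:00
```

Under those assumptions, the TechCrunch report predates the start of Rackspace's stated window by an hour and 20 minutes.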
Aside from TechCrunch, a number of other services and blog sites were impacted by the Rackspace outage, including 37signals, Brizzly, Robert Scoble's blog, sites hosted by Laughing Squid, Tumblr, and Mashable.
The Rackspace blog describes the root cause: "The issues resulted from a problem with a router used for peering and backbone connectivity located outside the data center at a peering facility, which handles approximately 20 per cent of Rackspace's Dallas traffic."
The blog post goes on to explain that the router configuration error was part of final testing for data center integration between the Chicago and Dallas facilities, and that it should not have impacted operation during normal business hours. "The network integration of the facilities was scheduled to take place during the monthly maintenance window outside normal business hours, and today's incident occurred during final preparations."
The outage left many Rackspace customers saying "Hey! Who turned off the cloud?"
While a data center outage that impacts popular and well-known sites is a black eye for cloud computing in general, the scope of the impact from this outage was relatively small. As one blog points out, "Rackspace is small potatoes. Now it's a fast growing bag of potatoes, but still dinky. And the other catch: Rackspace is more about hosting than the cloud."
For customers who rely on Rackspace to host their servers, especially Web servers, it may seem very much as if the Internet went down when the Rackspace data center was unavailable. However, cloud computing services like Amazon EC2 and Microsoft Azure, and Internet keystones like Google and Amazon, were not impacted at all by the Rackspace outage.
Mistakes happen, but customers of Rackspace have a right to question the repeated outages and service interruptions. At least one Rackspace customer is also upset about a related issue pertaining to customer notification of network issues like this outage.
The customer's hosted servers were affected by the Rackspace outage, and the customer found out from its own customers' complaints that its site had been unavailable for two hours. In a comment, the customer stated, "We also pay Rackspace extra for a constant monitoring service that is supposed to immediately notify me by email or phone call if our server becomes inaccessible at any time. I was HIGHLY disturbed to find out that Rackspace actually SUPPRESSED these notifications from being sent to their customers for some strange reason."
The comment offers no evidence to support the claim that Rackspace intentionally withheld notification, and I have not had any feedback from Rackspace to confirm or deny the accusation. If it turns out to be true, it would damage Rackspace's credibility and customer service reputation.
The bottom line, though, is that Rackspace determined the cause of the problem and fixed it relatively quickly, and it provided status updates on the blog to keep customers informed. Even brief outages seem devastating to those affected by them, but they will happen, and when they do, this is pretty much how you want them handled.