Amazon Web Services last year was estimated by Gartner to be five times bigger than its next 14 competitors combined. That's a lot of virtual machines. And they all run on a customized version of the open source Xen hypervisor, so when the Xen code has a security vulnerability, that's a big deal for AWS.
In the past six months, AWS has twice had to reboot some of its Elastic Compute Cloud (EC2) servers because of a Xen vulnerability. In September 2014, about 10% of EC2 instances were rebooted, and just this week AWS announced that about 0.1% of instances had to be rebooted to install a security patch. That may not sound like a lot, but at the scale AWS operates, it's still a large number.
What happens inside AWS when a Xen vulnerability is discovered?
The answer is that Steve Schmidt gets busy (not that he isn't already). Schmidt is AWS's vice president of security engineering and chief information security officer (CISO) and he's a former FBI section chief. He's the man keeping AWS's cloud secure. In November, Network World sat down with Schmidt at the AWS re:Invent conference and asked him to walk us through what happened inside AWS's cloud operations during the big September reboot.
Verify the vulnerability
AWS is a big user of Xen code, so company officials are among the first to hear about Xen vulnerabilities identified in the open source community. The company is regularly notified of Xen vulnerabilities before they're made public. When that happens, the first job for Schmidt's team is to determine whether the vulnerability applies to AWS and, if so, to develop and install a patch.
"Xen is a huge software package and there are many aspects that AWS does not use," Schmidt said.
Most of the Xen vulnerabilities do not apply to AWS because the company has developed its own custom version of Xen. AWS has stripped out all the features of Xen that it doesn't need, both in order to customize the performance of the open source code to the company's unique use case, and to limit its exposure to vulnerabilities.
But AWS does something else, too: It doesn't just use one flavor of Xen, it uses many.
"We intentionally build our fleet differently across (the service)," he said. "You don't want everything to be homogeneous because if it is then if a problem affects the fleet, it affects everything." AWS has different custom versions of Xen deployed across different services and regions, and none of them is the vanilla open source code.
An internal hack
If the AWS cloud is affected, the company tries to hack itself.
"We generate a test scenario to determine if we can trigger the vulnerability," Schmidt said. Then, extensive testing is done to determine if the vulnerability has been used against AWS.
Meanwhile, other teams of security engineers are already building a patch and testing it across all the variants of Xen that AWS runs to ensure it meets security and performance requirements.
Sometimes the process of installing the patch requires a reboot, as it has twice in the past half-year. Just like on a common PC, some updates and patches require a reboot and others don't. The majority of patches AWS implements do not require a reboot; AWS has architected its system to minimize the reboots necessary to patch its services.
"We try very hard not to reboot," Schmidt said. If Schmidt's team finds it "technically infeasible" to install the patch without a reboot, then it notifies customers which services will be restarted.
The dreaded reboot
"It was very straightforward," Schmidt said, referring to the September issue. "We couldn't find a way to patch the service without rebooting, so we had to do it."
Complicating efforts in situations like this is the fact that AWS has to inform customers that some of their EC2 instances need to be rebooted, but it can't say why. AWS can't announce the vulnerability to the world and expose itself or other Xen users.
Customers should be ready for a reboot at any time, though, and there are steps users should take to ensure their systems can withstand a reboot or VM failure. One is to design their systems to be stateless, so that if there is a reboot or a VM failure the application fails over to healthy VMs without skipping a beat.
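To make the stateless idea concrete, here is a minimal Python sketch. It is illustrative only and not drawn from AWS or from anything Schmidt described: the `Instance` class, the `route` function and the in-memory `shared_store` (a stand-in for an external database or cache) are all hypothetical names. The point is that when no VM holds session state of its own, a request can simply be retried on the next healthy VM.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a VM; it keeps no session state itself."""
    name: str
    healthy: bool = True

    def handle(self, store: dict, user: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is rebooting")
        # All state lives in the shared store, not on the instance.
        store[user] = store.get(user, 0) + 1
        return f"{self.name} served {user} (visit {store[user]})"


def route(instances: list, store: dict, user: str) -> str:
    """Try each instance in order, failing over past unavailable ones."""
    for instance in instances:
        try:
            return instance.handle(store, user)
        except ConnectionError:
            continue
    raise RuntimeError("no healthy instances available")


shared_store = {}  # assumption: stands in for an external database or cache
fleet = [Instance("vm-a"), Instance("vm-b")]

first = route(fleet, shared_store, "alice")
fleet[0].healthy = False  # simulate vm-a being rebooted for a patch
second = route(fleet, shared_store, "alice")

print(first)   # vm-a served alice (visit 1)
print(second)  # vm-b served alice (visit 2)
```

Because the visit count lives in the shared store rather than on vm-a, the second request picks up where the first left off even though a different VM answers it.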
Back in September, Network World spoke with a handful of AWS users, and most survived the reboot without a major issue. Born-in-the-cloud apps tend to be resilient to failure; legacy apps that have been migrated tend to have more trouble.
Schmidt said AWS is always looking to improve its services, both technically, so that it doesn't have to reboot VMs, and in how well it keeps customers informed. Part of that process includes sponsoring academic research, including some leading studies into how Xen servers can be hot-patched without requiring a reboot.