Google offers lots of services and it has pretty good reliability. How does the company do it?
Much of that is up to Ben Treynor, Google's vice president of engineering, and founder of the company's site reliability team. And he's developed an interesting approach at Google for thinking about reliability.
People may assume that the vendor is aiming for Google Apps and its other services to be up and available 100% of the time. Sure, that may be the goal, but Treynor is realistic. Each Google product has a service level agreement (SLA) that dictates how much downtime the product can have in a given month or year. Take 99.9% uptime, for example: That allows for about 43 minutes of downtime in a 30-day month, or about 8 hours and 46 minutes per year. That allowance is what Treynor refers to as an "error budget."
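The arithmetic behind those figures is simple enough to sketch. The snippet below is only an illustration, not anything Google has published: the function name `error_budget_minutes` and the 30-day month and 365-day year are assumptions made for the example.

```python
def error_budget_minutes(availability: float, period_days: float) -> float:
    """Return the downtime, in minutes, that an availability target permits.

    `availability` is a fraction (0.999 means 99.9% uptime).
    Hypothetical helper for illustration; not part of any Google tooling.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return (1.0 - availability) * total_minutes


# Assumed period lengths: a 30-day month and a 365-day year.
monthly = error_budget_minutes(0.999, 30)
yearly = error_budget_minutes(0.999, 365)

hours, minutes = divmod(yearly, 60)
print(f"Monthly budget: {monthly:.1f} minutes")
print(f"Yearly budget: {int(hours)} hours {minutes:.1f} minutes")
```

Moving the target from 99.9% to 99.99% shrinks the budget tenfold, which is why each extra "nine" is so much harder to deliver.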
Google product managers don't have to be perfect - they just have to be better than their SLA guarantee. So each product team at Google has a "budget" of errors it can make. Basically, they just can't make more mistakes than what the SLA allows for.
Treynor explains that in a traditional site reliability model there is a fundamental disconnect between site reliability engineers (SREs) and the product managers. Product managers want to keep adding services to their offerings, but the SREs don't like changes because that opens the door to more potential problems. This "error budget" model addresses that issue, though, by uniting the priorities of the SREs and product teams.
If the product adheres to the SLA's uptime promise, then the product team is allowed to launch new features. If the product is outside of its SLA, then no new features are allowed to be rolled out until the reliability improves.
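That rule amounts to a simple gate on releases. Here is a minimal sketch of how such a check might look, assuming downtime is tracked in minutes over a fixed period. The `ErrorBudget` class and its fields are hypothetical and do not describe Google's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical tracker for one product's error budget."""
    availability_target: float   # e.g. 0.999 for a 99.9% SLA
    period_minutes: float        # length of the SLA period in minutes
    downtime_minutes: float = 0.0

    @property
    def allowed_minutes(self) -> float:
        # Downtime the SLA tolerates over the whole period.
        return (1.0 - self.availability_target) * self.period_minutes

    @property
    def remaining_minutes(self) -> float:
        # Negative once the product has fallen outside its SLA.
        return self.allowed_minutes - self.downtime_minutes

    def record_outage(self, minutes: float) -> None:
        self.downtime_minutes += minutes

    def can_launch(self) -> bool:
        # Launches are allowed only while the product is within its SLA.
        return self.remaining_minutes >= 0


# Assumed example: a 99.9% target over a 30-day month.
budget = ErrorBudget(availability_target=0.999, period_minutes=30 * 24 * 60)
budget.record_outage(20)
print(budget.can_launch())   # still within budget
budget.record_outage(30)
print(budget.can_launch())   # budget exhausted, launches frozen
```

The point of the design is that the launch decision falls out of a number both teams agree on, rather than being negotiated release by release.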
Putting the onus on the product developers to architect reliable systems is a win-win for everyone: SREs get reliable systems, developers get to add features, and users don't experience downtime (hopefully). Having a system of error budgets - instead of mandating 100% uptime - gives developers and engineers some leeway, while more closely aligning the priorities of developers and site reliability workers. Watch a video of Treynor explaining the process here.
It seems to work. According to tracking firm CloudHarmony, Google's IaaS cloud computing platform had some of the best uptime statistics among the major vendors last year. See more details of how Google compared to Amazon, Microsoft and others here. Of course outages still do happen; Google Compute Engine (GCE) suffered one this month, in fact.