Two open source tools have helped National Australia Bank build a resilient cloud-based infrastructure to deliver its online presence, David Broeren, head of the digital and online channels at the bank, told the Amazon Web Services summit in Sydney today.
NAB has been using Bees with Machine Guns — a load testing utility developed by the News Applications team at the Chicago Tribune — and the infamous Netflix-developed Chaos Monkey tool to test the resilience of nab.com.au, which is hosted in Amazon's public cloud.
Broeren traces the bank’s relatively recent transition to cloud for delivering its online presence back five years to what is now known as ‘agile@nab’, which started with the deployment in 2009 of Jenkins for CI, and continued with the establishment of an internal Git-based source code repository in 2011 and, more recently, the use of the Artifactory repository manager.
These were “key foundational processes that we needed to do to be able to prepare us for what we do now,” which is an “infrastructure as code” approach, Broeren told the conference.
The transition to AWS started with creating an account and the single click of a button, Broeren said.
“The first click of the button actually established two EC2 instances linked together through an Elastic Load Balancer. The next step through CloudFormation was to establish S3 with all the server images that we needed to be able to build the site. [Then] CloudWatch alerting. And to finish it off, autoscaling.”
“In 59 minutes we went from an account with Amazon Web Services to basically two data centres,” Broeren said.
“So the next click was to actually deploy servers, and within two minutes we had 40 servers out there running. Again – nothing actually running on it but we had two data centres, all the servers out there running.
“And the last click, did actually put nab.com.au out onto it.”
The point of using Bees with Machine Guns is “to put a brute force load onto the site to test out its resilience,” he said. “Launch the bees, ramp up the load. But the beauty of Amazon Web Services is that it was ready for it: CloudWatch alerting gives autoscaling a heads up, and we then add more servers automatically to take the load.
“From there, it’s pretty simple – you take the bees away and Amazon [takes] us back to our standard config.”
Bees with Machine Guns is run in NAB’s dev environment; however, thanks to the use of cloud, the dev, test and production environments are all identical, Broeren said.
On the other hand, Chaos Monkey is run in NAB’s production environment, he added. Chaos Monkey randomly attacks and destroys EC2 instances, which can then be replaced automatically.
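The scale-out and scale-in behaviour Broeren describes can be illustrated with a small simulation. This is a minimal sketch, not NAB's configuration and not the AWS API: the function name `desired_servers`, the assumed capacity of 100 requests per second per server, the baseline of two servers and the cap of 40 are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical values for illustration only; not taken from NAB's setup.
BASELINE_SERVERS = 2        # the "standard config" the fleet returns to
MAX_SERVERS = 40            # assumed upper bound on the fleet size
CAPACITY_PER_SERVER = 100   # assumed requests per second one server can absorb


def desired_servers(load_rps: float) -> int:
    """Return how many servers a simple threshold rule would ask for."""
    needed = math.ceil(load_rps / CAPACITY_PER_SERVER)
    # Never drop below the baseline and never exceed the cap.
    return max(BASELINE_SERVERS, min(MAX_SERVERS, needed))


def simulate(load_profile):
    """Map each observed load sample to the fleet size chosen for it."""
    return [desired_servers(load) for load in load_profile]


if __name__ == "__main__":
    # Quiet period, load test ramping up, then the load is removed.
    profile = [50, 150, 900, 2500, 900, 50]
    print(simulate(profile))  # prints [2, 2, 9, 25, 9, 2]
```

In this toy model the fleet grows while the load test ramps up and falls back to the two-server baseline once the load is removed, which mirrors the return to "standard config" in the quote.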
“To get full effect you’ve got to run it in production – and we run it 24 by 7, 365 days of a year,” he said.
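The terminate-and-replace cycle can be sketched as a toy simulation. It does not use Chaos Monkey's actual code or make any AWS call; the `Fleet` class, its method names and the instance IDs are hypothetical, and the `heal` step merely stands in for what an autoscaling group would do.

```python
import itertools
import random


class Fleet:
    """Toy stand-in for an autoscaling group (hypothetical, not an AWS API)."""

    def __init__(self, desired: int):
        self.desired = desired
        self._ids = itertools.count(1)  # source of fresh, never-reused IDs
        self.instances = set()
        self.heal()

    def heal(self) -> None:
        """Launch replacements until the fleet is back at its desired size."""
        while len(self.instances) < self.desired:
            self.instances.add(f"i-{next(self._ids):04d}")

    def terminate(self, instance_id: str) -> None:
        """Remove one instance, as an unexpected failure would."""
        self.instances.remove(instance_id)


def unleash_monkey(fleet: Fleet, rng: random.Random) -> str:
    """Terminate one randomly chosen instance and return its ID."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.terminate(victim)
    return victim


if __name__ == "__main__":
    fleet = Fleet(desired=4)
    rng = random.Random(0)  # seeded so repeated runs behave the same
    for _ in range(3):
        victim = unleash_monkey(fleet, rng)
        fleet.heal()
        print(f"terminated {victim}, fleet size back to {len(fleet.instances)}")
```

Because the fleet is restored to its desired size after every termination, the loss of a single instance never changes the capacity seen by users in this model.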
The upshot of switching to cloud has not just been agility, but also the ability to reduce the load on the bank’s operations teams, Broeren said.
“There are tens of billions of dollars that go through the bank each day and it’s a stressful job. If there’s anything I can do to make the operations guys’ life easier, I’ll do it.”
“From a change perspective we’ve been able to go from major to minor changes for most things on the nab.com.au platform,” Broeren said.
A second factor reducing the load on ops is the ability to automatically scale to cope with increased load, which has allowed the bank to remove threshold monitors.
“With NAB.com.au on Amazon we know that with increased demand the system will scale itself and for every change that we put through the platform we’ve tested it – we’ve safely removed those thresholds. So no calls, no SMSes.”
Finally, he said, the combination of autoscaling and Chaos Monkey means that “something which would traditionally be a high severity incident – that is, loss of a server … with Chaos Monkey we’re always testing it, so we go from a high severity to an information event.”
Broeren said that his team is looking at other tools such as Janitor Monkey and Chaos Gorilla from Netflix’s ‘Simian Army’ as part of the next steps “towards the utopia of zero down time.”