Cloud architecture: Questions to ask for reliability
- 19 July, 2011 05:35
I've been an architect on some complex applications, and I have a significant concern about assessing architectural risk for public/private cloud applications. Traditional risk assessments focus on external/internal access to confidential information such as social security numbers, credit card numbers, and, for banks, ATM PINs. Access controls and network protection are high priorities because they reduce that risk.
I'm interested in something a little different -- I'll call it architectural reliability. The desire is to avoid single points of failure for critical applications so that catastrophic errors don't occur; those lead to huge financial losses and a diminished corporate brand. So, where would I start to shore up the architecture? Here are some storage and networking diagnostic questions I would ask for the top-10 applications within a corporation. Note that some questions that need to be asked are pertinent to all applications and some just within a given domain. I'm going to focus on just the storage and networking product domains that support the top-10 applications.
Storage Architecture -- All Applications
Is only one SAN vendor used for storage of all of the applications?
How is data de-duplication addressed?
Is only one SAN switch vendor used for all of the applications?
Is only one data replication vendor used?
Is only one encryption vendor used to encrypt data for all of the applications?
Which encryption algorithm is used for a given encryption tool?
Is only one PKI vendor used to manage certificates?
Where are the certificates related to data at rest encryption stored?
Storage Architecture -- Each Application
What storage subsystem does the application run on?
Which other applications run on the same subsystem?
Is the data on the storage subsystem replicated elsewhere or is this the only copy?
How is the need for more data storage addressed for a given application?
What SAN switch is used for traffic to/from the storage subsystem?
What network components are used to replicate SAN data from one data center to another remote data center?
What is the application that performs data replication?
What is the software version and release for the data replication application?
Which encryption vendor is used to encrypt Confidential data on a given storage subsystem?
Does the storage for the encryption tool also run on a SAN shared with other applications?
Can corruption of the encryption data affect multiple applications or just this application?
What PKI vendor is used?
What version and release of PKI software is deployed?
Network Architecture -- All Applications
Is there only one switch/router vendor?
Is there only one firewall vendor?
Is there only one Intrusion Prevention System/Intrusion Detection System (IPS/IDS) vendor?
Is there only one load balancer vendor?
Is there just one telecommunications vendor to the internet and/or WAN (Wide Area Network)?
Network Architecture -- Each Application
Which switch/routers are used within the data center?
Which switch/router models are used?
Are the switch/routers in an architecturally redundant design?
What embedded software version and hardware model are used in the switch/router deployment?
Which firewall vendor is used?
What models of firewalls are deployed in the data center?
Is only a limited number of firewall permutations deployed (embedded OS version, hardware model, features)?
What intrusion prevention/detection products are deployed?
Which intrusion prevention/detection vendors are used?
What permutations of IPS/IDS are deployed in the data center?
What version of IPS/IDS software is deployed?
Which vendor's load balancers are used?
Which load balancer model is used?
What is the version of the load balancer's embedded software and model of hardware?
Are the load balancers used to steer traffic between different global data centers?
Are the load balancers redundant, so that one could instantly take the place of another?
What telecommunications vendors are used for internet access?
What WAN telecommunications vendor is used for traffic between data centers?
What WAN telecommunications vendor is used for traffic between offices and the data center?
Is the telecommunications equipment redundant?
Are the underground telecommunications fiber routes physically separate?
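The permutation questions above (for firewalls and IPS/IDS) lend themselves to a simple count over a device inventory. What follows is only a minimal sketch in Python, assuming an inventory is available as a list of records; the vendor names, model names, versions, and the function `permutations_by_type` are all hypothetical and do not come from any real tool.

```python
from collections import defaultdict

# Hypothetical inventory records (assumption, for illustration only):
# (product type, vendor, hardware model, embedded software version)
DEPLOYED = [
    ("firewall", "VendorA", "FW-100", "8.2"),
    ("firewall", "VendorA", "FW-100", "8.2"),
    ("firewall", "VendorA", "FW-200", "8.4"),
    ("ips_ids", "VendorB", "IPS-10", "5.1"),
]


def permutations_by_type(devices):
    """Group devices by product type and count each distinct permutation.

    A permutation here means one (vendor, model, version) combination.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for product_type, vendor, model, version in devices:
        counts[product_type][(vendor, model, version)] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {ptype: dict(perms) for ptype, perms in counts.items()}


if __name__ == "__main__":
    for ptype, perms in permutations_by_type(DEPLOYED).items():
        print(f"{ptype}: {len(perms)} permutation(s)")
        for perm, n in sorted(perms.items()):
            print(f"  {perm} x{n}")
```

A small number of permutations per product type keeps disaster recovery testing tractable; a product type with exactly one permutation is also a candidate single point of failure, so both ends of the count are worth reviewing.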
These questions cover a large chunk of the storage and networking diagnostics. I'm sure that I've missed some, but this should provide a flavor of what the critical web applications are using within the infrastructure cloud layer. These questions give insight into whether or not failure in a given product would affect multiple applications. They help companies design and tune the architecture properly so that redundancy can be created in all products where possible. Then the failure of a given product does not cascade to multiple critical applications. It is very likely much cheaper to over-engineer, thereby anticipating and reacting well to failure, than it is to suffer very expensive cloud services downtime.
The questions about whether only one vendor is used for a given product type reveal a potential enterprise weakness. Full reliance on one vendor can lead to significant failure if a specific product hardware/software release has a flaw that surfaces only under stressful conditions. Then all cloud applications that use that product would be negatively impacted. The other questions address what I'll call use congestion: multiple applications sharing the same component (storage subsystem, server, or firewall). A failure of that component affects all of those applications simultaneously.
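Both checks described here, single-vendor reliance and use congestion, can be run mechanically once the answers to the questions are recorded per application. The following is a minimal sketch in Python under that assumption; the application names, vendor names, component identifiers, and both function names are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical answers to the per-application questions (assumption):
# application -> list of (product type, vendor, component id)
INVENTORY = {
    "billing": [("san", "VendorA", "san-01"), ("firewall", "VendorC", "fw-01")],
    "web-store": [("san", "VendorA", "san-01"), ("firewall", "VendorD", "fw-02")],
    "payroll": [("san", "VendorA", "san-02"), ("firewall", "VendorC", "fw-01")],
}


def single_vendor_types(inventory):
    """Return product types for which every application uses the same vendor."""
    vendors = defaultdict(set)
    for components in inventory.values():
        for product_type, vendor, _component_id in components:
            vendors[product_type].add(vendor)
    return sorted(ptype for ptype, seen in vendors.items() if len(seen) == 1)


def shared_components(inventory):
    """Return components used by more than one application (use congestion)."""
    users = defaultdict(set)
    for app, components in inventory.items():
        for _product_type, _vendor, component_id in components:
            users[component_id].add(app)
    return {cid: sorted(apps) for cid, apps in users.items() if len(apps) > 1}


if __name__ == "__main__":
    print("Single-vendor product types:", single_vendor_types(INVENTORY))
    print("Shared components:", shared_components(INVENTORY))
```

In this made-up inventory, the SAN product type relies on one vendor, and two components are each shared by two applications, so a failure in either would hit both of its applications at once.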
In summary, this article focuses on architectural reliability. It lays out a set of questions focused on products within the storage domain, encryption of data at rest, and the networking domain. Since products cost much less than application downtime, over-engineering is encouraged where possible. The need to deploy more product vendors must be balanced with the need to limit product and feature permutations so that realistic disaster recovery scenarios can be tested. Please see a previous article that I wrote on this. I'll visit other cloud layer diagnostic questions in the next article.