Outages are becoming a little too normal

As we come to terms with the fallout from major outages caused by recent widespread cyberattacks, it’s time to look at areas of concern in IT infrastructure protection and what we can do to prevent serious problems.

We’re no strangers to cyber security issues here in Australia. In fiscal year 2015-2016, CERT Australia, the main point of contact for cyber threats affecting Australian businesses, responded to an overwhelming 15,000 incidents. And those are just the ones that were reported. Yet while cyber security is often the topic du jour, the physical security and resiliency of your data centre are, more often than not, of equal importance.

The data centre, the heartbeat of any organisation’s digital and IT presence, is a natural point of entry for a skilled attacker looking to cause damage. Nowadays, we’re seeing more organisations adopt a mixed approach to how they lay out their data centre. Public cloud in particular is growing significantly, with Gartner forecasting that cloud spend in Australia will rise 15 per cent to $6.5 billion in 2017.

While this hybrid approach brings benefits, keeping certain workloads under your own roof while putting others into the cloud where they can scale up as needed, it also means there’s more to think about when keeping valuable data and applications secure.

Getting the basics right

When it comes to avoiding outages, a crucial area of consideration is, unsurprisingly, resilience. This is twofold. The first part is preventing any outage or interruption from occurring. The second, quite often forgotten, is recovering from an outage or interruption when it does occur. If you can’t bounce back quickly, that’s when the real damage sets in. Resiliency needs to be embedded into every aspect of your IT strategy: on-premises, public cloud, colocation and so on.
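One way to see why recovery deserves equal billing with prevention is the standard availability formula, availability = MTBF / (MTBF + MTTR): cutting your mean time to repair lifts uptime just as surely as stretching your mean time between failures. A minimal sketch in Python, using illustrative numbers rather than figures from any real facility:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative numbers only: a system that fails once a year on average.
mtbf = 8760.0  # mean time between failures, in hours

print(f"8-hour recovery: {availability(mtbf, 8.0):.4%}")  # ~99.9088%
print(f"1-hour recovery: {availability(mtbf, 1.0):.4%}")  # ~99.9886%
```

In this toy example, nothing about the system gets more failure-proof, yet shaving recovery from eight hours to one moves it from roughly three nines of availability to close to four.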

What we often see is a misunderstanding around the public cloud: that once you adopt it, all the security and resiliency built into the provider’s platform automatically transfers to your IT framework. This is only partially true, and presuming more can be dangerous. It’s important to work closely with your provider and, if need be, your third-party suppliers to ensure the right levels of resiliency are fully incorporated. You also need to know what their strategies are when something goes wrong, such as an outage or interruption.

Another important piece in getting resiliency right is concurrent maintainability, i.e. the ability to take individual systems offline for maintenance while keeping your critical infrastructure running. This freedom allows for a more rigorous approach to maintenance and also helps keep maintenance workers safe.
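As a rough illustration of the underlying check, the sketch below (hypothetical function and numbers, not any vendor’s tooling) tests whether a bank of power units can have its largest unit withdrawn for maintenance and still carry the critical load:

```python
def concurrently_maintainable(unit_capacities_kw, critical_load_kw):
    """True if any single unit can be taken offline for maintenance
    while the remaining units still carry the full critical load."""
    total = sum(unit_capacities_kw)
    # The worst case is withdrawing the largest unit.
    return total - max(unit_capacities_kw) >= critical_load_kw

# Illustrative example: four 250 kW UPS modules feeding the load.
units = [250, 250, 250, 250]
print(concurrently_maintainable(units, 700))  # True: 750 kW remains
print(concurrently_maintainable(units, 800))  # False: servicing forces an outage
```

A design that fails this test forces either a planned outage or live work every time a unit needs servicing, which is exactly the safety and availability trade-off concurrent maintainability is meant to remove.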

The human element

Humans are inevitably involved in the design, deployment and maintenance of any data centre, which inherently makes human error a potential outage risk. In a sense, every outage is linked to human error, whether directly or through poor maintenance, flawed design, the wrong procedures, or a host of other potential causes.

There’s a tipping point where you can end up with too many cooks in the kitchen. In an attempt to make their sites resilient without the cost, many data centre operators add an enormous amount of complexity to their infrastructure. That means more human involvement, and less automation where automation would be useful, which increases the risk of human error.

Sometimes it’s best to keep it simple. The extra layers of work that come with added complexity aren’t always factored into budgets, and they can end up being more costly in the long run.

Know what you’re buying

Many operators are sold on the concept of N+1 redundancy. And why shouldn’t they be? In its true meaning, N+1 means that every critical component has at least one independent backup counterpart, so a single component failure can be absorbed, and that plus one should flow through to the system level of the device.

The reality is that an outright N+1 system should mean exactly that. But in many instances we are seeing the plus-one label blurred, applied to individual components in isolation rather than carried through to the system level. The next level beyond N+1 is 2N, where you run two totally independent systems, but that option is too expensive for most businesses.

That isn’t to say that N+1 options are inadequate. Completely independent systems are essential for highly sensitive applications, where any loss of power or systems is detrimental to the business, but a well-designed N+1 system can achieve high levels of resiliency without the need to go to 2N, and can also increase utilisation. What’s important is knowing exactly what is meant by N+1, what the supplier is giving you before implementing it, and whether it matches your organisation’s requirements.
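As a back-of-the-envelope illustration of that point, the sketch below compares the probability of dropping the load under a true system-level N+1 design and a 2N design, assuming independent unit failures and purely illustrative numbers. Under those narrow assumptions a genuine N+1 holds its own; what the simple model leaves out is common-mode failure, which is what 2N’s fully separate systems are really there to eliminate:

```python
def p_fail_n_plus_1(n: int, p: float) -> float:
    """System-level N+1: n+1 identical units, any n of which carry the
    load. The system fails only if two or more units are down at once."""
    p_at_most_one_down = (1 - p) ** (n + 1) + (n + 1) * p * (1 - p) ** n
    return 1 - p_at_most_one_down

def p_fail_2n(n: int, p: float) -> float:
    """2N: two fully independent systems of n units each, either of
    which can carry the load. A side is down if any of its units fails."""
    p_side_down = 1 - (1 - p) ** n
    return p_side_down ** 2

# Assumed, illustrative values: the load needs 4 units; each unit is
# independently unavailable 1% of the time.
n, p = 4, 0.01
print(f"true N+1: {p_fail_n_plus_1(n, p):.6f}")  # ~0.000980
print(f"2N:       {p_fail_2n(n, p):.6f}")        # ~0.001553
```

The gap between this calculation and a component-level “plus one” applied in isolation is exactly why it pays to ask what the label on the proposal actually means.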

In the current landscape, CIOs can be blindsided by a good sales pitch, clouding their view of what the organisation really needs. There are so many solutions out there, and costs can ramp up quickly if you develop a hypochondriac approach to data centre security.

Every organisation has a different risk profile – it’s important to know what yours is and what you need to do to stay online and avoid the costly fallout from an outage. Getting this balance right will only become more vital as organisations and the very world we live in become ever more dependent on technology.

Mark Deguara is director, data centre solutions, at Vertiv Australia and New Zealand.
