Storage-area network complexity can mask what might seem to be relatively benign issues that have the potential to build up and cause an outage or brownout. To identify trouble early, you need to create a SAN performance benchmark, an essential first step to setting up metrics to gauge infrastructure performance.
The key is to establish the metrics in advance. Most companies wait until they have a problem before trying to truly understand baseline performance. Ironically, that is the worst time to look, because: a) what is found is often overwhelming; b) multiple issues often appear to be the cause, making it difficult to know where to start; and c) many performance optimization opportunities are overlooked.
Here are best practices for benchmarking SAN performance:
1. Baseline when the SAN is healthy. The best time to evaluate an environment is when everything is healthy and before a cost-saving or performance-enhancing project is implemented. This provides a metric to compare the "problem" state to the baseline, making it immediately obvious where the problem resides.
Ideally, a company should be proactive with the initial baseline and address the issues that are present. Eliminating existing issues helps reduce the number of problems that can together cause a brownout. Optimization savings can then be planned and measured by comparing both consolidation effectiveness and user impact.
A good baseline will often reveal over-provisioned infrastructure, ineffective use of tiers, multi-path issues, uneven load distribution, physical-layer problems, minor device incompatibilities, improper configurations (zoning, I/O request sizes, queue depths), out-of-control applications, unnecessary load and intermittent performance issues.
2. Measure what matters. The most important goal for an application user is to see their actions complete successfully and accurately in a timely fashion. There are two secondary goals for the IT organization: how to resolve user issues, and how to ensure the solutions use only the resources necessary.
Companies often rely on the most readily available metrics rather than the most useful. One such metric is I/Os per second. This metric only addresses two secondary measures: is the I/O causing a problem, and how optimal is it? It does not get to the heart of the most important questions: how quickly are things getting done, and are they all successful?
Rather than looking at I/O, for effective monitoring you need to consider:
* Minimum, maximum and average read/write/other Exchange Completion Time (ECT), nine metrics in all, for every host bus adapter (HBA), storage port and logical unit number (LUN).
* Minimum, maximum and average time from read command to first data for every HBA, storage port and LUN.
* Minimum, maximum and average pending exchanges (queue depth) for every HBA, storage port and LUN.
* Read/write/other I/O size for every HBA, storage port and LUN.
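As a sketch of how the per-resource metrics above might be collected, the following keeps a running min/max/average per resource and operation type. The class, resource names and sample latencies are illustrative assumptions, not drawn from any vendor tool:

```python
from dataclasses import dataclass

@dataclass
class LatencyStats:
    """Running min/max/average for one metric on one resource
    (an HBA, a storage port, or a LUN)."""
    count: int = 0
    total_ms: float = 0.0
    min_ms: float = float("inf")
    max_ms: float = 0.0

    def record(self, latency_ms: float) -> None:
        self.count += 1
        self.total_ms += latency_ms
        self.min_ms = min(self.min_ms, latency_ms)
        self.max_ms = max(self.max_ms, latency_ms)

    @property
    def avg_ms(self) -> float:
        return self.total_ms / self.count if self.count else 0.0

# One stats object per (resource, operation) pair yields the nine
# ECT metrics: min/max/avg for each of read, write and other.
baseline: dict[tuple[str, str], LatencyStats] = {}

def record_ect(resource: str, op: str, latency_ms: float) -> None:
    baseline.setdefault((resource, op), LatencyStats()).record(latency_ms)

# Hypothetical read-ECT samples for a single LUN:
for ms in (2.1, 3.4, 2.8):
    record_ect("lun-0042", "read", ms)

stats = baseline[("lun-0042", "read")]
print(stats.min_ms, stats.max_ms, round(stats.avg_ms, 2))
```

The same structure extends naturally to the read-to-first-data, queue-depth and I/O-size metrics by adding more operation keys per resource.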
Another common mistake is to give a metric more credit than it deserves, for example, relying on server response time (reported either by the operating system or by an application on the server) to determine the health of the rest of the infrastructure.
There are several problems that make this approach insufficient for determining whether the infrastructure is causing issues. One is that the measurement is affected by every resource on the server: something as simple as a busy CPU can make response times appear artificially long when I/O transaction times are not the real problem.
The other issue is that it relies on the same resources being monitored to do the monitoring, so only large averages or samples can be gathered. Ironically, when things are slow, fewer transactions are completed. If the slowdown affects only one resource on the server (for example, a single LUN or virtual machine), response times can still look good even though there is a big problem. When you average tens of thousands of good transactions with tens of thousands of bad ones, the result is that everything looks good, and outlying infrastructure problems are missed.
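The masking effect is easy to demonstrate with made-up numbers. In this sketch (the LUN names and latencies are hypothetical), one of three LUNs is a hundred times slower than the others, yet the blended average barely moves:

```python
# Hypothetical latencies (ms): two healthy LUNs and one in trouble.
healthy = [2.0] * 20000          # 20,000 good transactions per LUN
slow = [200.0] * 200             # one LUN's transactions are 100x slower

all_samples = healthy + healthy + slow

# The blended average stays near 3 ms and looks perfectly healthy...
overall_avg = sum(all_samples) / len(all_samples)
print(f"overall average: {overall_avg:.2f} ms")

# ...but a per-resource breakdown exposes the 200 ms problem on lun-c.
per_lun = {
    "lun-a": sum(healthy) / len(healthy),
    "lun-b": sum(healthy) / len(healthy),
    "lun-c": sum(slow) / len(slow),
}
print(per_lun)
```

This is why per-HBA, per-port and per-LUN metrics matter: the problem only becomes visible when latency is attributed to individual resources rather than averaged across the server.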
3. Measure the complete I/O transaction path. Because application response time is measured on the server by the server, it is only a rough indicator in a benchmark. Administrators should look to latency deltas throughout the data path to establish baselines for effective troubleshooting.
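One way to picture latency deltas along the data path: capture a timestamp for the same exchange at each measurement point and difference adjacent points. The measurement points and numbers below are purely illustrative, not from any real trace:

```python
# Hypothetical timestamps (ms, relative to I/O issue) for one exchange,
# captured at successive points along the data path.
path_points = {
    "host_issue": 0.0,
    "switch_ingress": 0.1,
    "array_receive": 0.3,
    "array_respond": 4.8,
    "host_complete": 5.1,
}

# Latency delta between each pair of adjacent measurement points.
names = list(path_points)
deltas = {
    f"{a}->{b}": path_points[b] - path_points[a]
    for a, b in zip(names, names[1:])
}
for segment, ms in deltas.items():
    print(f"{segment}: {ms:.1f} ms")

# The largest delta localizes the latency; in this made-up trace the
# array's service time dominates the end-to-end completion time.
slowest = max(deltas, key=deltas.get)
```

With a healthy-state baseline of these deltas, a later problem can be localized immediately to the segment whose delta has grown, rather than debated between server, fabric and array teams.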
Another mistake is relying on the end devices or components to tell you if the infrastructure is healthy. Storage arrays and switches provide useful information when problems are present, but they aren't designed to determine whether a problem exists in the infrastructure as a whole. They are inward-focused rather than infrastructure-focused and lack the granularity to be conclusive. They simply cannot prove that every transaction is completing successfully, from host to array and back again, in a timely fashion.
4. Use non-intrusive instrumentation. Use instrumentation that is vendor-independent and not SAN component-derived. It will help provide accurate, comprehensive, cross-vendor benchmark metrics. The ideal way to baseline an infrastructure is to find a solution that monitors the environment without the performance impact that a component might have or the outside influences that a server has.
5. Measure every transaction, in real time. The solution needs to monitor every transaction to ensure each completes successfully and in a timely fashion, and to report the data frequently enough (ideally every second) that outliers are not missed. The typical one- or five-minute averages most tools report are guaranteed to miss problems.
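The granularity point can also be shown with a toy example (all samples here are invented): a three-second latency spike that a five-minute average swallows entirely is unmistakable in per-second maxima:

```python
from collections import defaultdict

# Hypothetical per-transaction ECT samples: (timestamp_s, latency_ms).
# Latency is a steady 2 ms except for a spike at t = 150-152 s.
samples = [(t, 2.0) for t in range(300)]
samples += [(t, 500.0) for t in (150, 151, 152)]

# A single five-minute average stays under 7 ms and hides the spike...
five_min_avg = sum(ms for _, ms in samples) / len(samples)
print(f"5-minute average: {five_min_avg:.1f} ms")

# ...while per-second maxima make the outlier seconds obvious.
per_second = defaultdict(float)
for t, ms in samples:
    per_second[t] = max(per_second[t], ms)
spikes = [t for t, ms in sorted(per_second.items()) if ms > 50]
print("seconds with outliers:", spikes)
```

A 500 ms excursion is exactly the kind of brownout precursor the article warns about, and it survives only at per-second resolution.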
Establishing metrics in advance based on data captured when the SAN is healthy and application response times are acceptable is key to identifying SAN troubles early. These benchmarks will enable you to spot what otherwise might seem like benign issues and prevent outages.
Foster is a principal architect for Virtual Instruments professional services.