With the economy still shaky and the need for storage exploding, almost every storage vendor claims it can reduce the amount of data you must store. Trimming your data footprint not only cuts costs for hardware, software, power and data center space, but also eases the strain on networks and backup windows.
But how do you know which technique to use? First you have to understand how your business uses data and determine when the cost savings of data reduction are worth the resulting drop in performance.
The technique that's best for you depends not so much on the industry you're in as it does on the type of data you store. For example, deduplication often doesn't deliver significant savings for X-rays, engineering test data, video or music. But it can significantly reduce the cost of backing up virtual machines used as servers. Here are five techniques to help reduce your stored-data volume.
1. Deduplication
Deduplication -- the process of finding and eliminating duplicate pieces of data stored in different data sets -- can reduce storage needs by up to 90%. For example, through deduplication, you could ensure that you store only one copy of an attachment that was sent to hundreds of employees. Deduplication has become almost a requirement for backup, archiving and just about any form of secondary storage where speed of access is less important than reducing the data footprint.
Chris Watkis, IT director at health care advertising and marketing firm Grey Healthcare Group, is seeing reduction ratios as high as 72:1 for backup data, thanks to a deduplication process that uses FalconStor Software Inc.'s Virtual Tape Library storage appliance. And cloud storage services vendor i365 is achieving 30:1 to 50:1 reductions in data on a mixed workload of Microsoft Exchange, SharePoint, SQL Server and VMware virtual machine files, says Chief Technology Officer David Allen.
Data can be deduped at the file or block level, with different products able to examine blocks of varying sizes. In most cases, the more fine-grained assessment a system can do, the greater the space savings. But fine-grained deduplication might take longer and therefore slow data access speeds.
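To make the block-level idea concrete, here is a minimal Python sketch that assumes fixed-size blocks and SHA-256 fingerprints. The function names and block sizes are illustrative assumptions, not taken from any product named in this article; commercial systems often use variable-length blocks and far more elaborate indexes.

```python
import hashlib

def dedupe_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block."""
    store = {}   # fingerprint -> block contents (each unique block stored once)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)
        recipe.append(fingerprint)
    return store, recipe

def rehydrate(store, recipe) -> bytes:
    """Return the data to its original form from the stored blocks."""
    return b"".join(store[fingerprint] for fingerprint in recipe)

def stored_bytes(store) -> int:
    """Bytes actually kept after deduplication (ignoring index overhead)."""
    return sum(len(block) for block in store.values())
```

Running the same data through `dedupe_blocks` with a smaller `block_size` will usually find more duplicates, at the cost of more fingerprints to compute and look up, which is the trade-off between space savings and speed described above.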
Deduplication can be done preprocessing, or inline, as the data is being written to its target; or postprocessing, after the data has been stored on its target. Postprocessing is best if it's critical to meet backup windows with fast data movement, says Greg Schulz, senior analyst at The Server and StorageIO Group. But consider preprocessing if you have "time to burn" and need to reduce costs, he says.
While inline deduplication can cut the amount of data stored by a ratio of about 20:1, it isn't scalable, and it can hurt performance and force users to buy more servers to perform the deduplication, critics say. On the other hand, Schulz says that postprocessing deduplication requires more storage as a buffer, making that space unavailable for other uses.
For customers with multiple servers or storage platforms, enterprisewide deduplication saves money by eliminating duplicate copies of data stored on the various platforms. This is critical because most organizations create as many as 15 copies of the same data for use by applications such as data mining, ERP and customer relationship management systems, says Randy Chalfant, vice president of strategy at disk-based storage vendor Nexsan Corp. Users might also want to consider a single deduplication system to make it easier for any application or user to "rehydrate" data (return it to its original form) as needed and avoid incompatibilities among multiple systems.
Schulz says primary deduplication products could perform in preprocessing mode until a certain performance threshold is hit, then switch to postprocessing.
Another option, policy-based deduplication, allows storage managers to choose which files should undergo deduplication, based on their size, importance or other criteria.
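A policy of this kind can be as simple as a rule evaluated per file. The Python sketch below is purely illustrative: the `FileInfo` fields, the `critical` flag and the 1MB threshold are hypothetical choices, not settings from any vendor's product.

```python
from dataclasses import dataclass

@dataclass
class FileInfo:
    name: str
    size_bytes: int
    critical: bool  # hypothetical "importance" flag set by the storage manager

def should_dedupe(f: FileInfo, min_size: int = 1_000_000) -> bool:
    """Illustrative policy: dedupe only large files that are not performance-critical."""
    return f.size_bytes >= min_size and not f.critical
```

A storage manager could swap in other criteria, such as file type or age, without changing how the rest of the system calls the policy.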
SFL Data, which gathers, stores, indexes, searches and provides data for companies and law firms involved in litigation, has found a balance between performance and data reduction. It's deploying Ocarina Networks' 2400 Storage Optimizer for "near-online" storage of compressed and deduplicated files on a BlueArc Mercury 50 cluster that scales up to 2 petabytes of usable capacity, rehydrating those files as users require them.
"Rehydrating the files slows access time a bit, but it's far better than telling customers they have to wait two days" to access those files, says SFL's technical director, Ruth Townsend, noting that the company gets as much as 50% space savings through deduplication and file compression.
2. Compression
Probably the best-known data reduction technology, compression is the process of finding and eliminating repeated patterns of bytes. It works well with databases, e-mail and files, but it's less effective for images. It's included in some storage systems, but you can also find stand-alone compression applications or appliances.
Dedupe and Compression: Better Together?
Some vendors offer, or will offer, both deduplication and compression. Others, such as Ocarina, decode already-compressed files before optimizing them. Randy Chalfant, vice president of strategy at Nexsan, argues that data should be compressed at the file or operating system level and deduplicated on the storage target. Cloud-based deduplication and compression vendor Asigra Inc. first compresses and then deduplicates data, and stores only changes made to it.
The choice of whether, when and in what order to use both compression and deduplication depends on factors such as whether compression will make it easier or harder for the deduplication software to scan for redundancies, what tier (primary vs. secondary) you're looking to optimize, and how quickly the product can return data to a usable form when needed.
-- Robert L. Scheier
Real-time compression that doesn't delay access or slow performance by requiring data to be decompressed before it's modified or read is suitable for online applications like databases and online transaction processing, says Schulz. The computing power within modern multicore processors also makes server-based compression an option for some environments, he adds.
Allen of i365 says the benefits of compression vary. It can reduce data by ratios of 6:1 or more for SQL databases, but for file servers the ratios are closer to 2:1. According to Fadi Albatal, vice president of marketing at FalconStor, compression is most effective on backup, secondary or tertiary storage, where it can reduce storage needs by ratios of 2:1 to 4:1 for "highly active" database or e-mail applications. When information management services firm Iron Mountain Inc. archives applications, compression and deduplication reduce storage by as much as 80%, says T.M. Ravi, Iron Mountain's chief marketing officer.
IBM focused attention on compression of primary storage with its acquisition of Storwize, whose appliance writes compressed files back to the NAS device on which they originated or to another tier of storage. Storwize is beta-testing a block-based appliance, says Doug Balog, vice president of IBM storage.
Files compressed by Microsoft Office applications or popular image formats such as JPEG can't be reduced with many common compression techniques or may even increase in size. Neuxpower Solutions Ltd. claims that its software can shrink Office and JPEG files by as much as 95% without loss of image quality by removing unnecessary information such as metadata or details that can't be seen unless the image is enlarged. Ocarina, which is being acquired by Dell, says its products offer similar capabilities because they use multiple optimization algorithms tuned for different types of content, and they have the ability to test and choose among various compression methods for the best runtime efficiency.
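The difference between compressible and already-compressed content is easy to see with a general-purpose compressor. This Python sketch uses the standard library's `zlib`; the sample data is made up, and random bytes stand in for content such as JPEG images that has little redundancy left to remove.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Return original size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data))

# Highly repetitive text, loosely resembling rows exported from a database.
repetitive = b"customer_id,order_id,status\n" * 10_000

# Random bytes as a stand-in for already-compressed content.
random_like = os.urandom(len(repetitive))

print(f"repetitive data: {compression_ratio(repetitive):.1f}:1")
print(f"random data:     {compression_ratio(random_like):.2f}:1")
```

The repetitive sample shrinks dramatically, while the random sample typically comes out slightly larger than it went in because the compressor adds its own overhead.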
Deduplication and compression are complementary. "Use compression when the primary focus is on speed, performance, transfer rates. Use deduplication where there is a high degree of redundant data and you want higher space savings," says Schulz.
3. Policy-Based Tiering
Policy-based tiering is the process of moving data to different classes of storage based on criteria such as its age, how often it is accessed or the speed at which it must be available (see "The Politics of Storage"). Unless the policy calls for the outright deletion of unneeded data, this technique won't reduce your overall storage needs, but it can trim costs by moving some data to less expensive, but slower, media.
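As a rough illustration, a tiering policy based on age alone might look like the Python sketch below. The tier names and the 30-day and 365-day cutoffs are hypothetical; real products typically weigh access frequency and service-level requirements as well.

```python
from datetime import datetime, timedelta

def choose_tier(last_access: datetime, now: datetime) -> str:
    """Pick a storage tier from how long ago the data was last accessed."""
    age = now - last_access
    if age <= timedelta(days=30):
        return "primary"    # fast, expensive storage for active data
    if age <= timedelta(days=365):
        return "nearline"   # slower, cheaper disk
    return "archive"        # least expensive, slowest media
```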
Vendors in this market include Hewlett-Packard Co., which offers built-in policy management and automated file migration in its StorageWorks X9000, and DataGlobal GmbH, which says that its unified storage and information management software enables customers to analyze and manage unstructured files and other information and thereby reduce their storage needs by 60% to 70% for e-mail and about 20% for file servers.
Other products with tiering capabilities include Storage Center 5 from Compellent Technologies, HotZone and SafeCache from FalconStor, Policy Advisor from 3Par, EMC's FAST and F5 Networks' ARX series of file virtualization appliances.
4. Storage Virtualization
As is the case with server virtualization, storage virtualization involves "abstracting" multiple storage devices into a single pool of storage, allowing administrators to move data among tiers as needed. Many experts view it as an enabling technology rather than a data reducer, per se, but others see a more direct connection to data reduction.
Actifio Inc.'s data management systems use virtualization to eliminate the need for multiple applications for functions such as backups and disaster recovery. Its appliances let customers choose service-level agreements governing the management of various data sets from a series of templates.
With this method, the proper management policies are then applied to a single copy of the data, defining where, for example, it is stored and how it is deduplicated during functions such as backup and replication. Company co-founder and CEO Ash Ashutosh claims that Actifio can cut storage needs 75% to 90%.
5. Thin Provisioning
Thin provisioning means setting up an application server to use a certain amount of space on a drive, but not using that space until it is actually needed. As with policy-based tiering, this technique doesn't cut the total data footprint, but it lets you delay buying more drives until absolutely necessary.
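The idea can be sketched in a few lines of Python. The `ThinVolume` class below is a hypothetical toy model, not any vendor's implementation: it advertises a fixed number of blocks to the application but consumes physical space only for blocks that have been written.

```python
class ThinVolume:
    """Toy thin-provisioned volume: promises virtual_blocks, allocates on first write."""

    def __init__(self, virtual_blocks: int):
        self.virtual_blocks = virtual_blocks  # size the application sees
        self.allocated = {}                   # block number -> data, filled lazily

    def _check(self, block_no: int) -> None:
        if not 0 <= block_no < self.virtual_blocks:
            raise IndexError("block outside provisioned range")

    def write(self, block_no: int, data: bytes) -> None:
        self._check(block_no)
        self.allocated[block_no] = data       # physical space is consumed only here

    def read(self, block_no: int) -> bytes:
        self._check(block_no)
        return self.allocated.get(block_no, b"")  # unwritten blocks read as empty

    @property
    def physical_blocks_used(self) -> int:
        return len(self.allocated)
```

Comparing `physical_blocks_used` against the physical capacity actually installed is the kind of monitoring that becomes essential once the promised space exceeds what has been purchased.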
If storage needs increase rapidly, you must "react very, very quickly" to ensure that you have enough physical storage, says Allen. The more unpredictable your needs, the better measurement and management tools you need if you adopt thin provisioning. Schulz advises looking for products that identify both the data and applications users need to track, and that monitor not only space usage but read/write operations to prevent bottlenecks.
One of the vendors in this market is IBM, which has extended thin provisioning "into all our storage controllers," says Balog. HP, which provides thin provisioning on its P4000 SANs, is set to acquire 3Par, which guarantees that its Utility Storage product will reduce customers' storage needs by 50%. Nexsan provides thin provisioning with its SATABeast arrays.
Before choosing a data reduction strategy, set policies to help make tough choices about when to pay for performance and when to save money by cutting your data footprint. Don't focus only on reduction ratios, Schulz says, but remember that you might get more savings with a lower reduction rate on a larger data set.
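The arithmetic behind that advice is simple. An N:1 reduction ratio removes a fraction 1 - 1/N of the data, so a 10:1 ratio is a 90% reduction and 20:1 is 95%. The Python sketch below uses hypothetical data set sizes: a 10TB data set reduced at 20:1 frees 9.5TB, while a 100TB data set reduced at only 2:1 frees 50TB.

```python
def percent_saved(ratio: float) -> float:
    """Convert an N:1 reduction ratio into the percentage of space saved."""
    return 100 * (1 - 1 / ratio)

def space_saved_tb(dataset_tb: float, ratio: float) -> float:
    """Capacity freed, in terabytes, when dataset_tb is reduced at ratio:1."""
    return dataset_tb * (1 - 1 / ratio)

# Hypothetical comparison: high ratio on a small data set vs. low ratio on a large one.
small_high = space_saved_tb(10, 20)
large_low = space_saved_tb(100, 2)
```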
And don't be confused by vendor terminology. Compression, data deduplication, "change-only" backups and single instancing are all different ways of reducing redundant data. When in doubt, choose your storage reduction tools based on their business benefits and a detailed analysis of your data.
Which Dedupe Is Right for You?
There are deduplication systems to meet many different needs, depending on the organization's reduction goals and system setup. Here's a sampling:
* Nexsan provides postprocessing deduplication for primary and archive data with its Assureon system, and for backup data with its DeDupe SG offering. DeDupe SG is based on FalconStor's File-interface Deduplication System (FDS) software engine. Combined with single instancing of data, this provides typical reduction ratios from 5:1 to 15:1, says Randy Chalfant, vice president of strategy at Nexsan.
* EMC Data Domain deduplication storage systems are for customers who want to keep their existing backup software but move from tape to disk for backup, says Shane Jackson, senior director of product marketing for EMC's backup recovery systems division. Data Domain supports both structured and unstructured data, with deduplication of various lengths of blocks, achieving reductions of 10:1 to 30:1, he says. EMC's Avamar provides source-based backup software with global deduplication, providing 30:1 to 40:1 reductions, says Philip Fote, marketing manager for the backup recovery systems division.
* Ocarina provides sub-file-level deduplication and compression of unstructured data. Its storage optimizers read data from network-attached storage, deduplicate it, compress it and write the optimized files on either the original NAS or a different storage tier. It optimizes the layout based on characteristics such as block sizes, caching strategies and metadata layout for each storage platform, says Greg Schulz, senior analyst at The Server and StorageIO Group. Ocarina is well suited for unstructured data that may not be "handled as efficiently by dedupe alone," says Schulz. Ocarina also resells its technology to vendors such as BlueArc Corp.
* HP's StoreOnce deduplication software currently runs on HP StorageWorks D2D Backup Systems and compresses data before deduplication, for reductions of up to 20:1. In the future, deploying it across more platforms should let customers avoid the problems caused by using multiple deduplication products, says Lee Johns, marketing director for unified storage products in HP's StorageWorks division. He says HP also plans to use StoreOnce to reduce primary storage in high-availability server clusters.
* Symantec Corp.'s forthcoming VirtualStore is designed to reduce storage requirements for virtual machines and the data associated with them by 80% -- especially for virtual desktop implementations. Among other things, it updates only the changes between the "parent" virtual machine and any clones and provides thin provisioning and tiering. VirtualStore will be available in November; future releases will have deduplication capabilities, according to Symantec.