If you think the storage systems in your data centers are out of control, imagine having 449 billion objects in your database, or having to add 40 terabytes of new data each week.
The challenges of managing massive amounts of big data involve storing huge files, creating long-term archives and, of course, making the data accessible.
While data management has always been a key function in corporate IT, "the current frenzy has taken market activity to a whole new level," says Richard Winter, an analyst with Wintercorp Consulting Services, a firm that studies big data trends.
New products appear regularly from established companies and startups alike. Whether it's Hadoop, MapReduce, NoSQL or one of several dozen data warehousing appliances, file systems and new architectures, the data analytics segment is booming, he says.
"We have products to move data, to replicate data and to analyze data on the fly," says Winter. "Scale-out architectures are appearing everywhere as vendors work to address the enormous volumes of data pouring in from social networks, sensors, medical devices and hundreds of other new or greatly expanded data sources."
Some shops know about the challenges inherent in managing really big data all too well. At Amazon.com, Nielsen, Mazda and the Library of Congress, this task has required adopting some innovative approaches to handling billions of objects and petascale storage media, tagging data for quick retrieval and rooting out errors.
Taking a metadata approach
The Library of Congress processes 2.5 petabytes of data each year, which amounts to around 40TB a week. Thomas Youkel, group chief of enterprise systems engineering at the library, estimates the data load will quadruple in the next few years as the library continues to carry out its dual mandates to serve up data for historians and preserve information in all its forms.
The library stores information on 15,000 to 18,000 spinning disks attached to 600 servers in two data centers. Over 90% of the data, or more than 3PB, is stored on a fiber-attached SAN, and the rest is stored on network-attached storage drives.
"The Library of Congress has an interesting model" in that part of the information stored is metadata -- or data about what is stored -- while the other is the actual content, says Greg Schulz, an analyst at consultancy StorageIO. Although plenty of organizations use metadata, Schulz explains that what makes the Library of Congress unique is the sheer size of its data store and the fact that it tags absolutely everything in its collection, including vintage audio recordings, videos, photos and files on other types of media.
The actual content -- which is seldom accessed -- is ideally kept offline and on tape, with perhaps a thumbnail or low-resolution copy kept on disk, Schulz explains. The metadata can reside in a different repository for searching.
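As a rough illustration of that split, the sketch below keeps a small searchable catalog apart from the content it describes. Every name in it (CatalogEntry, Catalog, the tape labels) is hypothetical and not taken from the library's systems; it only shows a search being answered from metadata without touching the archived content.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Searchable metadata kept on fast storage (hypothetical fields)."""
    object_id: str
    title: str
    media_type: str
    tags: set = field(default_factory=set)
    archive_location: str = ""  # e.g. a tape label; the content itself stays offline


class Catalog:
    def __init__(self):
        self._entries = {}
        self._by_tag = {}

    def add(self, entry):
        self._entries[entry.object_id] = entry
        for tag in entry.tags:
            self._by_tag.setdefault(tag, set()).add(entry.object_id)

    def search(self, *tags):
        """Return entries carrying every requested tag, using metadata only."""
        if not tags:
            return []
        ids = set.intersection(*(self._by_tag.get(t, set()) for t in tags))
        return sorted((self._entries[i] for i in ids), key=lambda e: e.object_id)


catalog = Catalog()
catalog.add(CatalogEntry("rec-001", "Field recording", "audio", {"audio", "1930s"}, "tape-A17"))
catalog.add(CatalogEntry("img-002", "Street photograph", "photo", {"photo", "1930s"}, "tape-B03"))

hits = catalog.search("1930s", "audio")
print([(e.object_id, e.archive_location) for e in hits])
```

Only after a search narrows the results would a request go to the slower archive, using the location recorded in the entry.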
The library uses two separate systems as a best practice for preserving data. One is a massive tape library that has 6,000 tape drive slots and uses the IBM General Parallel File System (GPFS). This file system uses a concept similar to metatagging photos at Flickr.com: files are encoded with algorithms that make the data easier to process and retrieve quickly.
A second archive, with about 9,500 tape drive slots, consists of Oracle SL8550 tape libraries that use the Sun Quick File System (QFS).
Another best practice: Every archive is sent to long-term storage, then immediately retrieved to validate the data, then stored again.
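A store-then-verify round trip of this kind might look like the sketch below. It is a minimal illustration under stated assumptions, not the library's actual procedure: ordinary directories stand in for the tape archive, SHA-256 is an assumed choice of checksum, and the function names are invented for the example. Instead of writing the file a second time, the sketch keeps the archived copy once the read-back has been validated.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Hash a file in blocks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def archive_with_validation(source, archive_dir, staging_dir):
    """Store a file, read it back, compare checksums, then keep the archived copy."""
    source, archive_dir, staging_dir = Path(source), Path(archive_dir), Path(staging_dir)
    expected = sha256_of(source)
    archived = archive_dir / source.name
    shutil.copyfile(source, archived)        # 1. send to long-term storage
    retrieved = staging_dir / source.name
    shutil.copyfile(archived, retrieved)     # 2. immediately retrieve it
    ok = sha256_of(retrieved) == expected    # 3. validate the round trip
    retrieved.unlink()
    if not ok:
        archived.unlink()                    # never keep a copy that failed validation
        raise IOError(f"checksum mismatch for {source.name}")
    return expected                          # 4. validated copy stays in the archive


with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    (root / "archive").mkdir()
    (root / "staging").mkdir()
    sample = root / "sample.bin"
    sample.write_bytes(b"example payload" * 1000)
    print(archive_with_validation(sample, root / "archive", root / "staging"))
```

Recording the returned digest alongside the object's metadata would allow the same check to be repeated later.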
Today the library holds around 500 million objects per database, but Youkel expects this number to grow to as many as 5 billion objects. To prepare for this growth, Youkel's team has started rethinking the namespace system. "We're looking at new file systems that can handle that many objects," he says.
Gene Ruth, a storage analyst at Gartner, says that scaling up and out correctly is critical. When a data store grows beyond 10PB, the time and expense of backing up and otherwise handling all of the files go quickly skyward. One approach: Have one infrastructure in a primary location that handles the ingestion of most of the data, and then have another, secondary long-term archival storage facility.
Splitting files into manageable chunks
Amazon.com, the e-commerce giant that has ventured into cloud services, is quickly becoming one of the largest holders of data in the world, with around 450 billion objects stored in its cloud for its own storage needs and those of its customers. Alyssa Henry, vice president of storage services at Amazon Web Services, explains that this translates to about 1,500 objects for each person in the United States and to one object for every star in the Milky Way galaxy.
Some of the objects in the database are fairly massive -- up to 5TB each -- and could be databases in their own right. Henry says she expects single-object sizes to get as high as 500TB each by 2016.
She says the secret to dealing with massive data is to split the objects into chunks, a process called parallelization.
For its S3 public-cloud storage service, Amazon uses its own custom code to split files into 1,000MB pieces. This is a common practice, but what makes Amazon's approach unique is that the file-splitting process occurs in real time.
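Amazon's code is proprietary, so the following is only a generic sketch of the chunk-and-parallelize idea. The chunk size is shrunk to a few bytes for demonstration, and the function names and the use of a thread pool are assumptions made for the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes; tiny for demonstration, where the article cites 1,000MB pieces


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Cut a byte string into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_chunk(indexed_chunk):
    """Stand-in for writing one chunk to a storage node; returns index, digest, data."""
    index, chunk = indexed_chunk
    return index, hashlib.sha256(chunk).hexdigest(), chunk


def parallel_store(data, chunk_size=CHUNK_SIZE, workers=4):
    """Handle all chunks concurrently instead of one after another."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(store_chunk, enumerate(chunks)))


def reassemble(stored):
    """Rebuild the original object by joining the chunks in index order."""
    return b"".join(chunk for _, _, chunk in sorted(stored, key=lambda r: r[0]))


payload = b"0123456789"
stored = parallel_store(payload)
print(len(stored))                    # 3 chunks: 4 + 4 + 2 bytes
print(reassemble(stored) == payload)  # True
```

Because each chunk carries its own index and digest, chunks can be written, checked and retried independently of one another.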
"This always-available storage architecture is a contrast with some storage systems which move data between what are known as 'archived' and 'live' states, creating a potential delay for data retrieval," Henry explains.
Corrupt files are another challenge that storage administrators have to face when dealing with massive amounts of data. Most companies don't worry about the occasional corrupt file, but when you have 449 billion objects, even low failure rates create a storage challenge.
Amazon uses custom software that analyzes every piece of data for bad memory allocations, calculates checksums and analyzes how fast an error can be repaired to deliver the throughput needed for the cloud storage.
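The details of Amazon's software are not public, but the basic idea of checksum-based error detection, and why scale matters, can be sketched as follows. The corruption rate here is a made-up figure chosen only to show the arithmetic, and CRC32 is an assumed choice of checksum.

```python
import zlib

OBJECT_COUNT = 449_000_000_000      # figure quoted in the article
ASSUMED_CORRUPTION_RATE = 1e-9      # hypothetical rate, not an Amazon number


def expected_corrupt(count, rate):
    """Expected number of damaged objects at a given per-object failure rate."""
    return count * rate


def scrub(store, recorded):
    """Return keys whose current CRC32 no longer matches the recorded value."""
    return sorted(k for k, blob in store.items() if zlib.crc32(blob) != recorded[k])


store = {"a": b"alpha", "b": b"bravo", "c": b"charlie"}
recorded = {k: zlib.crc32(v) for k, v in store.items()}  # checksums taken at write time
store["b"] = b"brav0"                                    # simulate silent corruption

print(scrub(store, recorded))                                           # ['b']
print(round(expected_corrupt(OBJECT_COUNT, ASSUMED_CORRUPTION_RATE)))   # 449
```

Even at a one-in-a-billion rate, a store of this size would expect hundreds of damaged objects, which is why detection has to be automated.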
Henry says Amazon's data storage requirements are destined to grow significantly as its customers keep more and more data in its S3 systems. For instance, some users of the company's cloud-based services are storing massive data sets for genome sequencing, and a customer in the U.S. is using the service to store data collected from sensors implanted on cows to track their movements and health. Henry would not predict how big the data collection might get. Facing demands like those, Amazon is prepared to add nodes quickly to scale out as needed, says Henry.
Relying on virtualization
Mazda Motor Corp., with 800 employees in the U.S., manages around 90TB of stored information.
Barry Blakeley, the infrastructure architect in Mazda's North American Operations, says employees and some 900 Mazda car dealerships are generating ever-increasing amounts of data analytics files, marketing materials, business intelligence databases, SharePoint data and more.
"We have virtualized everything, including storage," says Blakeley. The company uses tools from Compellent, now part of Dell, for storage virtualization and Dell PowerVault NX3100 as its SAN, along with VMware systems to host the virtual servers.
Mazda's small IT staff -- Blakeley did not want to provide an exact head count -- is often hard-pressed to do any manual migrations, especially from disk to tape. But virtualization makes the task easier.
The key, says Blakeley, is to migrate "stale" data quickly onto tape. He says 80% of Mazda's stored data becomes stale within months, which means that blocks of data aren't accessed at all.
To accommodate these usage patterns, the virtual storage is arranged in tiers. Fast solid-state disks connected by Fibre Channel switches form the first tier, which handles 20% of the company's data needs. The remainder of the data is archived to a second tier of slower 15,000-rpm disks on a Fibre Channel system and a third tier of 7,200-rpm disks connected by serial-attached SCSI.
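A tiering policy of this kind is often driven by how recently data was accessed. The sketch below shows one way such a rule could be expressed; the age thresholds and tier labels are hypothetical, since the article does not say what cut-offs Mazda or its Compellent tools use.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; the article gives no cut-off ages.
TIER_RULES = [
    (timedelta(days=30), "tier1-ssd-fc"),
    (timedelta(days=180), "tier2-15k-fc"),
]
DEFAULT_TIER = "tier3-7200-sas"


def choose_tier(last_access, now):
    """Pick the fastest tier whose age limit the block still satisfies."""
    age = now - last_access
    for max_age, tier in TIER_RULES:
        if age <= max_age:
            return tier
    return DEFAULT_TIER


now = datetime(2012, 1, 1)
blocks = {
    "sales-report": now - timedelta(days=5),
    "q2-marketing": now - timedelta(days=90),
    "old-archive": now - timedelta(days=400),
}
placement = {name: choose_tier(accessed, now) for name, accessed in blocks.items()}
print(placement)
```

Running the rule on a schedule would let stale blocks move down to slower disks without manual migration.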
Blakeley says Mazda is putting less and less data on tape -- about 17TB today -- as it continues to virtualize storage.
Overall, the automaker is moving to a "business continuance model" as opposed to a pure disaster-recovery model, he explains. Instead of having backup and offsite storage that would be available to retrieve and restore in a typical disaster-recovery scenario, "we will instead replicate both live and backed-up data to a colocation facility."
In this scenario, Tier 1 applications will be brought online almost immediately in the event of a primary site failure. Other tiers will be restored from backup data that had been replicated to the colocation facility.
Boosting speed with an appliance
The Nielsen Company, the ratings service that helps determine how long TV shows stay on the air, analyzes the audience for local shows in about 20,000 homes and tracks national shows in about 24,000 homes. After various steps -- including calculation, analysis and quality assurance -- the ratings are released to clients within about 24 hours of the initial telecast.
Scott Brown, Nielsen's senior vice president for client insights, says the data is collected in a central processing facility in Florida and some 20TB of data is then stored in Florida and in Ohio. The company uses a series of high-speed SANs and network-attached storage, mostly from EMC, although Brown declined to provide specifics.
Much of the process of generating reports from Nielsen's data warehouses is automated, but there is manual control too. Employees can call up data about a specific report from years earlier, and managers can create custom reports about viewer data.
Fast access to viewer data is business-critical, Brown says, and for that the company uses IBM Netezza appliances for its data warehouses. Tags are automatically added to the data so that specific measurement details can be retrieved. For example, Nielsen can find out how many viewers activated surround-sound audio or whether they used a Boxee device for scheduling their shows.
"We have very granular information needs, and we sometimes want the information summarized up to a broader level -- say, for a customized study of viewer habits," says Brown.
Adapting the techniques
These organizations are proving grounds for methods of handling tremendous amounts of data. StorageIO's Schulz says other companies can mimic some of their processes, including running checksums against files, incorporating metadata and using replication to make sure data is always available.
When it comes to handling massive amounts of data, Schulz says the most important point to remember is that it's critical to use technology that matches your organization's needs, not the system that's cheapest or the one that happens to be popular at the moment.
In the end, the biggest lesson may be that while big data poses many challenges, there are many avenues to success.