This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note that it will likely favor the submitter's approach.
The Big Data trend represents the evolving need to process large amounts of data with a new crop of technology solutions that aren't necessarily your father's database. So, what does a company need to consider when contemplating getting started with Big Data?
First, they need to know what Big Data is. Here is how I define it:
"The emerging technologies and practices that enable the collection, processing, discovery and storage of large volumes of structured and unstructured data quickly and cost-effectively."
Big Data -- from financial trades to human genomes to telemetry sensors in cars to social media interactions to Web logs and beyond -- is expensive to process and store in traditional databases. To solve that problem, new technologies leverage open source solutions and commodity hardware to store data efficiently, parallelize workloads and deliver screaming-fast processing power.
As more IT departments research Big Data alternatives, the discussion centers on stacks, processing speeds and platforms. And while these IT departments are savvy enough to grasp the limitations of their incumbent technologies, many can't articulate the business value of the alternative solutions, let alone how they will classify and prioritize the data once they identify it. Enter Big Data governance.
In fact, as we look at the emerging need for Big Data, the platform and process discussions are only part of the overall approach to Big Data delivery. In reality, we're seeing seven steps in realizing the full potential of a Big Data development effort:
Collect: Data is collected from the data sources and distributed across multiple nodes -- often a grid -- each of which processes a subset of data in parallel.
Process: The system then uses that same high-powered parallelism to perform fast computations against the data on each node. The nodes then "reduce" the resulting data findings into more consumable data sets to be used by either a human being (in the case of analytics) or machine (in the case of large-scale interpretation of results).
Manage: Often the Big Data being processed is heterogeneous, originating from different transactional systems. That data usually needs to be understood, defined, annotated, cleansed and audited for security purposes.
Measure: Companies will often measure the rate at which that data can be integrated with other customer behaviors or records and whether the rate of integration or correction is increasing over time. Business requirements should inform the type of measurement and ongoing tracking.
Consume: The resulting use of the data should fit in with the original requirement for the processing. For instance, if bringing in a few hundred terabytes of social media interactions helps us understand whether and how social media data drives additional product purchases, then we should set up rules for how social media data should be accessed and updated. This is equally important for machine-to-machine data access.
Store: As the "data as a service" trend takes shape, increasingly the data stays in a single location as the programs that access it move around. Whether the data is stored for short-term batch processing or longer-term retention, storage solutions should be deliberately addressed.
Data Governance: Data governance is the business-driven policy-making and oversight of data. As defined, data governance applies to each of the six preceding stages of Big Data delivery. By establishing processes and guiding principles it sanctions behaviors around data. And Big Data needs to be governed according to its intended consumption. Otherwise the risk is disaffection of constituents, not to mention overinvestment.
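To make the Collect and Process steps above more concrete, here is a minimal Python sketch of the pattern they describe: records are spread across several "nodes," each node computes on its own subset in parallel, and the partial results are then reduced into one smaller, more consumable result. Everything in it is an illustrative assumption rather than any particular product's API: the function names, the `event` field, and the use of a local thread pool to stand in for a grid of machines are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def collect(records, node_count):
    """Collect: distribute records round-robin across hypothetical nodes."""
    partitions = [[] for _ in range(node_count)]
    for index, record in enumerate(records):
        partitions[index % node_count].append(record)
    return partitions


def process_partition(partition):
    """Process: each node counts events by type within its own subset."""
    return Counter(record["event"] for record in partition)


def reduce_counts(partial_counts):
    """Reduce: merge the per-node counts into one consumable summary."""
    return reduce(lambda left, right: left + right, partial_counts, Counter())


def run(records, node_count=3):
    partitions = collect(records, node_count)
    # A thread pool stands in for the grid here (an assumption for the
    # sketch); a real deployment would run each partition on its own machine.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(process_partition, partitions))
    return reduce_counts(partials)


if __name__ == "__main__":
    sample = [
        {"event": "click"},
        {"event": "purchase"},
        {"event": "click"},
        {"event": "view"},
        {"event": "click"},
    ]
    print(run(sample, node_count=2))
```

The sketch deliberately stops at Collect and Process. The remaining steps (what the `event` values mean, who may read the summary, how long it is kept) are policy questions rather than code, which is where governance comes in.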
Most staff members charged with researching and acquiring Big Data solutions focus on the Collect and Store steps at the expense of the others. The question is implicit: "How do we gather all these petabytes of data and where do we put 'em all once we have 'em?"
But the processes for defining discrete business requirements for Big Data still elude many IT departments. Business people often see the Big Data trend as just another pretext for IT resume-building with no clear end game. Such an environment of mutual cynicism is the single biggest reason Big Data never gets past the tire-kicking phase.
As IT Business Edge author Lorraine Lawson said in a recent blog post, "The only way to ensure your analysis is sound is to ensure you have a governance program in place for Big Data."
Entrenching data governance processes on behalf of a Big Data effort ensures that:
- Business value and desired outcomes are clear
- Policies for the treatment of key data have been sanctioned
- The right subject matter expertise is applied to the Big Data problem
- Definitions and rules for key data are clear
- There is an escalation process for conflict and questions
- Data management -- the tactical execution of data governance policies -- is deliberate and relevant
- There are decision rights for key issues during development
- Data privacy policies are enforced
In short, data governance means that the application of Big Data is useful and relevant. It's an insurance policy that the right questions are being asked, so we won't squander the immense power of new Big Data technologies that make processing, storage and delivery faster, more cost-effective and more nimble than ever.
Jill Dyché is vice-president of thought leadership, strategic products for SAS. SAS DataFlux Data Management solutions enable business agility and IT efficiency by providing innovative data management technology and services that transform data into a strategic asset. See www.datafluxinsight.com for the latest education on data governance and Big Data best practices.