Ask a dozen CIOs what tops their list of strategic priorities and odds are exceedingly good that "big data" ranks either first or second. One of the greatest challenges, they'll tell you, is finding the talent they need to analyze and wring business value from the ever-increasing volume of complex data flooding their enterprises. What they need, they say, are good data scientists -- and lots of them.
In one of the most frequently cited reports on the topic, the McKinsey Global Institute estimates that there will be a shortfall of 190,000 data scientists in the IT job market by 2018.
But how exactly do you become one of these in-demand big data specialists? Is it a matter of training, certification or both? Is it simply the next logical career step for a traditional business intelligence expert? Is a computer science degree required?
As it turns out, there is no one right answer, at least not at the moment. Instead, it's largely a scramble out there on the big data field.
"Big data is like a kids' soccer game. Everyone is running to the ball, but no one knows exactly what to do with it. It has created a huge competition for people," says Greg Meyers, CIO at Biogen Idec in Weston, Mass.
"It's a very fluid area," agrees Michael Rappa, executive director of the Institute for Advanced Analytics at North Carolina State University. "Depending on what industry you're in or what company you talk to, it's a different reality when you talk about big data."
While a single definition might be elusive, academic, career and business experts agree that there are certain fundamental tasks that all data scientists need to perform and certain skills that are required to perform them well. The main pillars of the discipline are data clustering, data correlation, data classification and anomaly detection.
Or, as Rob Bird, a data scientist and CTO at Red Lambda, a provider of predictive security analytics, puts it, "You make data simpler, find relationships, find the weird stuff, and then make predictions."
Data Science vs. Business Intelligence: What's the Difference?
The terms "data science" and "business intelligence" both come up frequently in connection with big data, but they are very different disciplines. Experts say data science is all about predicting the future, while BI involves producing static reports.
"Traditional BI engineers are effectively reporting information as is, even if they're reporting trends and standard deviations away from the norm," says Andrew Dempsey, director of DVD BI and analytics at Netflix. "They aren't really discovering new nuggets of information. The data is what it is."
But with data science, there's an element of mystery. For example, Netflix looks at historical data "to identify why someone is more or less likely to churn because of their behavior," Dempsey explains. "There's more uncertainty there because on an aggregate level, a lot of people may have similar viewing habits, but on an individual level, everyone is different."
Another key difference between the two disciplines has to do with the data itself.
First, there's the sheer volume of data. "With so much data, you need to assimilate it to look at the exceptions, rather than the reports," says Biogen Idec CIO Greg Meyers. The pharmaceutical manufacturer, he says, continually reviews data from signals throughout the manufacturing process to detect when events are out of tolerance levels. When an anomaly is detected, a different operating procedure is triggered. "It's all about trying to make sure the process of how we manufacture is as controlled as possible," Meyers says. "We've matured our analytics process by looking at data across batches so we look at trends to reduce the variability of certain things."
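The exception-based monitoring Meyers describes can be sketched in a few lines of Python. This is only an illustration of the general idea, not Biogen Idec's actual system: the `Reading` type, the tolerance limits and the sample values are all hypothetical, and the per-batch standard deviation is just one simple way to compare variability across batches.

```python
from dataclasses import dataclass
from statistics import stdev


@dataclass
class Reading:
    """One hypothetical sensor reading taken during manufacturing."""
    batch_id: str
    value: float


def out_of_tolerance(readings, lower, upper):
    """Return the readings whose value falls outside [lower, upper]."""
    return [r for r in readings if not (lower <= r.value <= upper)]


def batch_variability(readings):
    """Return the sample standard deviation of readings for each batch."""
    by_batch = {}
    for r in readings:
        by_batch.setdefault(r.batch_id, []).append(r.value)
    # stdev needs at least two values, so skip single-reading batches.
    return {b: stdev(v) for b, v in by_batch.items() if len(v) >= 2}


# Made-up readings for two batches; limits are illustrative only.
readings = [
    Reading("A", 7.0), Reading("A", 7.1), Reading("A", 8.4),
    Reading("B", 7.0), Reading("B", 7.0), Reading("B", 6.9),
]
flagged = out_of_tolerance(readings, lower=6.5, upper=7.5)
variability = batch_variability(readings)
```

Here `flagged` holds only the exceptions, which is the point of looking at anomalies rather than full reports, while `variability` gives a rough batch-to-batch comparison of the kind used to spot trends.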
Another challenge is dealing with the variability of big data.
Josh Williams, a data scientist at Kontagent, notes that "in classic BI systems, you usually have highly structured data -- things like customer profiles. You come up with an analysis by correlating that data and running regressions on it."
In today's big data environment, in contrast, "you have a mess of complex data and you have no idea how the features you may be looking at -- the input factors -- relate to the output," Williams says. The upshot is that data science is "much more exploratory. It's easier to shoot yourself in the foot. You have to be much more rigorous. It's much more difficult to do the analysis, which is why there is so much more research around machine learning," he adds.
Universities Step Up
The skills required to perform these tasks cut across traditional academic disciplines, including statistics, mathematics and computer science. This is why several schools, including New York University and NC State, offer specialized data scientist certification and degree programs.
"Data used to be something you collected. It had neat rows and columns," explains Rappa. "You ran experiments that were time-consuming, laborious and costly, and you didn't have a lot of data so you dealt with sample sizes."
Now, in contrast, "data comes streaming off of every touch point you have with employees, partners and customers," he says. "Big data is about taking all of that data together and using it to optimize business or inventory levels or to better target customers. That's the trick of the whole thing. You need people who are good at handling large volumes of data and have knowledge of math and statistics to analyze the data."
Recognizing this as early as 2005, NC State created the Institute for Advanced Analytics, which pulls together faculty members from various disciplines and teaches data science "in a very integrated way," Rappa says. Students take technical courses in statistics, finance and business, and they learn communications and teamwork skills, which Rappa says "almost always trump the technical skills," as far as employers are concerned.
Teamwork skills are critical, he says, because "you can't wrap up all of the [data scientist] skills you need in a single person." (See "Stalking the Elusive Data Scientist.") Instead, data scientists typically work in teams. IBM, for example, mixes statisticians with MBAs in its Data Analytics Center of Excellence, which helps businesspeople determine what questions they need data to answer. The center's goal is to generate revenue through a marriage of business savvy and analytics, says CIO Jeanette Horan. One project optimized sales coverage in the 170 countries in which IBM operates, yielding a 10% performance improvement in territories where the models were applied.
The intensive NC State program, which students attend all day, five days a week for 10 months, awards graduates a master of science degree. Rather than completing a final thesis, students work in teams to complete practicum projects with live data from major companies, including GE and GlaxoSmithKline. Seventy percent of the program's students come from the workforce, many of them sponsored by their employers. Most students have at least two years of on-the-job experience, and their average age is 29. The program costs $21,000 for North Carolina residents and $36,000 for everyone else.
At NYU, the newly launched, two-year master of data science degree is also multidisciplinary, intersecting mathematics, computer science and statistics. This is because to do data science well, "you need to have expertise in all three," says Roy Lowrance, managing director of the university's Center for Data Science.
Lowrance emphasizes that data scientists also require what he calls "application knowledge." Without it, "you have no intuition about what to work on and test, especially in business," he explains.
What Lowrance refers to as application knowledge, some other experts describe as domain expertise. But whatever you call it, all agree that it's absolutely essential for data scientists in the business world.
Because data scientists are ultimately charged with showing business value, knowing a particular business is critical "because there's a lot of nuance in each domain," says Josh Williams of Kontagent, a company that identifies customer behavioral insights from social, mobile and Web data in real time.
"A data scientist is someone who is familiar with statistics and classical mathematical analysis, and they need a strong background in programming and computer science or at least the ability to get things done in a programming language," Williams says. "But they also need domain expertise around how to apply different automated analysis algorithms to a given domain."
However, he adds, "data science skills are not necessarily industry-transferrable" because the volume and complexity of data varies from industry to industry. "We're dealing with orders-of-magnitude greater volumes, but the really important part is that the data is much more rich and complex," Williams says.
The optimal place to gain domain expertise is on the job. But for people interested in improving their technical skills, there are options beyond university programs.
"There are a lot of good math and statistics courses online, and many computer science courses online, too," says NYU's Lowrance. Additionally, vendors in the big data market, such as Cloudera, are developing extensive training programs for would-be big data professionals.
Cloudera offers instructor-led training both in classrooms and online. The training is segmented by professional roles, such as developer and analyst, and by application. For example, students might take a course in developing a recommendation system on Cloudera's big data platform.
One of Cloudera's most popular courses is geared to developers, primarily those using Java. "They may write MapReduce applications, taking a Web log, which is very often used because now it can be stored and analyzed," says Sarah Sproehnle, vice president of educational services at Cloudera. "[Then they'll] do a simple analysis, perhaps counting the number of times various IP addresses access their Web pages. From there, they can expand to forming a geographical look-up to see where their geographical Web activity is coming from."
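The "simple analysis" Sproehnle mentions, counting hits per IP address in a Web log, can be sketched in plain Python using the same map-then-reduce shape. This is not Cloudera's course material and does not use Hadoop; the function names and sample log lines are hypothetical, and the sketch assumes each log line begins with the client IP address, as in the Common Log Format.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit an (ip, 1) pair for every non-empty log line."""
    for line in lines:
        fields = line.split()
        if fields:
            # Assumption: the first whitespace-separated field is the client IP.
            yield fields[0], 1


def reduce_phase(pairs):
    """Sum the counts emitted for each IP address."""
    totals = defaultdict(int)
    for ip, count in pairs:
        totals[ip] += count
    return dict(totals)


# Made-up log lines using IP addresses reserved for documentation.
sample_log = [
    '192.0.2.10 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.7 - - [10/Oct/2012:13:56:01 -0700] "GET /about.html HTTP/1.1" 200 1045',
    '192.0.2.10 - - [10/Oct/2012:13:57:12 -0700] "GET /contact.html HTTP/1.1" 200 512',
]
hits_per_ip = reduce_phase(map_phase(sample_log))
```

In a real MapReduce job the framework would distribute the map step across many machines and group the pairs by key before reducing. The geographical look-up described in the quote would then be a further step that maps each IP address in `hits_per_ip` to a location.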
Cloudera reports that it trained 15,000 developers in 2012, and it offers new classes every week, around the globe.
"The audience we're aiming for are not yet calling themselves data scientists," says Sproehnle. "They may be software engineers or statisticians, and they need to be equipped with what it takes to [operate] in this new big-data-driven environment."
The training is built exclusively around Cloudera's big data platform, but it also covers more fundamental big data concepts, such as machine learning, classification and clustering, she says.
The company also offers a certification, which Sproehnle says "is beginning to appear on LinkedIn profiles and job descriptions looking to hire big data professionals."
"In technologies this young and new," she adds, certification "offers a level of comfort that [an applicant] has more to offer than that they read a few pages in a book."