Today's large-scale data analysis may be a high-tech undertaking, but smart data scientists can improve their craft by observing how simple low-tech picture puzzles are solved, said an IBM scientist at the GigaOm conference.
Watching how people put together picture puzzles can reveal "a lot of profound effects that we could bring to big data" analysis, said Jeff Jonas, IBM's chief scientist for entity analytics, speaking Wednesday in one of the more whimsical presentations at GigaOm's Structure Data conference in New York.
Data analysis is becoming an increasingly important component of many businesses. IDC estimates enterprises will spend more than US$120 billion by 2015 on analysis systems. IBM estimates that it will reap $16 billion in business analytics revenue by 2015.
But getting useful results from such systems requires careful planning.
In a series of informal experiments, Jonas observed how small groups of friends and family worked together to assemble picture puzzles, the kind made up of thousands of separate pieces that fit together to form a picture.
"My girlfriend sees her son and three cousins, I see four parallel processor pipelines," he said. To make the challenge a bit harder, he removed some of the puzzle pieces, and, obtaining a second copy of some puzzles, added duplicate pieces.
Puzzles are about assembling small bits of discrete data into larger pictures. In many ways, this is the goal of data analysis as well, namely finding ways of assembling data such that it reveals a bigger pattern.
A lot of organizations make the mistake of practicing "pixel analytics," Jonas said, in which they try to gather too much information from a single data point. The problem is that if too much analysis is done too soon, "you don't have enough context" to make sense of the data, he said.
Context, Jonas explained, means looking at what is around the bit of data, in addition to the data itself. By doing too much stripping and filtering of seemingly useless data, one can lose valuable context. When you see the word "bat," you look at the surrounding data to see what kind of bat it is, be it a baseball bat, a bat of the eyelids or a nocturnal creature, he said.
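The "bat" example can be sketched in a few lines of Python. This is only an illustration of the idea, not anything Jonas showed: the sense names come from his example, but the cue-word lists and the `disambiguate` function are made up for the purpose.

```python
# Hypothetical cue words for each sense of "bat" (invented for illustration).
SENSES = {
    "baseball bat": {"baseball", "swing", "pitch", "hit", "ball"},
    "bat of the eyelids": {"eye", "eyelid", "eyelids", "lash", "flutter"},
    "nocturnal creature": {"cave", "wings", "night", "fly", "nocturnal"},
}


def disambiguate(sentence: str) -> str:
    """Pick the sense whose cue words overlap most with the surrounding words."""
    words = {w.strip(".,!?\"'").lower() for w in sentence.split()}
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    # With no overlapping context there is nothing to base a decision on.
    return best if scores[best] > 0 else "unknown"


print(disambiguate("The bat flew out of the cave at night"))
print(disambiguate("bat"))
```

Given the word alone, the sketch can only answer "unknown"; the surrounding words are what make a decision possible, which is the point about stripping away context too early.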
"Low-quality data can be your friend. You'll be glad you didn't over-clean it," Jonas said. Google, for instance, reaps the benefits of this approach. Sloppy typers will often get a "did you mean this?" suggestion after entering into the search engine a misspelled word. Google provides results to what it surmises is the correct word. Google guesses the correct word using a backlog of incorrectly typed queries.
With puzzles, users first concentrate on assembling one piece with another. Over time, they create small clumps of data, which they can then figure out how to connect to finish the puzzle. The edges and the corners are assembled fairly quickly. What in effect happens is that, as progress on the puzzle proceeds, "you are making faster quality decisions than before," Jonas said. "The computational costs to figure out where a piece goes declines."
Watching his teams put together the faulty puzzles, he noticed a number of interesting traits. An obvious one is that the larger the puzzle, the more time it takes to complete. "As the working space expands the computational effort increases," he said. Ambiguity also increases computational complexity. Puzzle pieces that had the same colors and shapes were harder to fit together than those with distinct details.
"Excessive ambiguity really drives up the computational cost," Jonas said.
Jonas was also impressed with how little information someone needed to get an idea of the image that the puzzle held. After assembling only four pieces, one of his teams was able to guess that its puzzle depicted a Las Vegas vista. "That is not a lot of fidelity to figure that out," he said. Having only about 50 percent of the puzzle pieces fitted together provided enough detail to show the outline of the entire puzzle image. This is good news for organizations unable to capture all the data they are studying -- even a statistical sampling might be enough to provide the big picture, so to speak.
"When you have less than half the observation space, you can make a fairly good claim about what you are seeing," Jonas said.
Also, studying how his teams finished the puzzles gave Jonas a new appreciation for batch processing, he said.
The key to analysis is a mixture of streaming and batch processing. The Apache Hadoop data framework is designed for batch processing, in which a lot of data in a static file is analyzed. This is different from stream processing, in which a continually updated string of data is observed. "Until this project, I didn't know the importance of the little batch jobs," he said.
Batch processing is a bit like "deep reflection," Jonas said. "This is no different than staying at home on the couch mulling what you already know," he said. Instead of just staring at each puzzle piece, participants would try to understand what the puzzle depicted, or how larger chunks of assembled pieces could possibly fit together.
For organizations, the lesson should be clear, Jonas explained. They should analyze data as it comes across the wire, but such analysis should be informed by the results generated by deeper batch processes, he said.
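One way to picture that arrangement is the minimal Python sketch below. It is an assumption-laden illustration rather than a description of any IBM or Hadoop system: the `Analyzer` class, its `batch_every` parameter and the "familiar"/"new" verdicts are all invented here.

```python
from collections import Counter


class Analyzer:
    """Judges each incoming event using context built by periodic batch passes."""

    def __init__(self, batch_every=2):
        self.history = []          # raw events, kept rather than filtered away
        self.baseline = Counter()  # context produced by the most recent batch pass
        self.batch_every = batch_every  # hypothetical tuning knob

    def run_batch(self):
        # "Deep reflection": recount everything seen so far.
        self.baseline = Counter(self.history)

    def observe(self, event):
        # Streaming step: decide immediately, using only the last batch result.
        verdict = "familiar" if self.baseline[event] > 0 else "new"
        self.history.append(event)
        if len(self.history) % self.batch_every == 0:
            self.run_batch()
        return verdict


analyzer = Analyzer(batch_every=2)
print([analyzer.observe(e) for e in ["login", "login", "login", "logout", "logout"]])
```

The streaming step never rescans the history itself; it only consults what the last batch pass concluded, so its decisions improve each time the batch job runs.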
Jonas' talk, while seemingly irreverent, illustrated many important lessons of data analysis, said Seth Grimes, an industry analyst focusing on text and content analytics who attended the talk. Among the lessons: Data is important, context accumulates, and real-time streams of data should be augmented with deeper analysis.
"These are great lessons, communicated really effectively," Grimes said.