MongoDB competes on speed and flexibility
- 09 June, 2011 09:26
While debate rages on over the value of nonrelational, or NoSQL, databases, two case studies presented at a New York conference this week point to the benefits of using the MongoDB non-SQL data store instead of a standard relational database.
Representatives from both The New York Times and social networking service Foursquare, speaking at the MongoNYC conference held Wednesday in New York, explained why they used MongoDB. They praised MongoDB's ability to scale up and ingest lots of data, as well as its ease of reconfiguration.
"SQL databases have grown into these weird monstrosities. They don't really map to the problems you actually have, so you try to work around their warts," said Harry Heymann, the Foursquare engineer who oversees the company's servers, during his presentation. "MongoDB is a practical database for problems that engineers in the real world have. It was developed by people who built large-scale Web apps."
For The New York Times, MongoDB has been "awesome for flexible research and development," said Jake Porway, a New York Times data scientist, in his talk. Porway works for the news organization's research and development group, which looks at ways digital technology can enhance the presentation of news.
Porway also praised MongoDB's ability to ingest large amounts of data. "Mongo eats this data up," he said.
The New York Times used MongoDB, the open-source NoSQL data store developed by 10gen, for its experimental Cascade data visualization tool.
Cascade visually demonstrates how links to New York Times stories get copied by multiple Twitter users, showing how messages get passed from one user to the next. "This was an exploratory tool that helps us understand how people share" information, Porway said.
Cascade depicts the number of people who pass a story link on to others, as well as how long it takes to pass this data around.
The New York Times posts 600 pieces of content every day, often putting links to those pieces of content on Twitter. Links to these stories get rebroadcast across Twitter an average of about 25,000 times a day, Porway said. The Cascade system saves all the Twitter messages, as well as the number of times each story link was forwarded and clicked on. All told, it produces about 100 GB of data each month.
"This allows us to [answer] questions that are really big, like what is the best time of day to tweet? What kinds of tweets get people involved? Is it more important for our automated feeds to tweet, or for our journalists?" Porway said.
The three-dimensional visualizations can show huge spikes in activities, which the user can then dig into to find more details, such as the actual messages.
The three-dimensional visualizations use data that has been collected in MongoDB. One collection stores the actual Twitter messages. Another stores the data on the number of times users clicked on a story link, which is provided by link-shortening service Bit.ly. The data store also ingests user access log files from The New York Times' own servers.
Porway noted that his research group is constantly looking at new ways to analyze the data. He appreciates the fact that it is easy to change database structures in MongoDB. For example, relational databases require that each field be associated with a particular data type, which can slow attempts to repurpose the data for new uses, Porway said. MongoDB does not have this requirement. "We are a research group, so we are constantly changing what we are looking for," Porway said.
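To illustrate the kind of flexibility Porway describes, here is a minimal sketch in which plain Python dictionaries stand in for MongoDB documents. The field names and sample records are hypothetical and are not taken from the Cascade system. The sketch only shows that records in one collection need not share a fixed set of fields or field types.

```python
# Illustrative only: plain Python dicts stand in for MongoDB documents.
# All field names and values are hypothetical, not from the Cascade system.
from collections import Counter

tweets = [
    {"user": "alice", "text": "Great story", "link": "http://example.com/a"},
    # A later record adds new fields without any schema migration.
    {"user": "bob", "text": "RT Great story", "link": "http://example.com/a",
     "retweet_of": "alice", "followers": 120},
    # The same field can hold a different type in another record.
    {"user": "carol", "text": "Worth a read", "link": "http://example.com/b",
     "followers": "unknown"},
]


def shares_per_link(docs):
    """Count how many records mention each link, ignoring other fields."""
    return Counter(doc["link"] for doc in docs)


def field_types(docs, field):
    """Return the set of Python type names a field takes where it is present."""
    return {type(doc[field]).__name__ for doc in docs if field in doc}


counts = shares_per_link(tweets)
```

An analysis written against the original fields, such as `shares_per_link`, keeps working as new fields appear, which is roughly the property a research group that keeps changing its questions would care about.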
Speed was another factor in using MongoDB. MongoDB has a distributed architecture, so it can easily scale a data store out across multiple servers. "We're pulling data from a fair number of different sources, so we need someplace where we can really dump data quickly," Porway said.
In the case of Foursquare, MongoDB is now saving all the data generated by the service's users. Formerly, the company used PostgreSQL, but it is in the process of migrating its data off that relational database.
Foursquare is a location-based social network. As users travel about, they can post, or "check in," that they are at a certain location, such as a restaurant. It's designed to help people discover acquaintances who are nearby. Eventually, it will evolve into a city guide, one that can offer recommendations of nearby retail establishments, Heymann said.
Foursquare has 9 million users, who make 3 million "check-ins" per day. So far, the service has amassed about 750 million check-ins across 4 million places. Overall, Foursquare has 2.3 billion records and gets about 15,000 queries per second.
The biggest reason for switching to MongoDB, Heymann said, was its auto-sharding, or the ability to split a database across different servers. At first, Foursquare kept all its data on a single machine. Eventually, the collection got so big that two machines were needed. Now the service runs across 40 virtual machines, organized into eight clusters, on Amazon's Elastic Compute Cloud (EC2). Heymann noted that he could have written an automatic sharding feature for PostgreSQL but that would have required a lot of work. It was simpler to take advantage of the capability already embedded in MongoDB.
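The basic idea of splitting a data set across servers can be sketched in a few lines. The example below uses simple hash-based routing over in-memory dictionaries, which is a deliberate simplification and not how MongoDB implements auto-sharding (MongoDB splits and migrates ranges of a shard key between servers). The class name, key format, and record contents are hypothetical, and the shard count of eight is chosen only to echo the eight clusters mentioned above.

```python
# Illustrative only: hash-based routing over in-memory dicts.
# This is a simplification, not MongoDB's auto-sharding mechanism.
import hashlib


class ShardedStore:
    """Route each key to one of several shards by hashing the key."""

    def __init__(self, num_shards):
        # Each dict stands in for a separate server.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, record):
        self.shards[self._shard_for(key)][key] = record

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(num_shards=8)
for i in range(1000):
    store.put(f"user{i}", {"checkins": i})
```

The caller reads and writes through `put` and `get` without knowing which shard holds a record. Hiding that routing, and rebalancing data as servers are added, is the work Heymann said he preferred not to write himself.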
MongoDB also has some other features that Foursquare found handy. One is that MongoDB makes the data accessible in a manner that is more easily understandable for object-oriented programmers, when compared to the syntax required by SQL. SQL "is not the way most programmers think these days," Heymann said.
Another good feature is automatic failover, so that when a node fails for some reason, operations are redirected to the backup node. "This is something we could have done with SQL databases, but again, it is something we didn't have to do" with MongoDB, Heymann said.
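The redirect-on-failure behavior Heymann describes can be sketched as a client that tries a primary node and falls back to a backup. This is a toy model under stated assumptions: the `Node` and `FailoverClient` classes are hypothetical, writes are copied to every live node, and there is no election or resynchronization of a recovered node. MongoDB's replica sets handle failover through elections among members, which this sketch does not model.

```python
# Illustrative only: a toy primary/backup pair with client-side failover.
# Node and FailoverClient are hypothetical names, not MongoDB APIs.


class Node:
    """An in-memory stand-in for a database server that can go down."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}

    def write(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)


class FailoverClient:
    """Send operations to the first live node, in priority order."""

    def __init__(self, primary, backup):
        self.nodes = [primary, backup]

    def write(self, key, value):
        # Copy the write to every live node so the backup stays current.
        written = False
        for node in self.nodes:
            try:
                node.write(key, value)
                written = True
            except ConnectionError:
                continue
        if not written:
            raise ConnectionError("no node available")

    def read(self, key):
        # Fall through to the backup when the primary is unreachable.
        for node in self.nodes:
            try:
                return node.read(key)
            except ConnectionError:
                continue
        raise ConnectionError("no node available")


primary, backup = Node("primary"), Node("backup")
client = FailoverClient(primary, backup)
client.write("venue:1", "coffee shop")
primary.alive = False  # simulate a node failure
result = client.read("venue:1")
```

After the simulated failure, the read is served by the backup without any change to the calling code.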