Saturday, April 28, 2012

Are many-to-many associations in MongoDB fast enough?



MongoDB has a lot of potential. It's schema-less, dynamically typed, distributed, replicated, and fast; it has a rich query language, it's broadly supported, and the list goes on.

One thing I found hard to get my mind around is when to denormalize and when to keep things normalized. The major reason to denormalize is that there is no JOIN operation, as there is in SQL. This means it's often faster to store the associated data in the parent document than to join it in on demand.

However, there are circumstances in which it's not desirable to denormalize. For example, if the associated data is not unique to a single document or collection, then denormalization requires duplicating data, and it becomes difficult to keep all the copies in sync.

I recently ran into such a situation and decided to do some benchmarking to see how slow such queries would be. For the purposes of discussion, I have devised the following example layout, which is structurally identical to my original problem but easier to explain.

Suppose you have sets of items, and each item in each set has ratings by various people. Without JOINs, I was worried that it would be slow to query the ratings for all the items of a set. Here I compare two means of querying such an arrangement.

I use the same data in each case, and the end result is the same. There is a collection of items; a collection of sets, where each set is a document containing an array of ObjectIDs of items; and a collection of ratings, where each rating is a document recording the person who gave the rating, the item the rating is for, and the rating itself. The question is how fast various ways of querying all the ratings for the items in a particular set are.

In one case, individually.js, I query the ratings collection once for each item in each set. In the other case, using_in.js, I issue one query per set, passing the set's array of item ids to the $in operator to get all the ratings for those items at once. The latter certainly issues fewer queries, but either way MongoDB has to find the same ratings. Will the approach with fewer (but more complex) queries be faster?
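To make the layout concrete, here is a sketch of what documents in the three collections might look like. The field names mirror the queries below, but the sample values and the plain-number ids (standing in for ObjectIDs) are purely illustrative, not taken from the actual setup script:

```javascript
// Illustrative document shapes -- hypothetical values, not the benchmark data.

// A document in the "items" collection:
var item = {_id: 101, name: "item 101"};

// A document in the "sets" collection, holding an array of item ids:
var set = {_id: 1, items: [101, 102, 103]};

// A document in the "ratings" collection, linking a user to an item:
var rating = {user: 1, item_id: 101, rating: 4};

// A query for this set's ratings must match ratings whose item_id
// appears in set.items:
var belongsToSet = set.items.indexOf(rating.item_id) !== -1;
console.log(belongsToSet); // true
```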

I have put my code into this Gist.

To run the benchmark, run the setup.js script using the Mongo console:

mongo < setup.js

This will create (or reuse) a database named benchmarking.

The critical difference between the two test cases lies in how the ratings for a set are queried. This nested pair of loops queries them one at a time:


// cur is a cursor over the sets collection; sum accumulates the ratings
// Find user 1's ratings for all items in all sets
cur.forEach(function(set) {
  set.items.forEach(function(item_id) {
    // One findOne() round trip per item
    var rating = db.ratings.findOne({user: 1, item_id: item_id});
    sum += rating.rating;
  });
});


However, there is an alternative: issue a single query per set, passing the set's array of item ids to the $in operator to retrieve all of those ratings at once:


// Find user 1's ratings for all items in all sets
cur.forEach(function(set) {
  // One find() per set, matching every item_id in the set's array
  var res = db.ratings.find({user: 1, item_id: {$in: set.items}});
  res.forEach(function(rating) { sum += rating.rating; });
});
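The two strategies are logically equivalent. A small standalone sketch in plain JavaScript (no MongoDB required, with hypothetical in-memory data) shows that both compute the same sum:

```javascript
// In-memory stand-ins for the collections -- hypothetical data,
// not the benchmark's actual contents.
var sets = [{_id: 1, items: [101, 102]}, {_id: 2, items: [103]}];
var ratings = [
  {user: 1, item_id: 101, rating: 3},
  {user: 1, item_id: 102, rating: 5},
  {user: 1, item_id: 103, rating: 4}
];

// Strategy 1: one lookup per item (mirrors findOne per item_id).
var sumIndividually = 0;
sets.forEach(function(set) {
  set.items.forEach(function(item_id) {
    var r = ratings.filter(function(x) {
      return x.user === 1 && x.item_id === item_id;
    })[0];
    sumIndividually += r.rating;
  });
});

// Strategy 2: one batched lookup per set (mirrors $in over set.items).
var sumUsingIn = 0;
sets.forEach(function(set) {
  ratings.filter(function(x) {
    return x.user === 1 && set.items.indexOf(x.item_id) !== -1;
  }).forEach(function(r) { sumUsingIn += r.rating; });
});

console.log(sumIndividually, sumUsingIn); // 12 12
```

The difference, then, is purely in how many round trips and index lookups the server performs, not in what is returned.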


The two benchmark test cases can be run as follows, using the shell's time command to keep track of elapsed runtime:

time mongo < individually.js

And:

time mongo < using_in.js

Benchmarking results:

Individually fetching each rating: 1.76s
Using the $in operator: 0.31s

This is on my rather old Mac Pro.

Thus, for associations like this, $in is clearly the better choice, performing nearly 6x faster (1.76s vs. 0.31s). And even in absolute terms the queries are not slow: each set retrieves its 20 associated ratings in under a millisecond.