I’m currently reading the early access version of MongoDB in Action, so I felt like playing around with it a bit (in fact I do use it for Happynerds.net, but that’s not very exciting). I then remembered that I recently stumbled upon a service called The Exporter, which I had used to download all my tweets from Twitter (as of October 4th 2011). So why not use MongoDB to find out which users I retweet most often?
Since the data provided by The Exporter is already in JSON format, it should be really easy to import it using the mongoimport tool.
Alas, this just threw a lot of errors at me and said that no records got imported. A quick look at the relevant documentation shows that mongoimport expects one JSON object per line, so I had to remove the opening and closing brackets on the first and last line respectively, as well as the commas separating the lines (s/,$//). Once I did this, the import worked as expected and I had several thousand tweets in the collection tweets in the database twitter.
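The manual cleanup described above can also be scripted. Here is a minimal sketch in JavaScript (Node), assuming the export holds one tweet object per line inside a top-level array; the function name `toLineDelimited` and the sample data are made up for illustration. It works on the text itself instead of round-tripping through `JSON.parse`, since large numeric tweet ids may not survive conversion to a JavaScript number unchanged.

```javascript
// Sketch: turn "[", one object per line with trailing commas, "]"
// into one JSON object per line, without parsing the objects.
function toLineDelimited(text) {
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length === 0) {
    return "";
  }

  // Remove the opening bracket from the first line and the closing
  // bracket from the last line (whether or not they sit on their own line).
  lines[0] = lines[0].replace(/^\[/, "");
  const last = lines.length - 1;
  lines[last] = lines[last].replace(/\]$/, "");

  return lines
    .map((line) => line.replace(/,$/, "").trim()) // same idea as s/,$//
    .filter((line) => line.length > 0)
    .map((line) => line + "\n")
    .join("");
}

// Made-up sample input, not real export data.
const sample =
  '[\n{"id": 121212121212121212, "text": "hello, world"},\n' +
  '{"id": 2, "text": "RT @someone: hi"}\n]\n';
const lineDelimited = toLineDelimited(sample);
```

Writing the returned string to a file gives mongoimport the one-object-per-line input it expects. mongoimport also offers a `--jsonArray` option for array input, which may make the conversion unnecessary.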
Since I’ll most likely do similar imports in the future, I added a unique index on the field named id (note that this is different from Mongo’s built-in _id).
I also added a sparse index on in_reply_to_screen_name, the most important field for what I was about to do.
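As a rough sketch, the two indexes could be created from the mongo shell like this. The function name `addTweetIndexes` is hypothetical, and `db` is passed in as a parameter only to keep the snippet self-contained; in the shell it is already defined. Shells from around 2011 called the method `ensureIndex`, while current ones use `createIndex`.

```javascript
// `db` is assumed to be the mongo shell handle for the "twitter" database.
function addTweetIndexes(db) {
  // Unique index on Twitter's own tweet id (not Mongo's _id): a second
  // document with the same id is rejected on later imports.
  db.tweets.createIndex({ id: 1 }, { unique: true });

  // Sparse index: only documents that contain the field get an entry.
  db.tweets.createIndex({ in_reply_to_screen_name: 1 }, { sparse: true });
}
```

In the shell itself, calling `addTweetIndexes(db)` (or simply typing the two `createIndex` lines) is all that's needed.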
For getting the actual data I wanted, MongoDB’s built-in map/reduce functionality seemed like a perfect fit. The mapping function is short, and the reduce step is equally simple.
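To give an idea of what such a pair of functions can look like, here is a sketch that counts tweets per screen name. It assumes the key is in_reply_to_screen_name and that the counter lives in a `count` field (which matches sorting on value.count later); these are not necessarily the exact functions used for this post. `simulateMapReduce` and the module-level `emit` are stand-ins invented so the functions can be tried in plain Node; in MongoDB the server provides `emit` and does the grouping.

```javascript
// Set by the simulator below; inside MongoDB, emit() is provided for you.
let emit;

// Map: runs once per tweet, with `this` bound to the document.
const map = function () {
  emit(this.in_reply_to_screen_name, { count: 1 });
};

// Reduce: sums the partial counts for one screen name. It returns the same
// shape it receives, because MongoDB may feed reduce its own output again.
const reduce = function (key, values) {
  const result = { count: 0 };
  values.forEach(function (value) {
    result.count += value.count;
  });
  return result;
};

// Minimal in-memory stand-in for mapReduce (illustration only).
function simulateMapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  emit = (key, value) => {
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(value);
  };
  docs.forEach((doc) => mapFn.call(doc));
  // Like MongoDB, skip reduce for keys that only have a single value.
  return Array.from(groups, ([key, values]) => ({
    _id: key,
    value: values.length === 1 ? values[0] : reduceFn(key, values),
  }));
}

// Made-up sample documents.
const sampleTweets = [
  { in_reply_to_screen_name: "alice" },
  { in_reply_to_screen_name: "bob" },
  { in_reply_to_screen_name: "alice" },
];
const counts = simulateMapReduce(sampleTweets, map, reduce);
```

The resulting documents have the `{ _id, value: { count } }` shape that map/reduce output collections use, which is why the later query sorts on value.count.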
To make the actual mapReduce invocation more readable, I decided to also save the query in a separate object.
With all the pieces in place, I could run mapReduce to put the result in a new collection named top_retweets.
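A sketch of how the query object and the invocation might fit together follows. The filter shown is an assumption (only tweets where in_reply_to_screen_name is set), and `buildTopRetweets` is a hypothetical wrapper that takes `db` and the two functions as parameters so the snippet stands on its own.

```javascript
// Assumed filter: skip tweets where the field is missing or null.
const query = { in_reply_to_screen_name: { $ne: null } };

// `db` is the mongo shell's database handle; `mapFn` and `reduceFn` are
// the map and reduce functions for the job.
function buildTopRetweets(db, mapFn, reduceFn) {
  return db.tweets.mapReduce(mapFn, reduceFn, {
    query: query,        // restrict which tweets are fed to map
    out: "top_retweets", // store the result in this collection
  });
}
```

Keeping the filter in its own variable leaves the `mapReduce` call short enough to read at a glance, which is the point of saving the query separately.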
Querying the freshly created collection, sorting it in reverse order of value.count and limiting the result set to 10 elements gave me the top 10 users I retweet (note that this data is most likely not entirely accurate, since The Exporter only got me around 3.2k of the roughly 5k tweets I had at that time).
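The query just described can be sketched as a find, a descending sort and a limit. `topRetweeted` is a hypothetical helper name, and `db` is again a parameter only for the sake of a self-contained snippet.

```javascript
// Returns a cursor over the `howMany` entries with the highest count.
function topRetweeted(db, howMany) {
  return db.top_retweets
    .find()
    .sort({ "value.count": -1 }) // -1 means descending
    .limit(howMany);
}
```

In the shell, `topRetweeted(db, 10)` would print the ten documents with the largest value.count.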
Hm, this is a collection I’ll most likely query more in the future, so how did the above query perform? Not so well, I’m afraid.
The most important thing to note in the explain output is the difference between the number of documents scanned (nscanned, 192) and the actual number returned (n, 10). Also, the query had to order the result for us (indicated by "scanAndOrder" : true). Fortunately that’s nothing another index can’t fix, this time a reverse one on value.count.
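As a sketch of that fix: create a descending index on value.count, then ask for the query plan again. `indexAndExplain` is a hypothetical helper, `db` is a parameter for self-containment, and the field names in the explain output differ between server versions, so nscanned and scanAndOrder may not appear under those names on newer servers.

```javascript
function indexAndExplain(db) {
  // Descending index, so the sorted top-N can be read straight off the index.
  db.top_retweets.createIndex({ "value.count": -1 });

  // Same query as before, but returning the plan instead of the documents.
  return db.top_retweets
    .find()
    .sort({ "value.count": -1 })
    .limit(10)
    .explain();
}
```

With the index in place, the plan should no longer need a separate ordering step, and the number of scanned entries should be much closer to the number returned.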
All in all, this was a pretty fun little exercise that shows how easy it is to go from JSON to queryable data with MongoDB. I’ll probably do more of this in the future.