The other day on Twitter I complained about my own laziness when it comes to reading academic papers, and Tom Crinson chimed in to let me know that he’d also like to read more of them. That’s how we decided to set up a sort of online reading group for this type of thing, where we’ll choose a paper to read, discuss it via email and then blog about it (Tom’s post is here).
The first in this series of blog posts is about Google’s MapReduce framework:
MapReduce: Simplified Data Processing on Large Clusters
Summary (by Tom Crinson)
This is a paper on an architectural software paradigm and framework that Google uses to process or generate vast amounts of data on commodity hardware.
The basic elements are that each task is split into two stages:
- Mapping – a user defined function is applied to each input record, producing output in an intermediate key/value format.
- Reducing – a second stage of processing takes the intermediate output from the mapping stage, grouped by key, and applies user defined aggregation functions to generate the desired output.
The paper indicates that many different types of tasks can be split up in this way. Obvious ones for Google are processing pages, indexes and links etc., but people are also using the framework for artificial intelligence tasks. This variety proves the flexibility of the paradigm.
The idea itself is pretty simple; we do it a lot in day to day programming: get a set of data, perform some operation on all of the elements and then apply some kind of aggregation (a sum, for example). Google just took it one step further, on a huge scale. They talk about setting up clusters of thousands of machines to do the processing. This is where their framework comes in.
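To make that concrete, here is the everyday version of the pattern in Ruby (a toy "map then aggregate" sketch, nothing distributed about it; the numbers are made up for the example):

```ruby
numbers = [1, 2, 3, 4, 5]

# "Map" step: apply an operation to every element
squares = numbers.map { |n| n * n }

# "Reduce" step: aggregate all the mapped values into one result
total = squares.inject(0) { |sum, n| sum + n }

total #=> 55
```

MapReduce is essentially this, with the map and reduce halves farmed out to different machines.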
The framework takes all of the pain out of distributing the tasks, setting workers onto individual jobs and handling input and output for each. It also takes care of the nitty gritty of fault tolerance: if a worker dies, the master process notices and assigns another worker to re-execute the dead worker’s tasks (including its completed map tasks, since their output lives on that machine’s local disk). The master also deals with de-duplicating results from tasks that end up being executed more than once, so your output is as it should be. It also has a pretty nice feature to handle “stragglers” (slow tasks): near the end of a run the framework schedules backup executions of the remaining in-progress jobs, just in case a slow or faulty machine is holding the whole process up. This was shown to have a huge effect on completion time: in the paper’s sort benchmark, the run took 44% longer with the backup processing disabled.
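The straggler trick can be mimicked in a few lines of Ruby (a hypothetical toy, not Google’s actual scheduler; `run_with_backup` is invented here just to show the “first finisher wins” idea using threads):

```ruby
# Run the same task on a primary and a backup "worker" (thread here),
# and take whichever result arrives first.
def run_with_backup(task)
  results = Queue.new
  workers = [
    Thread.new { results << task.call },             # primary execution
    Thread.new { sleep 0.01; results << task.call }  # backup, started a little later
  ]
  value = results.pop    # blocks until the first finisher pushes a result
  workers.each(&:kill)   # the slower duplicate is simply discarded
  value
end

# A task that is sometimes slow, as if running on a struggling machine
straggler = -> { sleep(rand * 0.05); :done }
run_with_backup(straggler) #=> :done
```

Duplicated work is wasted on purpose: the paper’s point is that a few percent of redundant CPU buys a large reduction in tail latency.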
In case of master failure the whole process stops, but as it goes along the master has been writing “checkpoints”, so after a crash a new master can be started up and continue on from the last checkpoint. This is a manual process though, as the user should probably check whether something in the input caused the failure.
Another very handy feature is bad record detection. If there is a spurious data element that repeatedly causes a worker to crash, the master is notified, and on the next attempt to process that record set the bad record is skipped.
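The skipping behaviour can be sketched in a toy, single-process form (hypothetical; `process_all` and its retry bookkeeping are my own invention for illustration, not the paper’s API):

```ruby
# Remember which records have crashed the "worker" before,
# and skip them on the next attempt over the same split.
def process_all(records, &task)
  failed  = []
  current = nil
  begin
    records.each_with_index.map { |record, i|
      current = i
      failed.include?(i) ? nil : task.call(record)
    }.compact
  rescue StandardError
    failed << current   # this record crashed us; mark it as bad
    retry               # re-run the whole split, minus the bad record
  end
end

data = [1, 2, "oops", 4]   # the string will blow up integer arithmetic
process_all(data) { |n| n + 1 } #=> [2, 3, 5]
```

In the real system this bookkeeping lives in the master, which tells re-executed workers which records to skip.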
There are some other key elements to the system, like partitioning the intermediate data so it can be assigned to the different reduce workers. The default is a hash function on the key (hash(key) mod R), but a user-specified partitioning function can be supplied, since plain modulo hashing is quite simplistic; see any article on consistent hashing for alternatives and for why the naive approach can bite you. The input data tends to consist of just input files, however an API is exposed to allow custom data providers, which is a nice feature.
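The default partitioning scheme is easy to sketch in Ruby (illustrative only; the `partition` method and the value of R are made up for the example):

```ruby
R = 4  # number of reduce workers, just an example value

# Route each intermediate key to one of R reducers: hash(key) mod R
def partition(key, reducers = R)
  key.hash % reducers
end

# Every occurrence of a key lands on the same reducer within a run,
# so all values for that key end up aggregated together.
partition("apple") == partition("apple") #=> true
```

Note the weakness the consistent-hashing literature addresses: change R and nearly every key gets reassigned to a different worker.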
For me the key takeaway was the following sentence: “[it] allows programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily”, which basically means that MapReduce takes an inherently complex task and abstracts all the difficult bits away, so that programmers can focus solely on the data manipulation/transformation aspects.
Difficulty of the paper
Very easy. It’s a simple textual description of MapReduce, including good examples. There’s basically no math, and the language is easy to understand and not full of computer science jargon, so even less technically oriented people should be able to understand most of it. The presentation also serves as an excellent summary.
Examples and further reading, food for thought
I hacked together a little something in IRB to illustrate the basic workings of the algorithm (it’s not perfect, but good enough for our purposes):
```ruby
# A toy word count, the canonical MapReduce example
input = ["the quick brown fox", "the lazy dog", "the end"]

# Map phase: emit a [word, 1] pair for every word
pairs = input.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle: group the intermediate pairs by key
grouped = pairs.group_by { |word, _| word }

# Reduce phase: sum the counts for each key
counts = grouped.map { |word, ps| [word, ps.map(&:last).inject(:+)] }

counts #=> [["the", 3], ["quick", 1], ["brown", 1], ["fox", 1], ["lazy", 1], ["dog", 1], ["end", 1]]
```