Wednesday, December 14, 2011

MapReduce: Simplified Data Processing on Large Clusters

This paper introduces MapReduce, a programming model for processing large data sets on clusters of commodity machines. The model is a very natural fit for data-intensive workloads and scales remarkably well. Because the programmer only has to supply a map function and a reduce function, a wide range of programs fit the model, and developers with little distributed-systems experience can still use the system effectively.
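To make the model concrete, here is a minimal single-process sketch of the canonical word-count example in Python. The map_fn/reduce_fn/run_job names and the in-memory driver are my own illustration of the idea, not the actual MapReduce API; the real framework runs the same two functions across thousands of machines.

    from collections import defaultdict

    def map_fn(doc_name, contents):
        # Emit an intermediate (word, 1) pair for every word in the document.
        for word in contents.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Sum all the counts emitted for the same word.
        yield word, sum(counts)

    def run_job(documents):
        # Toy single-process driver. The grouping of intermediate pairs by key
        # is the "shuffle" that the real framework performs across the cluster.
        intermediate = defaultdict(list)
        for name, contents in documents.items():
            for key, value in map_fn(name, contents):
                intermediate[key].append(value)
        results = {}
        for key, values in intermediate.items():
            for out_key, out_value in reduce_fn(key, values):
                results[out_key] = out_value
        return results

    print(run_job({"doc1": "the cat sat", "doc2": "the dog sat"}))
    # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}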

There are some interesting implementation techniques that MapReduce uses to improve performance. The two I liked most are locality and backup execution of "straggler" tasks. Clusters that run MapReduce try to schedule each task on (or near) the node that already holds its input data. This ensures that minimal time is spent waiting for data to travel over the network. MapReduce also launches backup copies of any tasks that are taking much longer than the rest. Often the entire job is nearly finished but stuck waiting on a few slow tasks to complete; by running duplicate copies of these stragglers and taking whichever finishes first, they were able to cut total job runtime substantially.
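As a rough sketch of the straggler idea (my own simplification, not Google's actual scheduler): once most of a job's tasks have finished, launch duplicate copies of whatever is still in progress.

    # Hypothetical straggler handling: when the job is close to done,
    # duplicate every still-running task on another worker and accept
    # the result of whichever copy finishes first.

    def schedule_backups(tasks, launch_copy, completion_threshold=0.95):
        # tasks: dict mapping task_id -> "done" or "running" (assumed shape).
        done = sum(1 for state in tasks.values() if state == "done")
        if done / len(tasks) >= completion_threshold:
            for task_id, state in tasks.items():
                if state == "running":
                    launch_copy(task_id)  # re-run the straggler elsewhere

    # Example: 19 of 20 tasks are done, so the lone straggler gets a backup.
    tasks = {i: "done" for i in range(19)}
    tasks[19] = "running"
    schedule_backups(tasks, launch_copy=lambda tid: print(f"backup launched for task {tid}"))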

Obviously, MapReduce jobs aren't meant to serve online requests; they should be used for offline batch analysis. The system completes jobs with high throughput and has been remarkably successful: Google uses it to build its web index. So it has clearly had a great impact on the field and will continue to do so. However, there has recently been a trend of trying to turn every long-latency computation into a MapReduce job. The simplicity of the model does allow such transformations, but piping the output of one MapReduce job into another, and that into another, and so on, isn't going to get great performance, since each stage writes its full output to disk before the next stage can start.
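As a toy illustration of such chaining (reusing the hypothetical run_job driver sketched above), the first job's output is fully materialized and then fed to a second job that builds a histogram of word frequencies; on a real cluster, each hand-off between jobs is a full write to, and read from, the distributed file system.

    # Job 1: count words.
    stage1 = run_job({"doc1": "the cat sat", "doc2": "the dog sat"})
    # Job 2: count how many words occur with each frequency.
    stage2 = run_job({"counts": " ".join(str(v) for v in stage1.values())})
    print(stage2)  # {'2': 2, '1': 2}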

There has also been strong opposition to MapReduce from the databases community. They ask why one would want such a model when RDBMSs already provide everything MapReduce provides and more, even though RDBMSs don't scale as well as MapReduce does. The community's position is that MapReduce basically does the same thing they've done before. I'm personally of the opinion that MapReduce and similar projects are all necessary stepping stones toward building distributed RDBMSs that scale on the cloud.
