Wednesday, December 14, 2011

HIVE: Data Warehousing & Analytics on Hadoop and Pig Latin: A Not-So-Foreign Language for Data Processing

Oftentimes, a programmer has to run a bunch of MapReduce jobs before getting the final output they're looking for. Keeping track of the relationships between these MapReduce jobs and making sure the correct data is piped between them can be difficult. HIVE and Pig both concentrate on providing a simple language layer on top of MapReduce to make the job easier for programmers.

HIVE looks a lot like SQL. It's meant to serve as a layer for those who are familiar with databases. Queries are compiled down to MapReduce jobs and run over data stored in HDFS. Pig, by contrast, is a procedural language with support for UDFs.
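
To make the comparison concrete, here is a rough Python sketch of my own (not actual HiveQL or Pig Latin, and not code from either paper) of the kind of load/filter/group pipeline these layers let a programmer write as a few high-level statements, which then get compiled into one or more MapReduce jobs behind the scenes. The record schema and site names are made up for illustration.

```python
# A rough sketch, not real HIVE or Pig syntax: a load/filter/group pipeline
# written as plain functions to show the shape of what these layers express.

from collections import Counter


def filter_by(records, predicate):
    # Stand-in for a FILTER ... BY step.
    return (r for r in records if predicate(r))


def group_and_count(records, key):
    # Stand-in for a GROUP ... BY step followed by a COUNT.
    return Counter(r[key] for r in records)


# Stand-in for a LOAD step; in HIVE or Pig the data would come from HDFS.
visits = [
    {"user": "alice", "url": "news.example.com/a"},
    {"user": "bob", "url": "shop.example.com/b"},
    {"user": "alice", "url": "news.example.com/c"},
]

# Visits per user, restricted to one site: one declarative pipeline for the
# programmer, one or more MapReduce stages for the system underneath.
news_visits = filter_by(visits, lambda r: "news.example.com" in r["url"])
print(group_and_count(news_visits, "user"))  # Counter({'alice': 2})
```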

It seems both layers are missing some operations due to complexity. However, they both seem to do a good job of supporting the majority of common operations. I do like the web interface that HIVE has for MapReduce jobs, and the debugging support in Pig seems promising. Overall, these layers on top of MapReduce are important because they lower the barrier of entry for anyone who wishes to use the cloud for any sort of computing.

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Dryad can be seen basically as an extension of MapReduce. It models a computation as a DAG whose vertices do the work and whose edges carry the data flowing between them. The authors show MapReduce as one possible DAG, and demonstrate that the framework generalizes to many more. In addition, Dryad adds more methods of communication than MapReduce, which relied on files to pipe one stage into another. In Dryad, vertices may communicate using files, sockets, pipes, and even shared memory.
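
Here is a toy Python sketch of that idea (my own, not Dryad's actual API): a job is a DAG of vertices, and each edge is a channel whose transport is chosen when the graph is built. The vertex names and channel labels are made up for illustration.

```python
# A toy sketch, not Dryad's API: a job graph with typed channels, showing
# MapReduce as one special-case DAG and a generalization of it.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Vertex:
    name: str


@dataclass
class JobGraph:
    vertices: List[Vertex] = field(default_factory=list)
    # Each edge is (source, destination, channel type).
    edges: List[Tuple[Vertex, Vertex, str]] = field(default_factory=list)

    def add_vertex(self, name: str) -> Vertex:
        v = Vertex(name)
        self.vertices.append(v)
        return v

    def connect(self, src: Vertex, dst: Vertex, channel: str = "file") -> None:
        self.edges.append((src, dst, channel))


# MapReduce is one possible DAG: mappers feeding reducers through files.
g = JobGraph()
mappers = [g.add_vertex(f"map{i}") for i in range(3)]
reducers = [g.add_vertex(f"reduce{i}") for i in range(2)]
for m in mappers:
    for r in reducers:
        g.connect(m, r, channel="file")

# A Dryad-style graph is free to add more stages and faster channels,
# e.g. a final aggregation stage fed over shared memory instead of disk.
agg = g.add_vertex("aggregate")
for r in reducers:
    g.connect(r, agg, channel="shared_memory")
```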

Dryad does solve some of the problems I had with MapReduce. It allows for efficient piping from one MapReduce-style job to another without having to go to disk each time. However, this comes at the cost of complexity. By exposing the programmer to all these knobs, the system may leave them feeling overwhelmed. Overall, I agree with the sentiment that expert programmers should use Dryad to build simpler platforms which other programmers can then use.

Dryad has not had as much success as MapReduce. I believe this is because of all the extra complexity it brings to the table. MapReduce remains simpler in people's minds, and so it is the go-to solution for a lot of the problems they face. Perhaps Dryad would fare better by masquerading as a MapReduce platform with extra knobs that expert programmers can use to build interesting platforms.

MapReduce: Simplified Data Processing on Large Clusters

This paper introduces MapReduce, a new programming model. The model provides a very natural way to work with data-intensive loads and scales amazingly well. The simplicity of the model allows lots of different programs to be expressed in it, and even amateur developers should be able to use the system just fine.
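
The canonical example from the paper is counting word occurrences. Below is a minimal in-memory Python sketch of that example (the paper's interface is C++, and the real library handles partitioning, shuffling, and cluster execution; the user only supplies the two functions).

```python
# A minimal in-memory sketch of the paper's word-count example; the shuffle
# and the cluster machinery are collapsed into a single local function.

from collections import defaultdict


def map_fn(document):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]


def reduce_fn(word, counts):
    # Sum all counts emitted for a given word.
    return word, sum(counts)


def run_job(documents):
    # Stand-in for the shuffle phase: group intermediate values by key,
    # then apply the reduce function to each group.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())


print(run_job(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```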

There are some interesting implementation tricks that MapReduce uses to get better performance. The two that I liked most are locality and replicating "straggler" tasks. Clusters that run MapReduce try to ensure that the computation is done on the same node that holds the data, which minimizes the time spent waiting for data to travel across the network. MapReduce clusters also replicate any tasks that seem to be taking much longer than the others. Oftentimes, the entire MapReduce job is mostly finished but is waiting on a few tasks to complete. By replicating these "straggler" tasks, they were able to cut down the total runtime by quite a bit.
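
To see why backup tasks help, here is a toy back-of-the-envelope simulation (my own, not Google's scheduler): one slow task dominates the job's completion time unless a duplicate copy is launched once most of the job has finished. The task counts and timings are made up.

```python
# A toy illustration of the backup-task idea: when a job is nearly done,
# launch duplicate copies of the remaining in-progress tasks and take
# whichever copy finishes first.

import random

random.seed(0)

NUM_TASKS = 100
task_times = [random.uniform(1.0, 2.0) for _ in range(NUM_TASKS)]
task_times[7] = 30.0  # one straggler, e.g. a task stuck on a slow machine

# Without backups, the job finishes only when the slowest task does.
without_backups = max(task_times)

# With backups: once ~90% of tasks are done, re-run the stragglers elsewhere
# and take the minimum of the original and the backup copy.
backup_times = [random.uniform(1.0, 2.0) for _ in range(NUM_TASKS)]
cutoff = sorted(task_times)[int(0.9 * NUM_TASKS)]
with_backups = max(
    t if t <= cutoff else min(t, cutoff + b)
    for t, b in zip(task_times, backup_times)
)

print(f"job time without backups: {without_backups:.1f}")
print(f"job time with backups:    {with_backups:.1f}")
```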

Obviously, MapReduce jobs aren't meant to run online; they should be used for offline analysis. They complete jobs with high throughput, and the model has had amazing success. Google uses it to index the web! So it has clearly had great impact in the field and will continue to do so. However, recently there has been a trend of trying to turn every long-latency computation into a MapReduce job. While the simplicity of the model does allow for such transformations, piping the output of one MapReduce job into another, and that into yet another, isn't going to get such great performance.

There has also been strong opposition to MapReduce from the databases community. They ask why one would want such a model when RDBMSs already provide everything MapReduce provides and more; however, RDBMSs don't scale as well as MapReduce does. The community is saying that MapReduce basically does the same things they've done before. I'm personally of the opinion that MapReduce and other such projects are all necessary stepping stones toward building distributed RDBMSs that scale on the cloud.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services

Megastore seeks to be a highly scalable storage system that can still support ACID transactions. Essentially, Megastore can be thought of as a bunch of RDBMSs. The data is partitioned into "entity groups". Within an entity group, all transactions have ACID semantics. Across entity groups, consistency is weaker and is ensured by using a two-phase commit model, with Paxos used to resolve consensus problems.
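
A conceptual sketch of how I understand the partitioning (my own Python, not Megastore's API; the "root/child" key format is made up for illustration): a transaction that stays within one entity group gets the full ACID path, while one that spans groups has to fall back to the heavier cross-group protocol.

```python
# A conceptual sketch, not Megastore's API: routing a transaction based on
# how many entity groups its keys touch.

def entity_group(key):
    # Everything sharing the same root entity belongs to one entity group.
    return key.split("/", 1)[0]


def choose_commit_path(keys_touched):
    groups = {entity_group(k) for k in keys_touched}
    if len(groups) == 1:
        # Single entity group: the transaction gets full ACID semantics.
        return "single-group ACID transaction"
    # Multiple groups: weaker consistency, coordinated via two-phase commit
    # (with Paxos used to reach consensus among the participants).
    return "cross-group transaction via two-phase commit"


print(choose_commit_path(["user123/photo1", "user123/photo2"]))  # same group
print(choose_commit_path(["user123/photo1", "user999/photo7"]))  # spans groups
```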

Overall, Megastore seems like a great way to gain scalability while still maintaining ACID. However, it's unclear to me whether application developers will understand how to partition their data correctly to fully leverage Megastore's consistency model. It also doesn't seem great that an application developer has to know the internals of a storage system to use it effectively. The storage system should optimize for the common cases, and only advanced users should need detailed knowledge of its internals. However, I'm sure future work will build on top of Megastore, so it is still a perfectly valid solution to the current lack of scalable storage systems.

While the use of Paxos does solve the consensus problem present in two-phase commit, it seems to add huge latencies. This definitely needs to be addressed, as such performance hits will not scale as the data grows.

The Google File System

The Google File System departs from traditional file system design in that it cares more about throughput than latency. It has a unique design in which writes are append-only and most reads are sequential. It's actually amazing how far you can get with just these operations. There is a single master node which holds all the metadata and forwards clients to the proper chunkserver. From then on, the client interacts with the chunkserver to get the actual data.
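
As a simplified sketch of that read path (my own toy Python, not the real GFS client library): the client asks the master for metadata only, then goes directly to a chunkserver for the bytes. The file name and contents below are made up.

```python
# A toy in-memory sketch of the GFS read path: metadata from the master,
# data from the chunkserver.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS used 64 MB chunks


class Master:
    """Holds only metadata: which chunkserver stores each chunk of each file."""

    def __init__(self, chunk_locations):
        self.chunk_locations = chunk_locations  # (filename, chunk index) -> chunkserver

    def lookup(self, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        return chunk_index, self.chunk_locations[(filename, chunk_index)]


class Chunkserver:
    """Stores the actual chunk data."""

    def __init__(self, chunks):
        self.chunks = chunks  # (filename, chunk index) -> bytes

    def read(self, filename, chunk_index, offset_in_chunk, length):
        data = self.chunks[(filename, chunk_index)]
        return data[offset_in_chunk:offset_in_chunk + length]


def client_read(master, filename, offset, length):
    # Step 1: ask the master where the chunk lives.
    chunk_index, server = master.lookup(filename, offset)
    # Step 2: fetch the bytes directly from that chunkserver.
    return server.read(filename, chunk_index, offset % CHUNK_SIZE, length)


# Toy usage with one file living in a single chunk on one chunkserver.
cs = Chunkserver({("logs/web-000", 0): b"append-only log records..."})
m = Master({("logs/web-000", 0): cs})
print(client_read(m, "logs/web-000", offset=12, length=11))  # b'log records'
```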

The system is obviously designed to work with big data, and I think it fits the Google workload very well. For archiving purposes, append-only writes are perfectly fine, and that is the tradeoff Google makes to have a distributed, scalable file system. Given the influence GFS has had on HDFS today, it's obvious that GFS has made an impact on the field, and as more file systems are built to target specific workloads, that impact will only become more lasting.