Wednesday, December 14, 2011

HIVE: Data Warehousing & Analytics on Hadoop and Pig Latin: A Not-So-Foreign Language for Data Processing

Oftentimes, a programmer has to run a whole chain of MapReduce jobs before getting the final output they're looking for. Keeping track of the relationships between these jobs and making sure the correct data is piped between them can be difficult. HIVE and Pig both concentrate on providing a simple language layer on top of MapReduce to make the programmer's job easier.

HIVE looks a lot like SQL. It's meant to serve as a layer for those who are familiar with databases. Queries are compiled down to MapReduce jobs that run over data in HDFS. Pig offers a procedural language, Pig Latin, with support for UDFs.
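
To make that compilation concrete, here is a rough Python sketch of what a simple HiveQL GROUP BY (the table and column names are made up by me) might boil down to: the grouping column becomes the map output key, and the aggregate becomes the reduce function.

    # Hypothetical query: SELECT status, COUNT(*) FROM logs GROUP BY status
    def map_phase(row):
        # emit the GROUP BY column as the key, with a count of 1
        yield (row["status"], 1)

    def reduce_phase(status, counts):
        # COUNT(*) becomes a sum over the emitted 1s for each key
        return (status, sum(counts))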

It seems both layers are missing some operations due to complexity. However, they both seem to do a good job of supporting the majority of operations. I do like the web interface that HIVE has for MapReduce jobs, and the debugging support in Pig seems promising. Overall, these layers on top of MapReduce are important because they lower the barrier to entry for anyone who wishes to use the cloud for any sort of computing.

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Dryad can be seen as a generalization of MapReduce. It models a computation as a DAG through which data flows between processing vertices. The authors show that MapReduce is just one possible DAG, and that the framework generalizes to many more. In addition, Dryad adds more methods of communication than MapReduce, which relies on files to pipe one stage into another. In Dryad, vertices may communicate using files, sockets, pipes, and even shared memory.
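
As a sketch of the idea (the names are mine, not Dryad's actual API), a job graph just needs vertices plus edges annotated with a channel type, and the MapReduce shape falls out as a special case.

    from collections import defaultdict

    class JobGraph:
        def __init__(self):
            self.edges = defaultdict(list)

        def connect(self, src, dst, channel="file"):
            # channel may be "file", "pipe", "socket", or "shared_memory"
            self.edges[src].append((dst, channel))

    # MapReduce as one possible DAG: every mapper feeds every reducer.
    g = JobGraph()
    for m in ("map_0", "map_1"):
        for r in ("reduce_0", "reduce_1"):
            g.connect(m, r, channel="file")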

Dryad does solve some of the problems that I had with MapReduce. It allows for efficient piping from one job to the next without having to go to disk each time. However, this comes at the cost of complexity. Exposed to all these knobs, a programmer may feel overwhelmed. Overall, I agree with the sentiment that expert programmers should use Dryad to build simpler platforms which other programmers can then use.

Dryad has not had as much success as MapReduce. I believe this is because of all the extra complexity it brings to the table. MapReduce remains simpler in people's minds, and so it is the go-to solution for a lot of the problems they face. Perhaps Dryad would have more success by presenting itself as a MapReduce platform with extra knobs that expert programmers can use to build interesting platforms.

MapReduce: Simplified Data Processing on Large Clusters

This paper introduces MapReduce, a new programming model. The model provides a very natural way to work with data-intensive workloads and scales amazingly well. The simplicity of the model means many different programs can be expressed in it, and even amateur developers should be able to use the system just fine.
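
The canonical example is word count. Here is a toy Python version, with a dict standing in for the shuffle phase that the real framework performs between map and reduce.

    from collections import defaultdict

    def mapper(document):
        # map: emit (word, 1) for every word in the input
        for word in document.split():
            yield (word, 1)

    def reducer(word, counts):
        # reduce: sum all counts emitted for a given word
        return (word, sum(counts))

    def run_job(documents):
        shuffle = defaultdict(list)
        for doc in documents:
            for word, count in mapper(doc):
                shuffle[word].append(count)
        return [reducer(w, c) for w, c in shuffle.items()]

    print(run_job(["the cat sat", "the dog sat"]))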

There are some interesting implementation tricks that MapReduce uses to get better performance. The two that I liked most are locality and replicating "straggler" tasks. Clusters that run MapReduce try to ensure that computation is done on the same node that holds the data, which minimizes time spent waiting for data to travel over the network. MapReduce clusters also replicate any tasks that seem to be taking much longer than the others. Oftentimes, the entire MapReduce job is mostly finished but waiting on a few tasks to complete; by replicating these "straggler" tasks, the authors were able to cut the total runtime down by quite a bit.
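
A toy sketch of the backup-task idea (my own simplification, not Google's actual scheduler): once most of the job is done, duplicate the slowest remaining tasks and take whichever copy finishes first.

    def pick_backup_tasks(progress, nearly_done=0.9, copies=2):
        """progress maps task id -> fraction complete (1.0 means done)."""
        done = sum(1 for p in progress.values() if p >= 1.0)
        if done / len(progress) < nearly_done:
            return []  # only speculate near the end of the job
        running = [t for t, p in progress.items() if p < 1.0]
        # the least-advanced running tasks are the straggler candidates
        return sorted(running, key=lambda t: progress[t])[:copies]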

Obviously, MapReduce jobs aren't meant to serve online requests; they should be used for offline analysis. The system completes jobs with high throughput and has had amazing success. Google uses it to index the web! So it has clearly had great impact in the field and will continue to do so. However, there has recently been a trend of trying to turn every long-latency computation into a MapReduce job. While the model's simplicity does allow for such transformations, piping the output of one MapReduce job into another, and that into yet another, isn't going to get great performance.

There has also been strong opposition to MapReduce from the databases community. They ask why one would want such a model when RDBMSs already provide everything MapReduce provides and more; the community says MapReduce basically does the same thing they've done before. But RDBMSs don't scale as well as MapReduce does. I'm personally of the opinion that MapReduce and related projects are all necessary stepping stones toward building distributed RDBMSs that scale on the cloud.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services

Megastore seeks to be a highly scalable storage system that still supports ACID transactions. Essentially, Megastore can be thought of as a bunch of RDBMSs. The data is partitioned into "entity groups". Within an entity group, all transactions have ACID semantics. Across entity groups, consistency is weaker; it is ensured by using a two-phase commit protocol, with Paxos used to resolve the underlying consensus problem.
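
A toy model of the entity-group idea (my own, not Megastore's API): a transaction whose writes all fall in one group takes the fast ACID path, while a cross-group transaction falls back to two-phase commit.

    class Store:
        def __init__(self):
            self.data = {}

        def commit(self, writes):
            # writes: {(entity_group, key): value}
            groups = {g for (g, _) in writes}
            if len(groups) == 1:
                self.data.update(writes)  # fast path: ACID within one group
            else:
                self.two_phase_commit(writes, groups)

        def two_phase_commit(self, writes, groups):
            # prepare on every group, then commit; sketched as one step here
            self.data.update(writes)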

Overall, Megastore seems like a great way to gain scalability while still maintaining ACID. However, it's unclear to me whether application developers will understand how to partition their data correctly to fully leverage Megastore's consistency model. It also doesn't seem great that an application developer has to know the internals of a storage system to use it effectively. The storage system should optimize for the common cases, and only advanced users should need detailed knowledge of its inner workings. Still, I'm sure future work will build on top of Megastore, so it is a perfectly valid solution to the current lack of scalable storage systems.

While the use of Paxos does solve the consensus problem inherent in two-phase commit, it seems to add huge latencies. This definitely needs to be addressed, as such performance hits will not scale as the data grows.

The Google File System

The Google File System departs from traditional file system design in that it cares more about throughput than latency. It has a unique design in which writes are append-only and most reads are sequential. It's actually amazing how far you can get with just these operations. A single master node holds all the metadata and points clients to the proper chunkserver. From then on, the client interacts with the chunkserver directly to get the actual data.
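
Roughly, the read path looks like this sketch (the method names are mine, not the real GFS interface): the client asks the master only for metadata, then fetches the bytes from a chunkserver, so data never flows through the master.

    CHUNK_SIZE = 64 * 2**20  # GFS uses 64 MB chunks

    def read(master, chunkservers, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        # master returns the chunk handle and the replica locations
        handle, locations = master.lookup(filename, chunk_index)
        # read the actual bytes from one of the replicas
        return chunkservers[locations[0]].read(handle, offset % CHUNK_SIZE)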

The system is obviously designed to work with big data, and I think it fits Google's workload very well. For archiving purposes, append-only writes are perfectly fine, and that is the tradeoff Google makes to get a distributed, scalable file system. Given the influence GFS has had on today's HDFS, it has clearly made an impact on the field, and as more file systems are built to target specific workloads, that impact will prove a lasting one.

Monday, November 21, 2011

Paxos Made Simple

Paxos Made Simple deals with the distributed consensus problem. In the algorithm, there are proposers, acceptors, and learners. Proposers send prepare requests with a number n to the acceptors. The acceptors keep track of the highest n they have seen and send back promises saying that they will not accept any proposal with a lower n. A proposer that receives promises from a majority then sends an accept request with a value v and that same n as its proposal. Eventually, a majority of acceptors come to agree on the same value. However, Paxos is open to a livelock scenario in which two proposers keep sending prepares with ever-greater numbers. For this scenario, one can simply have a distinguished proposer to ensure progress.
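
The acceptor's side of the protocol is small enough to sketch in a few lines of Python (a single consensus instance, with no persistence or failure handling):

    class Acceptor:
        def __init__(self):
            self.promised_n = -1
            self.accepted = None  # (n, v) of the highest proposal accepted

        def prepare(self, n):
            if n > self.promised_n:
                self.promised_n = n
                # promise not to accept anything numbered below n, and
                # report any value already accepted
                return ("promise", self.accepted)
            return ("reject", None)

        def accept(self, n, v):
            if n >= self.promised_n:
                self.promised_n = n
                self.accepted = (n, v)
                return "accepted"
            return "rejected"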

Paxos Made Practical shows a detailed implementation of Paxos. However, the implementation seems to be much more difficult than the algorithm presented in Paxos Made Simple. This is because the implementation must account for the dynamic addition and removal of nodes in the group, and things like timestamped views are necessary to make sure the nodes in the group have the same state, as required by the RSM model.

The Chubby lock service is meant to provide a simpler route to the same sort of distributed decision making that Paxos provides. With Chubby, the client does not need its own library to synchronize with the other nodes; instead, it accesses a centralized lock server. The lock server also has failover mechanisms that cope with failing masters through leader election.

Overall, distributed decision making is obviously useful functionality, especially for the RSM model, and Paxos and Chubby locks provide ways to achieve it. However, it seems Paxos may be too complicated for practical implementation, and services like Chubby may be more realistic.

Monday, September 19, 2011

Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

The CAP theorem says that consistency, availability, and partition-tolerance cannot all be guaranteed within the same system.

Consistency means that views of the data should remain the same among distributed nodes, even as operations are performed on the data. Availability means that the data should be readily available. Partition-tolerance is the ability of the system to survive cutoffs in the network.

The theorem is proved roughly as follows. Assume we have two nodes, N1 and N2, and that N1 does a write on object v, changing its value from v0 to v1. Assume that, around the same time, N2 does a read on v. N2's copy of v needs to be synchronized with N1's change before N2 reads it. If we demand consistency (N2 must read v as v1), it is impossible to guarantee both availability and partition-tolerance, because everything we can do to guarantee that N2's read happens after the synchronization (blocking, central management, etc.) affects either availability or partition-tolerance. So all three properties cannot be guaranteed at the same time.
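
A toy illustration of the dilemma (mine, not from the paper): during a partition, N2's read must either return its possibly-stale copy or refuse to answer.

    def read_at_n2(local_value, partitioned, prefer_consistency):
        if not partitioned:
            return local_value  # synchronization delivered v1; consistent
        if prefer_consistency:
            # block or fail until the partition heals: gives up availability
            raise TimeoutError("cannot confirm latest value")
        # answer immediately: available, but the value may still be v0
        return local_value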

I think this should have a great impact on how programmers write code for clusters, and perhaps they should consider the BASE model more seriously.