Wednesday, December 14, 2011

HIVE: Data Warehousing & Analytics on Hadoop and Pig Latin: A Not-So-Foreign Language for Data Processing

Oftentimes, a programmer has to run a bunch of MapReduce jobs before getting the final output they're looking for. Keeping track of the relationships between these MapReduce jobs and making sure the correct data is piped between them can be difficult. HIVE and Pig both concentrate on providing a simple language layer on top of MapReduce to make the job easier for programmers.

HIVE looks a lot like SQL. It's meant to serve as a layer for those who are familiar with databases. Queries are compiled down to MapReduce jobs that run over HDFS. Pig is a procedural language with support for UDFs.
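To make the appeal concrete, here is a rough sketch of the kind of work these layers save (my own illustration, not code from either paper): a one-line grouping query, something like the hypothetical HiveQL statement in the comment, versus the map and reduce functions a programmer would otherwise have to write and wire up by hand.

    # Hypothetical query: SELECT word, COUNT(*) FROM docs GROUP BY word;
    # Roughly what it compiles down to: a map that emits (word, 1) and a
    # reduce that sums the counts for each word.

    def map_phase(line):
        for word in line.split():
            yield (word, 1)

    def reduce_phase(word, counts):
        yield (word, sum(counts))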

It seems both layers are missing some operations due to complexity. However, they both seem to do a good job of supporting the majority of operations. I do like the web interface that HIVE has for MapReduce jobs, and the debugging support in Pig seems promising. Overall, these layers on top of MapReduce are important because they lower the barrier for anyone who wishes to use the cloud for any sort of computing.

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Dryad can be seen basically as an extension of MapReduce. It models a computation as a DAG whose nodes are sequential programs and whose edges carry the data flowing between them. The authors show that MapReduce is just one possible DAG, and that the framework generalizes to many more shapes. In addition, Dryad adds more methods of communication than MapReduce, which relied on files to pipe one stage into another. In Dryad, nodes may communicate using files, sockets, pipes, and even shared memory.
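As a toy illustration of the DAG model (my own names, not Dryad's actual API), each vertex is an ordinary sequential program and each edge is a channel; a vertex can have several inputs and outputs, which is exactly the shape MapReduce's fixed two-stage pipeline cannot express.

    # Vertices and the upstream vertices they read from.
    graph = {
        "parse":  [],
        "lookup": [],
        "filter": ["parse"],
        "join":   ["filter", "lookup"],   # multiple inputs into one vertex
        "output": ["join"],
    }

    def run(vertex, execute, done=None):
        # Run each vertex once all of its upstream channels have produced data.
        done = {} if done is None else done
        if vertex not in done:
            inputs = [run(v, execute, done) for v in graph[vertex]]
            done[vertex] = execute(vertex, inputs)
        return done[vertex]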

Dryad does solve some of the problems that I had with MapReduce. It allows for efficient piping from one stage to another without having to go to disk each time. However, this comes at the cost of complexity. By exposing all these knobs for the programmer to play with, Dryad may leave the programmer feeling overwhelmed. Overall, I agree with the sentiment that expert programmers should use Dryad to build simpler platforms which other programmers can use.

Dryad has not had as much success as MapReduce. I believe this is because of all the extra complexity it has brought to the table. MapReduce remains simpler in people's minds, and so is the go-to solution for a lot of the problems they face. Perhaps Dryad would fare better by masquerading as a MapReduce platform with extra knobs that expert programmers can use to build interesting platforms.

MapReduce: Simplified Data Processing on Large Clusters

This paper introduces MapReduce, a new programming model. The model provides a very natural way to work with data-intensive loads and scales amazingly well. The simplicity of the model allows lots of different programs to use it, and even amateur developers should be able to use the system just fine.

There are some interesting implementation tricks that MapReduce uses to get better performance. The two that I liked most are locality and replicating "straggler" tasks. Clusters that run MapReduce try to ensure that computation is done on the same node as the data. This ensures that minimal time is spent waiting for data to travel over the network. MapReduce clusters also replicate any tasks that seem to be taking much longer than the others. Oftentimes, the entire MapReduce job is mostly finished but waiting on a few tasks to complete. By replicating "straggler" tasks, they were able to cut down the total runtime of the system by quite a bit.
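A minimal sketch of the straggler idea as I understand it (the field names and thresholds below are my own, not the paper's): once most of a job has finished, any task that has been running far longer than a typical task gets a duplicate copy launched elsewhere, and whichever copy finishes first wins.

    import time

    def maybe_launch_backups(tasks, launch_copy, now=time.time):
        finished = [t for t in tasks if t.done]
        if len(finished) < 0.9 * len(tasks):
            return  # only worry about stragglers near the end of the job
        # Use the median runtime of finished tasks as a rough baseline.
        typical = sorted(t.duration for t in finished)[len(finished) // 2]
        for t in tasks:
            running_for = now() - t.start_time
            if not t.done and not t.has_backup and running_for > 2 * typical:
                launch_copy(t)        # schedule a duplicate on another worker
                t.has_backup = True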

Obviously, MapReduce jobs aren't meant to run online, but should be used for offline analysis. They complete jobs with high throughput, and the model has had amazing success. Google uses it to index the web! So, it has clearly had great impact in the field and will continue to do so. However, recently there has been a trend to try to turn every long-latency computation into a MapReduce job. While the simplicity of the model does allow for such transformations, piping the output of one MapReduce job into another, and that into yet another, isn't going to yield great performance.

There has also been strong opposition from the databases community against MapReduce. They ask why one would want such a model when RDBMSs already provide everything that MapReduce provides and more, but the reality is that RDBMSs don't scale as well as MapReduce. The community is saying that MapReduce basically does the same thing they've done before. I'm personally of the opinion that MapReduce and other projects are all necessary stepping stones for building distributed RDBMSs that scale on the cloud.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services

Megastore seeks to be a highly scalable storage system that can still support ACID transactions. Essentially, Megastore can be thought of as a bunch of RDBMSs. The data is partitioned into "entity groups". Within an entity group, all transactions have ACID semantics. Across entity groups, consistency is weaker, and is ensured using a two-phase commit model, with Paxos used to reach consensus among replicas.

Overall, Megastore seems like a great way to gain scalability while still maintaining ACID. However, it's unclear to me whether application developers will be able to understand how to partition their data correctly to fully leverage Megastore's consistency model. It also doesn't seem great that an application developer has to have knowledge of the internals of a storage system to use it effectively. The storage system should optimize for the common cases, and only advanced users should need detailed knowledge of the inner system. However, I'm sure future work will build on top of Megastore, so it is still a perfectly valid solution to the current lack of scalable storage systems.

While the use of Paxos does solve the consensus problem present in two-phase commit, it seems to add huge latencies. This definitely needs to be addressed, as such performance hits will not scale as the data grows.

The Google File System

The Google File System departs from traditional file system design in that it cares more about throughput than latency. It has a unique design in which writes are essentially append-only and most reads are sequential. It's actually amazing how far you can get with just these operations. There is a single master node which holds all the metadata and forwards clients to the proper chunkserver. From then on, the client interacts with the chunkserver to get the actual data.
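Here is a rough sketch of the read path as I understand it from the paper; the class and method names are made up for illustration. The point is that the master serves only metadata, while the bytes themselves flow directly between the client and a chunkserver.

    CHUNK_SIZE = 64 * 1024 * 1024  # GFS stores files as fixed-size 64 MB chunks

    class Client:
        def __init__(self, master, chunkservers):
            self.master = master              # single metadata server
            self.chunkservers = chunkservers  # map: server id -> chunkserver

        def read(self, path, offset, length):
            # 1. Translate the byte offset into a chunk index.
            chunk_index = offset // CHUNK_SIZE
            # 2. Ask the master which chunkservers hold that chunk (metadata only).
            handle, replica_ids = self.master.lookup(path, chunk_index)
            # 3. Fetch the data from one replica; the master never touches it.
            server = self.chunkservers[replica_ids[0]]
            return server.read_chunk(handle, offset % CHUNK_SIZE, length)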

The system is obviously designed to work with big data, and I think it fits the Google workload very well. For archiving purposes, append-only writes are perfectly fine, and that is the tradeoff Google makes to get a distributed, scalable file system. Given the influence GFS has had on HDFS today, it's clear it has already made an impact, and as more file systems are built to target specific workloads, that impact will only become more lasting.

Monday, November 21, 2011

Paxos Made Simple

Paxos Made Simple deals with the distributed consensus problem. In the algorithm, there are proposers, acceptors, and learners. Proposers send prepare requests with a number n to the acceptors. The acceptors keep track of the highest n they have seen and send back promises saying that they will not accept any proposal with a lower n value. A proposer who receives promises from a majority then sends an accept request with a value v and its n as the proposal. Eventually, a majority of acceptors come to agree on the same value. However, Paxos is open to a livelock scenario in which two proposers keep sending prepares with numbers greater than before. For this scenario, one can just have a distinguished proposer to ensure progress.
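A minimal sketch of the acceptor's side of the protocol, to make the prepare/accept exchange concrete (the message shapes and names are my own, not Lamport's):

    class Acceptor:
        def __init__(self):
            self.promised_n = -1     # highest prepare number we have promised
            self.accepted_n = -1     # number of the proposal we last accepted
            self.accepted_v = None   # value of that proposal, if any

        def on_prepare(self, n):
            # Promise not to accept anything numbered below n, and report
            # any proposal we already accepted so the proposer can adopt
            # its value instead of its own.
            if n > self.promised_n:
                self.promised_n = n
                return ("promise", self.accepted_n, self.accepted_v)
            return ("reject",)

        def on_accept(self, n, v):
            # Accept unless we have promised a higher-numbered proposer.
            if n >= self.promised_n:
                self.promised_n = n
                self.accepted_n, self.accepted_v = n, v
                return ("accepted", n, v)
            return ("reject",)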

Paxos Made Practical shows a detailed implementation of Paxos. However, the implementation seems to be much more difficult than the algorithm presented in Paxos Made Simple. It seems this is because the implementation must account for the dynamic addition and removal of nodes in the group, and things like timestamped views are necessary to make sure the nodes in the group share the same state, as required by the RSM model.

The Chubby lock service is meant to provide a simpler interface to the same sort of distributed decision making that Paxos provides. However, with Chubby, the client does not need its own library to synchronize with the other nodes; instead, clients access a centralized lock server. The lock server also has failover mechanisms to cope with failing masters through leader election.

Overall, distributed decision making is obviously useful functionality, especially for the RSM model, and Paxos and Chubby provide ways to do it. However, it seems Paxos may be too complicated for practical implementation, and services like Chubby may be more realistic.

Monday, September 19, 2011

Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

The CAP theorem says that consistency, availability, and partition-tolerance cannot all be guaranteed within the same system.

Consistency means that views of the data should remain the same among distributed nodes, even when operations are performed on the data. Availability means that data should be readily available. Partition-tolerance is the ability of the system to live through cutoffs in the network.

The theorem is basically proved as follows. Assume we have two nodes, N1 and N2. Assume that N1 does a write on object v, causing the value to change from v0 to v1. Assume, around the same time, N2 does a read on the object v. N2's copy of v needs to be synchronized with N1's change before N2 does the read. If we assume consistency (N2 should read v as v1), it is impossible to guarantee both availability and partition-tolerance, because everything we can do to guarantee that N2's read happens after the synchronization (blocking, central management, etc.) affects either availability or partition-tolerance. So, all three properties cannot be guaranteed at the same time.

I think this actually should have great impact on how programmers write code for clusters, and perhaps make them consider the BASE model more seriously.

Cluster-Based Scalable Network Services

This paper helps lay out the design for services that run on top of clusters. The paper starts out by describing the advantages and challenges of using clusters versus high-performance computers like SMPs. The actual design of the scalable network service contains front ends, workers, and various databases. The design concentrates on BASE (Basically Available, Soft State, Eventual Consistency) semantics rather than ACID (Atomicity, Consistency, Isolation, Durability). However, there can be access to ACID components as well, for things like billing. Then, above the SNS, programmers can use the TACC (Transformation, Aggregation, Caching, and Customization) programming model, which mirrors UNIX filters and pipes.

The model above really lends itself to some interesting properties. There is a very natural way to degrade gracefully when bandwidth is low. Load balancing is easy with a central manager. Services can be created easily because of their specificity. It seems the paper presents a really good model for how to build clusters and implement services on top of them.

The paper mentions overflow workers. Although it didn't exist at the time, the cloud could now be leveraged to handle these overflow workers.

Wednesday, September 14, 2011

The Datacenter Needs an Operating System

This paper argues for the need for an operating system for datacenters. The paper makes great points in explaining that there are now many applications for cluster computing, and that an operating system should be created to provide the ecosystem these applications need. An operating system could allow finer-grained resource sharing and data sharing across different cluster applications. It would also provide programming abstractions that give a uniform interface to the underlying hardware, which may lead to applications becoming more compatible with one another. In addition, OSs designed for the datacenter could implement the debugging and monitoring facilities that are so hard to come by in cluster computing. Overall, the paper makes its arguments very well and details exactly why these features would improve cluster computing.

One comment I have is that a really great point was brought up in the resource-sharing discussion concerning the role of virtualization. The authors mention that VM migration and VMs in general might simplify scheduling over a large datacenter. I agree with this, and I think we can take it further to consider proactive load balancing in datacenters. Proactive load balancing has already proven itself to be very effective in the HPC world, so I think it would improve cluster computing quite a bit as well.

Another comment I have is that this paper doesn't really address energy at all. I think one of the great advantages of having an OS for the datacenter is that, with a broader view of scheduling, it would be possible to shut down entire racks of computers when they're not needed, and to preemptively start them up if big batches of jobs are incoming. This would be another great benefit of having OSs for datacenters.

Monday, September 12, 2011

Graphics Processing Units (GPUs): Computer Architecture: A Quantitative Approach, 5th Edition, Chapter 4

This reading talks about three architectural approaches to doing more computation on data: vector architectures, SIMD instruction set extensions for multimedia, and GPUs.

The reading first goes through each of these architectures in close detail. Vector architectures were available before but unpopular. Increased memory bandwidth and cheaper transistors have allowed them to become viable options today. Batching operations over a whole vector of operands helps the vector architecture gain great speedups, especially in loops and matrix code. In addition, various features have been added to take care of problems like data dependences between operands (chaining) and handling of if clauses in the middle of vector operations (vector-mask control).

SIMD extensions are a sort of mini vector operation. Instead of performing vector operations on 64 64-bit operands, the goal is to target multimedia processing applications, where the common unit sizes are 8-bit or 16-bit. So, the instructions might operate on 32 8-bit operands. The SIMD extensions offer a compromise between vector architectures and today's popular architectures.
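As a loose analogy (not code from the chapter), the difference between scalar and vector execution is the difference between the element-at-a-time loop and the whole-array expression below; the hardware versions apply one instruction to a vector or SIMD register full of operands in the same spirit.

    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # Scalar style: one element per iteration.
    c = np.empty_like(a)
    for i in range(len(a)):
        c[i] = 2.0 * a[i] + b[i]

    # Vector style: one operation over all the operands at once.
    c_vec = 2.0 * a + b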

GPUs are a completely different beast from the other two architectural approaches. GPUs are basically processing units with thousands of parallel SIMD lanes, meaning all sorts of parallelism can be exploited on them. Nvidia's CUDA allows programmers to easily express what they wish to run on the host CPU and what to run on the GPU.

The reading then goes into a brief tutorial on detecting loop-level parallelism and getting rid of dependent computations like recurrences. It also discusses other architectural factors that affect the three main approaches: memory bandwidth, compute bandwidth, cache benefits, gather-scatter, and synchronization.

The reading concludes with some fallacies and pitfalls about the previously mentioned architectures, none of which are too hard to understand.

Overall, the reading did a great job of discussing the architectural support available these days for data-level parallelism.

Solid State Devices (SSDs): Performance Modeling and Analysis of Flash-based Storage Devices

This paper discusses the black box modeling of SSD devices.

The work extends previous work by coming up with an improved model which performs linear regression not only on read/write ratio, queue depth, request size, and randomness, but also on write size, read size, write stride, read stride, read randomness, and write randomness. The authors evaluate their models on synthetic I/O traces and 4 real traces, predicting latency, bandwidth, and I/O throughput. Overall, it shows that incorporating the new workload characteristics improves significantly on previous hard-drive models, and that SSDs are much easier to predict than hard drives.
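A toy sketch of the black-box approach as I read it: fit a linear model from workload characteristics to a measured performance metric and use it to predict new workloads. The feature columns below are a subset of the characteristics the paper discusses; the numbers are made up.

    import numpy as np

    # columns: write fraction, queue depth, request size (KB), write randomness
    X = np.array([
        [0.2,  4,   4, 0.1],
        [0.8, 16,  64, 0.9],
        [0.5,  8,  16, 0.5],
        [0.1, 32, 128, 0.0],
    ])
    y = np.array([0.12, 0.95, 0.40, 0.30])   # e.g. measured latency in ms

    # Least-squares fit with an intercept term.
    A = np.hstack([X, np.ones((len(X), 1))])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

    # Predict latency for a new synthetic workload (the trailing 1 is the intercept).
    new = np.array([0.6, 8, 32, 0.7, 1.0])
    print(new @ coeffs)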

The paper is really great in that it shows a lot of data and a lot of graphs. The authors did a great job capturing the asymmetry between reads/writes to SSD devices through the extra characteristics; the workloads really do capture the different corner cases very well.

It would have been nice if the authors had come up with some use cases to motivate the different workload cases. This could also have been expanded into common setups that people actually have, like journaling file systems or running multiple applications with various I/O needs. It would have been interesting to look at the modeling of SSDs from this higher-level perspective.

The authors should have mentioned the page sizes for the SSD devices or said that they didn't know.

The steep increase in the slope of random writes in Figure 6(c) is truly puzzling. I was under the impression that random and sequential accesses to SSDs are supposed to have similar performance. What could be causing this huge slope? Too bad the authors didn't talk about it.

It seems from the results that although predictions of reads are quite good, predictions of writes still need a lot of work. It would have been nice if the authors had included a section describing how future researchers might make progress on this problem. What additional characteristics would help better model write performance? Perhaps a counter that tracks the number of updated pages in the current block; that way the model could better anticipate block erases and updates.

Multicore CPUs: Amdahl's Law in the Multicore Era

This short paper talks about the application of Amdahl's Law to multicore chips. Amdahl's Law was once used to argue that hardware manufacturers should concentrate on high-powered single-core processors, but now, in the parallel age, there are many computations that are embarrassingly parallel, and much more computation can be done on data thanks to these parallel machines.

The paper looks into 3 types of multicore chips: symmetric, asymmetric, and dynamic. In the symmetric model, all cores use the same amount of resources. In the asymmetric chip, there is a single large core which uses a large portion of the resources, and the rest of the resources go to small, identical cores. In the dynamic model, the authors imagine a chip in which the resources can be used either as many symmetric cores or as one large serial-processing core. The results show that the asymmetric chip achieves much greater speedup than the symmetric chip, and the dynamic model obviously has the potential for incredible speedup.
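If I remember the paper's model correctly, a chip has n base-core-equivalent resources, a core built from r of them has performance perf(r) = sqrt(r), and a fraction f of the work is parallelizable; the three speedups then look roughly like:

\[
S_{\text{sym}} = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f \cdot r}{\mathrm{perf}(r)\cdot n}},\qquad
S_{\text{asym}} = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}},\qquad
S_{\text{dyn}} = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{n}}
\]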

I think this was a great simple way to adapt Amdahl's Law in the parallel age. The authors leverage this well to show what areas of research we should be concentrating on. However, there are a few problems I had with this paper.

The first thing I found questionable about this paper was the limited number of configurations the authors tried, both theoretically and practically. For example, it would have been interesting to explore, theoretically, a configuration in which the smaller cores in the asymmetric chip each use more than one unit of resources. Also, practically, it's much more likely that, for the dynamic model, only a few of the cores can be combined to boost the sequential component, as opposed to all of the resources.

A depressing point is that the majority of the lines shown in the graphs have f >= 0.9. Even with this small variation in the amount of parallelism, there are huge changes in the slope and height of these lines. A lot of problems have f < 0.9, and their lines would all lie in the area between f = 0.9 and f = 0.5: the area where adding cores does not really seem to help speed up the computation.

The final thing I'd like to say is that the authors assume, for the dynamic model, that perf(r) = sqrt(r), just like in the other scenarios. However, it is much more likely that perf(r) will be much less than sqrt(r) in practice. The authors should not have used the same function; this makes the graphs very misleading.

Wednesday, September 7, 2011

Warehouse-Scale Computing: Entering the Teenage Decade (video presentation, Luiz Andre Barroso, Google)

This talk addresses how cloud computing is changing today.

Barroso gives a short introduction explaining the usefulness of cloud computing, and how the pervasiveness of the Internet has helped cloud computing really take off.

He discusses how flash may help cloud providers, but with a few restrictions. The limited durability of flash hurts the cost-benefit ratio, and the overall cost of flash is often not worth it. However, flash is fast, and so it would be really great if it could be used in datacenters.

It seems servers are now becoming more energy-efficient, and the energy proportionality slope is getting closer to 1. This is great because most servers are not run at 100% utilization, so we care a lot more about how energy-efficient servers are while running at 20%-50%.

Barroso also makes a very good comment that although hardware (specifically I/O) has been getting faster, a lot of the software stacks have not been updated for the new bandwidths the hardware makes available. I agree that a lot of the computing world has abandoned latency for throughput. Barroso makes a good point in saying that now that hardware has improved, we should go back and look at how we would redesign software to provide better latencies on the new hardware.

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (chapters 3, 4, 7)

This reading talks a lot about the costs of maintaining a datacenter.

Chapter 3 starts off by rationalizing the use of cheap servers over more expensive servers in datacenters. The authors show that the increase in performance from using expensive servers doesn't justify their cost, and it is much more cost-efficient to buy more low-end servers. Of course, with servers that are too cheap, cloud users will notice increases in response time, which is not desirable. Basically, it is a tradeoff.

One small thing I wanted to mention about this chapter is that the 100 microsecond estimate for network latencies seems really small. In a warehouse full of computers, it seems going from one end to the other would take a lot longer. Also, the services would probably run on VMs, and that would further increase network latency, so I think the 100 microsecond figure may be an underestimate.

Chapter 4 talks about the utilities required to run a datacenter. It covers the various types of redundancy available to datacenters, how power is distributed within the datacenter using redundant power supplies, and how typical datacenters remove hot air.

Chapter 7 talks about the availability of datacenters. Once you have thousands of very low-end machines, it is very likely that some of them will break, so multiple topics surrounding this problem are discussed. Something I found interesting is that the number of restarts required among machines in the datacenter is quite high. Even the ones that are supposed to run well still require around 2-3 restarts a year. I was not aware that these machines had to be restarted so often. It's really interesting that even though some machines have to be restarted, services continue uninterrupted.

Wednesday, August 31, 2011

Above the Clouds: A Berkeley View of Cloud Computing

Due to all the media buzz about cloud computing, it seems no one really knows what cloud computing is really about, and it is the aim of this paper to fix that. The paper serves as a really great introduction to the cloud computing research space. The paper lists the advantages of cloud computing over simple SaaS, defines terms for the various parts of the cloud computing model, and poses some preliminary challenges that need to be solved for cloud computing to be truly successful.

The authors do a really great job of detailing the economics around cloud computing. The paper surveys the costs, benefits, and models of current cloud providers. It also talks about the problems that accompany owning your own datacenter: underprovisioning and overprovisioning. Clear, concrete examples regarding rapid peaks and troughs show that cloud computing is really a much better model for most web services, due to the elasticity of utility computing services.


The paper also provides an equation which is supposed to tell you whether you should switch to cloud computing or not. Although the equation does its job, I do think that it doesn't really express one of the key economic reasons why cloud computing is superior: it doesn't show that cloud computing is able to easily handle spikes in traffic due to its elasticity.
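For reference, the tradeoff the paper formalizes is, roughly as I remember it, that moving to the cloud pays off when the expected profit per user-hour in the cloud beats the profit in your own datacenter once its utilization is accounted for:

\[
\mathrm{UserHours}_{\text{cloud}} \times (\text{revenue} - \mathrm{Cost}_{\text{cloud}}) \;\ge\; \mathrm{UserHours}_{\text{datacenter}} \times \left(\text{revenue} - \frac{\mathrm{Cost}_{\text{datacenter}}}{\mathrm{Utilization}}\right)
\]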

In addition, although the paper does mention the problem of data confidentiality in the cloud, one of the big reasons why a lot of companies refuse to use the cloud, it seems to sort of dismiss it by just saying to encrypt the data you send to the cloud. However, the data is then exposed during execution, while the service is running on the cloud. Some of these companies would like to keep the data secret during runtime as well, because the code is running on a different company's machines. Although TPMs sort of solve this problem, they're not foolproof and increase the cost of commodity servers in the cloud, so cloud providers may refuse to get them.

The paper also talks about possibly using flash for memory and storage to ease the problem of I/O interference on commodity machines. However, flash is a relatively new technology and is expensive. Because cloud providers have a much bigger incentive to keep their servers cheap, it's unlikely that flash will be bought for each server.

The paper suggests that the way to solve availability of service is to use multiple cloud providers. However, there aren't very many cloud providers, and all of them have different environments. This makes it hard for developers to write services that utilize multiple cloud providers, so availability of service is not quite a solved problem.

There also seems to be a lack of emphasis on the legal issues surrounding cloud computing. For example, if I use EC2 units to provide a SaaS that will do taxes and EC2 goes down right before tax day, who is liable? Shouldn't I, as the app writer, be able to deflect any lawsuits towards Amazon since I was not at fault? This becomes an even bigger mess when I'm in a different country from the cloud provider.