The CAP theorem says that consistency, availability, and partition-tolerance cannot all be guaranteed within the same system.
Consistency means that views of data should remain the same among distributed nodes, even when operations are done to the data. Availability means that data should be readily available. Partition-tolerance is the ability of they system to live through cutoffs in the network.
The theorem is basically proved as follows. Assume we have two nodes, N1 and N2. Assume that N1 does a write on object v, causing the value to change from v0 to v1. Assume, around the same time, N2 does a read on the object v. N2's copy of v needs to be synchronized with N1's change before N2 does the read. If we assume consistency (N2 should read v as v1), it is impossible to guarantee both availability and partition-tolerance because everything we can do to guarantee that N2's read happens after the synchronization (blocking, central management, etc.) effects either availability or partition-tolerance. So, all three properties cannot be guaranteed at the same time.
I think this actually should have great impact on how programmers write code for clusters and perhaps consider the BASE model more seriously.
Monday, September 19, 2011
Cluster-Based Scalable Network Services
This paper helps lay out the design for services that run on top of clusters. The paper starts out by describing the advantages and challenges of using clusters versus high performance computers like SMPs. The actual design of the scalable network service contains front ends, workers, and various databases. The design concentrates on the BASE (Basically Available, Soft State, Eventual Consistency) rather than ACID (Atomicity, Consistency, Isolation, Durability). However, there can be access to ACID components as well, for things like billing. Then, above the SNS, programmers can use the TACC (Transformation, Aggregation, Caching, and Customization) programming model, which mirrors UNIX filters and pipes.
The model above really lends itself to achieve some interesting properties. There is a very natural way to do degradation if bandwidth is low. Load balancing is easy with a central manager. Services can be easily created because of their specificity. It seems the paper presents a really model way to build clusters and implement services on top of them.
The paper mentions overflow workers. Although it didn't exist at the time, now the cloud be could leveraged to handle these overflow workers.
The model above really lends itself to achieve some interesting properties. There is a very natural way to do degradation if bandwidth is low. Load balancing is easy with a central manager. Services can be easily created because of their specificity. It seems the paper presents a really model way to build clusters and implement services on top of them.
The paper mentions overflow workers. Although it didn't exist at the time, now the cloud be could leveraged to handle these overflow workers.
Wednesday, September 14, 2011
The Datacenter Needs an Operating System
This paper argues the for the need of an operating system on datacenters. The paper makes great points in explaining that there are now many applications for cluster computing, and that an operating system should be created to provide the ecosystem these applications need. An operating system could allow finer-grained resource sharing and data sharing across different cluster applications. It would also provide programming abstractions to provide a uniform interface to the underlying hardware, which may lead to applications becoming more compatible with one another. In addition, OSs designed for the datacenter could implement debugging and monitoring facilities that are so hard to come by with cluster computing. Overall, the paper makes its arguments very well and details exactly why the previous features would improve cluster computing.
One comment I have to make about this is that a really great point was brought up in the resource sharing concerning the role of virtualization. The authors mention that VM migration and VMs in general might simplify the scheduling over a large datacenter. I agree with this, and I think we can take it further to possibly consider proactive load balancing in datacenters. Proactive load balancing has already proven itself to be very effective in the HPC world, so I think it would improve cluster computing quite a bit as well.
Another comment I have is that this paper doesn't really seem to address energy at all. I think one of the great advantages of having an OS for the datacenter is that now, with a broader view of scheduling, it would be possible to shut down entire racks of computers if not needed, and to preemptively to start these up if big batches of jobs are incoming. This would be another great benefit from having OSs for datacenters.
One comment I have to make about this is that a really great point was brought up in the resource sharing concerning the role of virtualization. The authors mention that VM migration and VMs in general might simplify the scheduling over a large datacenter. I agree with this, and I think we can take it further to possibly consider proactive load balancing in datacenters. Proactive load balancing has already proven itself to be very effective in the HPC world, so I think it would improve cluster computing quite a bit as well.
Another comment I have is that this paper doesn't really seem to address energy at all. I think one of the great advantages of having an OS for the datacenter is that now, with a broader view of scheduling, it would be possible to shut down entire racks of computers if not needed, and to preemptively to start these up if big batches of jobs are incoming. This would be another great benefit from having OSs for datacenters.
Monday, September 12, 2011
Graphic Processing Units (GPUs): Computer Architecture, Fifth Edition: A Quantitative Approach, 5th edition , Chapter 4
This reading talks about 3 architectural changes that can increase computation on data: vector architecture, SIMD instruction set extensions for multimedias, and GPUs.
The reading first goes through each of these architectures in close detail. Vector architectures were available before, but unpopular. Increased memory bandwidth and cheaper transistors have allowed them to become viable options today. Bunching operations on a line of operands helps the vector architecture gain great speedups, especially in loops and matrices. In addition, various additions have been made to take care of problems like data dependencies between operands (chaining) and handling of if clauses in the middle of vector operations (vector-mask control)
SIMD extensions are a sort of mini-vector operations. Instead of performing vector operations on 64 64-bit operands, the goal is to target multimedia processing applications, where the common sizes for units are 8-bit or 16-bit. So, the instructions might be on 32 8-bit operands. The SIMD extensions offer a compromise between vector architectures and today's popular architectures.
GPUs are a completely different beast than the other two architectural changes. GPUs are basically processing units with thousands of parallel SIMD lanes, meaning all sorts of parallelization can be exploited with GPUs. Nvidia's CUDA allows programmers to easily express what they wish to be run on the host CPU and what to be run on the GPU.
The reading then goes into a brief tutorial of how to detect loop-level parallelism, and getting rid of dependent computations like recurrence. The paper also discusses other architectural changes that may affect the three main architectural changes and lists factors in these changes: memory bandwidth, compute bandwidth, cache benefits, gather-scatter, and synchronization.
The reading concludes with some fallacies and pitfalls about the previously mentioned architectures with statements, none of which aren't too hard to understand.
Overall, the reading did a great job of discussing the architectural support available these days for data-level parallelism.
The reading first goes through each of these architectures in close detail. Vector architectures were available before, but unpopular. Increased memory bandwidth and cheaper transistors have allowed them to become viable options today. Bunching operations on a line of operands helps the vector architecture gain great speedups, especially in loops and matrices. In addition, various additions have been made to take care of problems like data dependencies between operands (chaining) and handling of if clauses in the middle of vector operations (vector-mask control)
SIMD extensions are a sort of mini-vector operations. Instead of performing vector operations on 64 64-bit operands, the goal is to target multimedia processing applications, where the common sizes for units are 8-bit or 16-bit. So, the instructions might be on 32 8-bit operands. The SIMD extensions offer a compromise between vector architectures and today's popular architectures.
GPUs are a completely different beast than the other two architectural changes. GPUs are basically processing units with thousands of parallel SIMD lanes, meaning all sorts of parallelization can be exploited with GPUs. Nvidia's CUDA allows programmers to easily express what they wish to be run on the host CPU and what to be run on the GPU.
The reading then goes into a brief tutorial of how to detect loop-level parallelism, and getting rid of dependent computations like recurrence. The paper also discusses other architectural changes that may affect the three main architectural changes and lists factors in these changes: memory bandwidth, compute bandwidth, cache benefits, gather-scatter, and synchronization.
The reading concludes with some fallacies and pitfalls about the previously mentioned architectures with statements, none of which aren't too hard to understand.
Overall, the reading did a great job of discussing the architectural support available these days for data-level parallelism.
Solid State Devices (SSDs): PerformanceModeling and Analysis of Flash-based Storage Devices
This paper discusses the black box modeling of SSD devices.
The work extends previous works by coming up with an improved model which performs linear regression, not only on read/write ratio, queue depth, request size, and randomness, but also on write size, read size, write stride, read stride, read randomness, and write randomness. The authors evaluate their models on synthetic I/O traces and 4 real traces, looking for latency, bandwidth, and I/O throughput. Overall, it shows that incorporating the new workload characteristics improve previous models of hard drives significantly and that SSDs are much easier to predict than HDs.
The paper is really great in that it shows a lot of data and a lot of graphs. The authors did a great job capturing the asymmetry between reads/writes to SSD devices through the extra characteristics; the workloads really do capture the different corner cases very well.
It would have been nice if the authors had come up with some use cases to motivate the different workload cases. This could have also expanded into different common setups that people have, like a journaling file-systems or running multiple applications with various I/O needs. It would have been interesting to look at the modeling of SSDs from this high level perspective.
The authors should have mentioned the page sizes for the SSD devices or said that they didn't know.
The steep increase in slope of random writes in Figure 6(c) is truly puzzling. I was under the impression that random and sequential accesses to SSDs are supposed to have similar performance. What could be causing this huge slope? Too bad, the authors didn't talk about it.
It seems, from the results, although prediction of reads are quite good, predictions of writes still need a lot of work. It would have been nice if the authors had included a section that described how future researchers may progress on this problem. What additional characteristics would help better model write performance; perhaps a counter that would count the number of updated pages in the current block. That way the model can be better prepared for block erases and updates.
The work extends previous works by coming up with an improved model which performs linear regression, not only on read/write ratio, queue depth, request size, and randomness, but also on write size, read size, write stride, read stride, read randomness, and write randomness. The authors evaluate their models on synthetic I/O traces and 4 real traces, looking for latency, bandwidth, and I/O throughput. Overall, it shows that incorporating the new workload characteristics improve previous models of hard drives significantly and that SSDs are much easier to predict than HDs.
The paper is really great in that it shows a lot of data and a lot of graphs. The authors did a great job capturing the asymmetry between reads/writes to SSD devices through the extra characteristics; the workloads really do capture the different corner cases very well.
It would have been nice if the authors had come up with some use cases to motivate the different workload cases. This could have also expanded into different common setups that people have, like a journaling file-systems or running multiple applications with various I/O needs. It would have been interesting to look at the modeling of SSDs from this high level perspective.
The authors should have mentioned the page sizes for the SSD devices or said that they didn't know.
The steep increase in slope of random writes in Figure 6(c) is truly puzzling. I was under the impression that random and sequential accesses to SSDs are supposed to have similar performance. What could be causing this huge slope? Too bad, the authors didn't talk about it.
It seems, from the results, although prediction of reads are quite good, predictions of writes still need a lot of work. It would have been nice if the authors had included a section that described how future researchers may progress on this problem. What additional characteristics would help better model write performance; perhaps a counter that would count the number of updated pages in the current block. That way the model can be better prepared for block erases and updates.
Multicore CPUs: Amdahl's Law in the Multicore Era
This short paper talks about the application of Amdahl's Law to multicore chips. Although Amdahl's Law was used to show that hardware manufacturers should concentrate on high-powered single core processors, now in the parallel age, there are many types of computations that are embarrassingly parallel and much more computation can be done on data due to these parallel machines.
The paper looks into 3 types of multicore chips: symmetric, asymmetric, and dynamic. In the symmetric model, all cores use the same amount of resources. With the asymmetric chip, there is a single large core which utilizes a large portion of the resources, and the rest are all symmetric. With the dynamic model, the authors imagine a chip in which the resources can be used for many symmetric cores or for one large serial-processing core. The results show that the asymmetric chip has much greater speedup than the symmetric chip, and the dynamic model obviously has the potential for incredible speedup.
I think this was a great simple way to adapt Amdahl's Law in the parallel age. The authors leverage this well to show what areas of research we should be concentrating on. However, there are a few problems I had with this paper.
The first thing I found questionable with this paper was the limited number of configurations the authors tried, both theoretically and practically. For example, it would have been interesting to explore, theoretically, the example, in which the asymmetric core uses multiple resources for the symmetric part of the chip. Also, practically, it's much more likely that, for the dynamic model, only a few of the cores can be combined to boost the sequential component, as opposed to all the resources.
A depressing point is that the majority of the lines shown in the graphs have f >= 0.9. Even with this small variation in the amount of parallelism, there are huge changes in the slope and height of these lines. A lot of problems exist in which f < 0.9, and these lines will all lie in the area between f = 0.9 and f = 0.5; the area where an increase in cores does not really seem to help speed up the computation.
The final thing I'd like to say is that the authors assume, for the dynamic model, that perf(r) = sqrt(r), just like the other scenarios. However, it is much more likely that perf(r) will be much less than sqrt(r) in practice. The authors should not have used the same function. This leads the graphs to very misleading.
The paper looks into 3 types of multicore chips: symmetric, asymmetric, and dynamic. In the symmetric model, all cores use the same amount of resources. With the asymmetric chip, there is a single large core which utilizes a large portion of the resources, and the rest are all symmetric. With the dynamic model, the authors imagine a chip in which the resources can be used for many symmetric cores or for one large serial-processing core. The results show that the asymmetric chip has much greater speedup than the symmetric chip, and the dynamic model obviously has the potential for incredible speedup.
I think this was a great simple way to adapt Amdahl's Law in the parallel age. The authors leverage this well to show what areas of research we should be concentrating on. However, there are a few problems I had with this paper.
The first thing I found questionable with this paper was the limited number of configurations the authors tried, both theoretically and practically. For example, it would have been interesting to explore, theoretically, the example, in which the asymmetric core uses multiple resources for the symmetric part of the chip. Also, practically, it's much more likely that, for the dynamic model, only a few of the cores can be combined to boost the sequential component, as opposed to all the resources.
A depressing point is that the majority of the lines shown in the graphs have f >= 0.9. Even with this small variation in the amount of parallelism, there are huge changes in the slope and height of these lines. A lot of problems exist in which f < 0.9, and these lines will all lie in the area between f = 0.9 and f = 0.5; the area where an increase in cores does not really seem to help speed up the computation.
The final thing I'd like to say is that the authors assume, for the dynamic model, that perf(r) = sqrt(r), just like the other scenarios. However, it is much more likely that perf(r) will be much less than sqrt(r) in practice. The authors should not have used the same function. This leads the graphs to very misleading.
Wednesday, September 7, 2011
Warehouse-Scale Computing: Entering the Teenage Decade (video presentation, Luiz Andre Barroso, Google)
This talk addresses how cloud computing is changing in today's time.
Barroso does a short introduction of explaining the usefulness of cloud computing, and how the pervasiveness of the Internet has helped cloud computing to really take off.
He discusses how flash may help the cloud providers, but with a few restrictions. The durability of flash lowers the cost-benefit ratio and the overall cost of flash is often not worth it. However, flash is fast and so it would be really great if it could be used in datacenters.
It seems servers now are becoming more energy-efficient, and the energy proportionality slope is getting closer to 1. This is great because most servers are not run at 100% and so we care a lot more about how energy-efficient servers are around while running 20%-50%.
Barroso also makes a very good comment in that although hardware (specifically I/O) has been increasing, a lot of the software stacks have not been updated for the new bandwidths available by the new hardware. I agree that a lot of the computing world has abandoned latency for throughput. Barroso makes a good point in saying now that hardware has improved, we should go back to look how we would redesign software to provide better latencies with the new hardware.
Barroso does a short introduction of explaining the usefulness of cloud computing, and how the pervasiveness of the Internet has helped cloud computing to really take off.
He discusses how flash may help the cloud providers, but with a few restrictions. The durability of flash lowers the cost-benefit ratio and the overall cost of flash is often not worth it. However, flash is fast and so it would be really great if it could be used in datacenters.
It seems servers now are becoming more energy-efficient, and the energy proportionality slope is getting closer to 1. This is great because most servers are not run at 100% and so we care a lot more about how energy-efficient servers are around while running 20%-50%.
Barroso also makes a very good comment in that although hardware (specifically I/O) has been increasing, a lot of the software stacks have not been updated for the new bandwidths available by the new hardware. I agree that a lot of the computing world has abandoned latency for throughput. Barroso makes a good point in saying now that hardware has improved, we should go back to look how we would redesign software to provide better latencies with the new hardware.
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (chapters 3, 4, 7)
This reading talks a lot about the costs of maintaining a datacenter.
Chapter 3 starts off by rationalizing the use of cheap servers over the more expensive servers in datacenters. They show that the increase in performance by using expensive servers don't justify the cost of these servers, and it is much more cost-efficient to buy more low-end servers. Of course, with servers that are too cheap cloud users will notice increases in response time, which is not desirable. Basically, it is a tradeoff.
There is one small thing that I wanted to mention about this chapter, is that the 100 microsecond estimate for network latencies seems really small. In a warehouse full of computers, it seems going from one end to the other end would take a lot longer. Also, the services would probably run on VMs and that would add to the increase in network latency, so I think the 100 microsecond estimate may be an underestimate.
Chapter 4 talks about the utilities required to run a datacenter. It talks about the various types of redundancies that are available for datacenters. The reading talks of how power is distributed to the datacenter using redundant PSUs. The reading shows how typical datacenters remove hot air from datacenters.
Chapter 7 talks about the availability of datacenters. Once you have thousands of very low-end machines, it is very likely that these machines will break. So, multiple topics surrounding this problem are discussed. Something, I found interesting was that it seems that number of restarts required among machines in the datacenter is quite high. Even the ones that are supposed to run well still require around 2-3 restarts a year. I was not aware that these machines had to be restarted so often. It's really interesting that even though some machines have to be restarted, services continue uninterrupted.
Chapter 3 starts off by rationalizing the use of cheap servers over the more expensive servers in datacenters. They show that the increase in performance by using expensive servers don't justify the cost of these servers, and it is much more cost-efficient to buy more low-end servers. Of course, with servers that are too cheap cloud users will notice increases in response time, which is not desirable. Basically, it is a tradeoff.
There is one small thing that I wanted to mention about this chapter, is that the 100 microsecond estimate for network latencies seems really small. In a warehouse full of computers, it seems going from one end to the other end would take a lot longer. Also, the services would probably run on VMs and that would add to the increase in network latency, so I think the 100 microsecond estimate may be an underestimate.
Chapter 4 talks about the utilities required to run a datacenter. It talks about the various types of redundancies that are available for datacenters. The reading talks of how power is distributed to the datacenter using redundant PSUs. The reading shows how typical datacenters remove hot air from datacenters.
Chapter 7 talks about the availability of datacenters. Once you have thousands of very low-end machines, it is very likely that these machines will break. So, multiple topics surrounding this problem are discussed. Something, I found interesting was that it seems that number of restarts required among machines in the datacenter is quite high. Even the ones that are supposed to run well still require around 2-3 restarts a year. I was not aware that these machines had to be restarted so often. It's really interesting that even though some machines have to be restarted, services continue uninterrupted.
Subscribe to:
Posts (Atom)