Paper review: The Gamma Database Project
Tuesday, October 11, 2022
I'm putting up some of my notes here from reading the paper. If you read through to the end, there's dessert: a quibble I have with the paper.
My understanding is that this paper was very influential in its time. The architecture it describes is a shared-nothing architecture for distributed databases with very nice scaling properties. Notably, it has linear scale-up and speed-up. These are often related, but they're distinct and both are important to examine.
- Speed up here is measuring how much faster particular queries get if we add more hardware. Since Gamma shows linear speed up it means that if we go from 5 to 10 machines, we should see queries run in half the time.
- Scale up here is measuring how much data can be handled by the system with fixed query times. Since Gamma shows linear scale up, it means that if we double the amount of data stored, and we double the machines in the cluster, then we should keep the same speed of queries.
They're related, but not the same: If a query is only hitting one server, adding more servers won't speed it up, for example.
They presented three key techniques for achieving these properties on commodity hardware:
- Horizontally partitioning data across the cluster (with some nice resiliency properties)
- A good parallel join function based on hashing
- Effective scheduling of jobs onto machines to make use of all available hardware
The overall architecture is pretty typical for databases; we can refer back to the Architecture of a Database System for the overall architecture and just talk about differences.
The main differences come down to partitioning. They employ three different partitioning schemes:
Round-robin: this is the default, and distributes records uniformly across all disk drives. This means that any read must hit all disk drives.
Hashed: a hash function is applied to the input to determine which node gets the data.
Range partitioned: the operator may select which range of data goes to which machines.
Hashed or range partitioned data may hit a subset of machines for queries, which has benefits (potentially more parallel queries, could be faster, less overhead from distribution) and has some drawbacks (more limitation in speedup and scaleup).
They do say that defaulting to splitting all data across all machines by default was a mistake, and that it would be better to base the amount of distribution on some metric. I wasn't clear on why they felt it was a mistake, so I'd like to learn more there.
I went on a nice rabbit hole exploration during reading this paper.
They mentioned they were using commodity hardware, so I was curious what the hardware was and how it has changed to today. It used cutting edge hardware at the time (they complain about being beta testers for some of it, delaying their project), and today it can largely be beat by a single desktop computer. My workstation has nearly as much CPU cache as the cluster had main memory.
The paper was released in 1990 but the hardware was acquired in 1988, so that's 34 years ago. Hardware today should be about 2^17 times "better", or 131,000x, but this may be on multiple axes (both increases in performance and decreases in cost, etc.). (Yes, I know Moore's Law has ended. Don't @ me.)
The hardware they had was 30x Intel 80386 processors, which ran at 16 MHz (one core). (Incidentally, these were still manufactured until 2007, as they were used in embedded applications long after personal computers outgrew them.)
Unfortunately, I can't find much information on cost of this system, but a simliar system was about $300,000 (about $700,000 in 2022). I can buy a system with at least 100x the processing power for at least 1/1000 of the cost, which would be 100,000x improvement, which is right about on the mark for Moore's law!
Saving the beef for last. They had one comment that seemed like mostly an aside, but I feel is not well supported. They state:
"[...] the response time for each of the five selection queries remains almost constant. The slight increase in response time is due to the overhead of initiating a selection and store operation at each site."
I have a few issues with this claim:
- They don't provide a word on how much overhead these operations are (and I'm skeptical that they're high enough overhead to see this effect)
- The increases are not consistent! The times go up, then down, then up again, and it varies with respect to the query being run.
It's not even clear that the experimenters ran the queries multiple times and averaged the results. There's little information on how they gathered this data.
I think there's a much simpler explanation of this relatively minor variance: Simple probability.
In this case, they're increasing the number of nodes from 5 to 30. The operations require data from all nodes to return before they can be finalized. This means that the operation will be as slow as the slowest node. If you take the maximum of 5 random numbers in a range, and then you take the maximum of 30 random numbers in a range, you would generally expect the latter to be higher than the former—but not always!
At any rate, this doesn't really take away from what's an excellent paper to read.
If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can email my personal email. To get new posts and support my work, subscribe to the newsletter. There is also an RSS feed.
Want to become a better programmer? Join the Recurse Center!