# [Calvin: Fast Distributed Transactions for Partitioned Database Systems (2012)](https://scholar.google.com/scholar?cluster=11098336506858442351)
Some distributed data stores do not support transactions at all (e.g. Dynamo,
MongoDB). Some restrict transactions to a single row (e.g. BigTable). Some
support ACID transactions, but only within a single partition (e.g. H-Store).
Calvin---when combined with an underlying storage system---is a distributed
database that supports distributed ACID transactions without incurring the
per-transaction overhead of protocols like Paxos or two-phase commit.

## Deterministic Database Systems
In a traditional distributed database, a node executes a transaction by
acquiring some locks, reading and writing data, and then participating in a
distributed commit protocol like two-phase commit. Because these distributed
commit protocols are slow, the node ends up holding its locks for a long time;
this period is called the transaction's **contention footprint**. As contention
footprints grow, more and more transactions block and the throughput of the
system drops.

Calvin shrinks contention footprints by having nodes agree to commit a
transaction *before* they acquire locks. Once they agree, they *must* execute
the transaction as planned. They cannot abort.

To understand how to prevent aborts, we first recall why protocols like
two-phase commit abort in the first place. Traditionally, there are two
reasons:

1. **Nondeterministic events** like a node failure.
2. **Deterministic events** like a transaction with an explicit abort.

Traditional commit protocols abort in the face of nondeterministic events, but
fundamentally they don't have to. To avoid aborting a transaction in the face
of a node failure, Calvin runs the same transaction on multiple nodes. If any
one of the nodes fails, the others are still alive to carry the transaction to
fruition. When the failed node recovers, it can simply recover from another
replica.

However, if we execute the same batch of transactions on multiple nodes, they
may execute the transactions in different orders. For example, one node might
serialize a transaction `T1` before another transaction `T2`, while some other
node might serialize `T2` before `T1`. To prevent replicas from diverging,
Calvin implements a deterministic concurrency control scheme which ensures that
all replicas serialize all transactions in the same way. In short, Calvin
predetermines a global order in which transactions should commit.

<!-- TODO(mwhittaker): Understand this part of the paper. -->
The paper also argues that deterministic events can be handled in a one-phase
protocol, though I don't understand the details.

## System Architecture
Calvin is not a stand-alone database. Rather, it is a piece of software that
you layer on top of an existing storage system. Calvin, along with a storage
system, has three main layers (sketched below):

1. The **sequencing layer** globally orders all transactions. Nodes execute
   transactions in a way that is equivalent to this global serial order.
2. The **scheduling layer** executes transactions.
3. The **storage layer** stores data.
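
As a rough illustration of this split, here is a minimal sketch of the three
layers as Python interfaces; the class and method names are assumptions made
for the sketch, not Calvin's actual API.

```python
# A minimal sketch of Calvin's three layers. All names are illustrative.
from abc import ABC, abstractmethod
from typing import List


class SequencingLayer(ABC):
    """Collects client transactions and emits a global order."""

    @abstractmethod
    def submit(self, txn: str) -> None: ...

    @abstractmethod
    def next_batch(self) -> List[str]: ...


class SchedulingLayer(ABC):
    """Executes transactions, equivalently to the global serial order."""

    @abstractmethod
    def execute(self, ordered_txns: List[str]) -> None: ...


class StorageLayer(ABC):
    """Any key-value style store can sit underneath Calvin."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, value: bytes) -> None: ...
```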

## Sequencing and Replication
Clients submit transactions to one of the many sequencing nodes in Calvin.
Calvin windows the transactions into 10 millisecond epochs. At the end of each
epoch, a sequencing node (asynchronously or synchronously) replicates its batch
of transactions. Then, it sends the relevant transactions to the other
partitions in its replica. Once a node has received all of an epoch's
transactions, it orders the batches by unique sequencing node id, which yields
the same global order everywhere.
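
A toy sketch of that deterministic merge, assuming each epoch's batches arrive
tagged with the id of the sequencer that produced them; the names below are
mine, not Calvin's.

```python
# Derive one epoch's global order from per-sequencer batches. Because every
# node sorts the same batches by the same sequencer ids, every node derives
# the same order. Names and structure are illustrative only.
from typing import Dict, List


def merge_epoch(batches: Dict[int, List[str]]) -> List[str]:
    """`batches` maps a sequencer's unique id to its batch for one epoch."""
    order: List[str] = []
    for sequencer_id in sorted(batches):
        order.extend(batches[sequencer_id])
    return order


# Any replica that sees these two batches computes the same order.
print(merge_epoch({2: ["T3"], 1: ["T1", "T2"]}))  # ['T1', 'T2', 'T3']
```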

Sequencing nodes can replicate transactions in one of two ways. First, a
sequencing node can immediately send transactions to other sequencing nodes and
replicate transactions asynchronously. This makes recovery very complex.
Second, sequencing nodes in the same **replication group** can run Paxos.

## Scheduling and Concurrency Control
Calvin transactions are written in C++, and each transaction must provide its
read and write set up front (more on this momentarily). Each scheduling node
acquires locks locally and runs two-phase locking with a minor variant:

- If transaction `A` is scheduled before transaction `B` in the global order,
  then `A` must acquire any locks that conflict with `B` before `B` acquires
  them (see the sketch below).
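
A toy sketch of that rule, treating every lock as exclusive for simplicity; the
class and method names are illustrative, not Calvin's lock manager.

```python
# Lock requests are issued in the global transaction order, so if A precedes
# B and they conflict, A enqueues (and therefore acquires) the conflicting
# lock first. This sketch treats all locks as exclusive.
from collections import defaultdict, deque
from typing import Deque, Dict, List


class OrderedLockManager:
    def __init__(self) -> None:
        # Per-record FIFO queue of waiting transaction ids.
        self.queues: Dict[str, Deque[int]] = defaultdict(deque)

    def request(self, txn_id: int, keys: List[str]) -> None:
        """Called once per transaction, in the global order."""
        for key in keys:
            self.queues[key].append(txn_id)

    def ready(self, txn_id: int, keys: List[str]) -> bool:
        """A transaction may run once it heads every queue it waits in."""
        return all(self.queues[key][0] == txn_id for key in keys)

    def release(self, txn_id: int, keys: List[str]) -> None:
        for key in keys:
            assert self.queues[key][0] == txn_id
            self.queues[key].popleft()


# T1 precedes T2 in the global order, so T2 waits for T1's lock on "x".
lm = OrderedLockManager()
lm.request(1, ["x"])
lm.request(2, ["x", "y"])
print(lm.ready(1, ["x"]))       # True
print(lm.ready(2, ["x", "y"]))  # False, until T1 releases "x"
lm.release(1, ["x"])
print(lm.ready(2, ["x", "y"]))  # True
```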

Transaction execution proceeds as follows (a sketch follows the list).

1. A node analyzes the read and write set of a transaction to determine which
   reads and writes are remote.
2. A node performs all local reads.
3. A node sends its local reads to the other nodes that need them.
4. A node collects remote reads sent by other nodes.
5. A node runs the transaction and performs local writes.
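
A runnable toy of these five phases, with the participants of one transaction
simulated as in-memory dictionaries; the function names, the key-to-node
mapping, and the "broadcast" dictionary are assumptions of the sketch, not
Calvin's execution engine.

```python
# Simulate one multi-partition transaction: each participant reads its local
# keys, "broadcasts" them, then every participant runs the same deterministic
# logic and applies only the writes it owns.
from typing import Callable, Dict, Set

Store = Dict[str, int]


def owner(key: str) -> str:
    """Toy partitioning: keys before "m" live on node0, the rest on node1."""
    return "node0" if key < "m" else "node1"


def run_transaction(
    stores: Dict[str, Store],
    read_set: Set[str],
    write_set: Set[str],
    logic: Callable[[Dict[str, int]], Dict[str, int]],
) -> None:
    participants = {owner(k) for k in read_set | write_set}
    # Phases 1-3: analyze the sets, perform local reads, send them to peers.
    broadcast = {
        node: {k: stores[node][k] for k in read_set if owner(k) == node}
        for node in participants
    }
    # Phases 4-5: collect every read, run the logic, apply only local writes.
    for node in participants:
        values = {k: v for reads in broadcast.values() for k, v in reads.items()}
        for k, v in logic(values).items():
            if owner(k) == node:
                stores[node][k] = v


stores = {"node0": {"a": 1}, "node1": {"z": 2}}
# A transaction that reads a and z and writes their sum back into z.
run_transaction(stores, {"a", "z"}, {"z"}, lambda r: {"z": r["a"] + r["z"]})
print(stores)  # {'node0': {'a': 1}, 'node1': {'z': 3}}
```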

Transactions must specify their read and write sets ahead of time, but the read
and write sets of some transactions---dubbed **dependent transactions**---depend
on the values they read. To support these transactions, Calvin implements
**optimistic lock location prediction** (OLLP). First, the transaction is run,
unreplicated, just to record its read and write set. Then, the transaction is
issued again with this read and write set. Once the transaction acquires its
locks, it checks that the read set has not changed; if it has, the transaction
is deterministically aborted and restarted with the updated sets.
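
A toy sketch of OLLP under those assumptions: a reconnaissance run predicts the
read and write sets, the transaction is reissued with them, and it restarts if
the prediction no longer holds. The names and the retry loop are illustrative,
not the paper's implementation.

```python
# OLLP sketch: predict lock locations with a cheap, lock-free reconnaissance
# run, then verify the prediction once the transaction holds its locks.
from typing import Callable, Dict, Set, Tuple

Store = Dict[str, int]
# A dependent transaction's reconnaissance step: given the database, report
# the keys it would read and write (its lock locations).
Recon = Callable[[Store], Tuple[Set[str], Set[str]]]


def ollp_execute(store: Store, recon: Recon, run: Callable[[Store], None]) -> None:
    while True:
        # Reconnaissance run: executed without locks, only to learn the sets.
        read_set, write_set = recon(store)
        # ... the transaction is now reissued with these sets and acquires
        # its locks in the global order; verify the prediction still holds ...
        if recon(store) == (read_set, write_set):
            run(store)  # prediction valid: execute for real
            return
        # The values that determined the sets changed in the meantime: retry.
```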

## Calvin with Disk-Based Storage
Deterministic scheduling means that transactions execute with less concurrency.
If transaction `A` precedes and conflicts with transaction `B`, then `B` has to
wait for `A` to finish before acquiring locks, fetching data from disk, and
then executing. Fetching data from disk while holding locks increases the
contention footprint of a transaction.

To overcome this, a sequencing node does not immediately send a transaction to
a scheduler if it knows the transaction will end up blocking. Instead, it
delays sending the transaction and notifies the scheduler to fetch all of the
needed pages into memory. To do this effectively, Calvin must (a) estimate disk
IO latencies and (b) record which pages have been fetched into memory. The
mechanisms for doing this are left as future work.
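
A toy sketch of that delay-and-prefetch idea; the latency estimate, the
`in_memory` check, and the `prefetch` call are stand-ins for mechanisms the
paper leaves as future work.

```python
# Hold a transaction back while its cold data is prefetched, and dispatch it
# only once the estimated disk IO should have completed. Illustrative only.
import heapq
import time
from typing import Callable, List, Set, Tuple


def dispatch_with_prefetch(
    txns: List[Tuple[str, Set[str]]],      # (transaction id, keys it touches)
    in_memory: Callable[[str], bool],      # is this key already cached?
    prefetch: Callable[[Set[str]], None],  # ask the scheduler to warm the cache
    estimated_io_s: float,                 # estimated disk fetch latency
    dispatch: Callable[[str], None],       # hand the transaction to a scheduler
) -> None:
    now = time.monotonic()
    delayed: List[Tuple[float, str]] = []  # (release time, transaction id)
    for txn_id, keys in txns:
        cold = {k for k in keys if not in_memory(k)}
        if not cold:
            dispatch(txn_id)               # data is hot: send immediately
        else:
            prefetch(cold)                 # warm the cache first
            heapq.heappush(delayed, (now + estimated_io_s, txn_id))
    # Release delayed transactions once their prefetches should have finished.
    while delayed:
        release_at, txn_id = heapq.heappop(delayed)
        time.sleep(max(0.0, release_at - time.monotonic()))
        dispatch(txn_id)
```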

## Checkpointing
Calvin supports three forms of checkpointing for recovery:

1. Naively, Calvin can freeze one replica and snapshot it, allowing the other
   replicas to continue processing.
2. Calvin implements a variant of the Zig-Zag algorithm in which a certain
   point in the global transaction order is marked for checkpointing. All
   transactions that execute after that point write to new versions of the
   data, while the old versions are checkpointed (see the sketch after this
   list).
3. If the underlying storage system supports multiple versions, Calvin can
   leverage that for checkpointing.
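
A rough sketch of the second scheme, keeping an "old" and a "new" version per
record once a checkpoint point is chosen; the class and method names are
illustrative, not the Zig-Zag algorithm as published.

```python
# After begin_checkpoint(), writes go to new versions while the untouched old
# versions can be snapshotted in the background. Illustrative only.
from typing import Dict, Optional


class CheckpointingStore:
    def __init__(self, data: Dict[str, int]) -> None:
        self.old: Dict[str, int] = dict(data)  # versions as of the checkpoint
        self.new: Dict[str, int] = {}          # writes made after the checkpoint
        self.checkpointing = False

    def begin_checkpoint(self) -> None:
        self.checkpointing = True

    def write(self, key: str, value: int) -> None:
        if self.checkpointing:
            self.new[key] = value              # preserve the old version
        else:
            self.old[key] = value

    def read(self, key: str) -> Optional[int]:
        if self.checkpointing and key in self.new:
            return self.new[key]
        return self.old.get(key)

    def capture_checkpoint(self) -> Dict[str, int]:
        snapshot = dict(self.old)              # the pre-checkpoint versions
        self.old.update(self.new)              # fold post-checkpoint writes in
        self.new.clear()
        self.checkpointing = False
        return snapshot
```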