Commit b3a8a32

Added Calvin.
1 parent 504c4c6 commit b3a8a32

3 files changed

+188
-0
lines changed

html/thomson2012calvin.html

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
<!DOCTYPE html>
2+
<html>
3+
<head>
4+
<title>Papers</title>
5+
<link href='../css/style.css' rel='stylesheet'>
6+
<meta name=viewport content="width=device-width, initial-scale=1">
7+
</head>
8+
9+
<body>
10+
<div id=header>
11+
<a href="../">Papers</a>
12+
</div>
13+
<div id="container">
14+
<h1 id="calvin-fast-distributed-transactions-for-partitioned-database-systems-2012"><a href="https://scholar.google.com/scholar?cluster=11098336506858442351">Calvin: Fast Distributed Transactions for Partitioned Database Systems (2012)</a></h1>
15+
<p>Some distributed data stores do not support transactions at all (e.g. Dynamo, MongoDB). Some restrict transactions to a single row (e.g. BigTable). Some support ACID transactions but only single-partition transactions (e.g. H-Store). Calvin---when combined with an underlying storage system---is a distributed database that supports distributed ACID transactions without incurring the overhead of protocols like Paxos or two-phase commit.</p>
16+
<h2 id="deterministic-database-systems">Deterministic Database Systems</h2>
17+
<p>In a traditional distributed database, a node executes a transaction by acquiring some locks, reading and writing data, and then participating in a distributed commit protocol like two-phase commit. Because these distributed commit protocols are slow, the node ends up holding locks for a long time. The period during which a transaction holds its locks is called its <strong>contention footprint</strong>. As contention footprints grow, more transactions block and the throughput of the system drops.</p>
18+
<p>Calvin shrinks contention footprints by having nodes agree to commit a transaction <em>before</em> they acquire locks. Once they agree, they <em>must</em> execute the transaction as planned. They cannot abort.</p>
19+
<p>To understand how to prevent aborts, we first recall why protocols like two-phase commit abort in the first place. Traditionally, there are two reasons:</p>
20+
<ol style="list-style-type: decimal">
21+
<li><strong>Nondeterministic events</strong> like a node failure.</li>
22+
<li><strong>Deterministic events</strong> like a transaction with an explicit abort.</li>
23+
</ol>
24+
<p>Traditional commit protocols abort in the face of nondeterministic events, but fundamentally don't have to. In order to avoid aborting a transaction in the face of node failure, Calvin runs the same transaction on multiple nodes. If any one of the nodes fails, the others are still alive to carry the transaction to fruition. When the failed node recovers, it can simply recover from another replica.</p>
25+
<p>However, if we execute the same batch of transactions on multiple nodes, it's possible they may execute in different orders. For example, one node might serialize a transaction <code>T1</code> before another transaction <code>T2</code> while some other node might serialize <code>T2</code> before <code>T1</code>. To prevent replicas from diverging, Calvin implements a deterministic concurrency control scheme which ensures that all replicas serialize all transactions in the same way. In short, Calvin predetermines a global order in which transactions should commit.</p>
26+
<!-- TODO(mwhittaker): Understand this part of the paper. -->
27+
<p>The paper also argues that deterministic events can be handled in a one-phase protocol, though I don't understand the details.</p>
28+
<h2 id="system-architecture">System Architecture</h2>
29+
<p>Calvin is not a stand-alone database. Rather, it is a piece of software that you layer onto an existing storage system. Calvin, along with a storage system, has three main layers:</p>
30+
<ol style="list-style-type: decimal">
31+
<li>The <strong>sequencing layer</strong> globally orders all transactions. Nodes execute transactions in a way that is equivalent to this global serial order.</li>
32+
<li>The <strong>scheduling layer</strong> executes transactions.</li>
33+
<li>The <strong>storage layer</strong> stores data.</li>
34+
</ol>
35+
<h2 id="sequencing-and-replication">Sequencing and Replication</h2>
36+
<p>Clients submit transactions to one of the many sequencing nodes in Calvin. Calvin windows the transactions into 10 millisecond epochs. At the end of each epoch, a sequencing node will (asynchronously or synchronously) replicate the batch of transactions. Then, it will send the relevant transactions to the other partitions in its replica. Once a sequencing node receives all the transactions for a given epoch, it orders them by unique sequencing node id.</p>
37+
<p>Sequencing nodes can replicate transactions in one of two ways. First, a sequencing node can immediately send transactions to other sequencing nodes and replicate transactions asynchronously. This makes recovery very complex. Second, sequencing nodes in the same <strong>replication group</strong> can run Paxos.</p>
38+
<h2 id="scheduling-and-concurrency-control">Scheduling and Concurrency Control</h2>
39+
<p>Calvin transactions are written in C++, and each transaction must provide its read and write set up front (more on this momentarily). Each scheduling node acquires locks locally and runs a minor variant of two-phase locking:</p>
40+
<ul>
41+
<li>If transaction <code>A</code> is scheduled before transaction <code>B</code> in the global order, then <code>A</code> must acquire any locks that conflict with <code>B</code> before <code>B</code> acquires them.</li>
42+
</ul>
43+
<p>Transaction execution proceeds as follows.</p>
44+
<ol style="list-style-type: decimal">
45+
<li>A node analyzes the read and write set of a transaction to determine which reads and writes are remote.</li>
46+
<li>A node performs all local reads.</li>
47+
<li>A node sends its local reads to the other nodes that need them.</li>
48+
<li>A node collects remote reads sent by other nodes.</li>
49+
<li>A node runs the transaction and performs local writes.</li>
50+
</ol>
51+
<p>Transactions must specify their read and write sets ahead of time, but the read and write sets of some transactions---dubbed <strong>dependent transactions</strong>---depend on values read. To support these transactions, Calvin implements <strong>optimistic lock location prediction</strong> (OLLP). First, the transaction is run unreplicated and its read and write sets are recorded. Then, the transaction is issued again with these read and write sets. Once the transaction acquires locks, it checks that the read set has not changed. If it has, the transaction is restarted.</p>
52+
<h2 id="calvin-with-disk-based-storage">Calvin with Disk-Based Storage</h2>
53+
<p>Deterministic scheduling means that transactions execute less concurrently. If transaction <code>A</code> precedes and conflicts with transaction <code>B</code>, then <code>B</code> has to wait for <code>A</code> to finish before acquiring locks, fetching data from disk, and then executing. Fetching data from disks while holding locks increases the contention footprint of the transaction.</p>
54+
<p>To overcome this, a sequencing node does not immediately send a transaction to a scheduler if it knows the transaction will end up blocking. Instead, it delays sending the transaction and notifies the scheduler to fetch all the needed pages into memory. To do this effectively, Calvin must (a) estimate disk IO latencies and (b) record which pages have been fetched into memory. The mechanisms to do this are future work.</p>
55+
<h2 id="checkpointing">Checkpointing</h2>
56+
<p>Calvin supports three forms of checkpointing for recovery:</p>
57+
<ol style="list-style-type: decimal">
58+
<li>Naively, Calvin can freeze one replica and snapshot it, allowing the other replicas to continue processing.</li>
59+
<li>Calvin implements a variant of the Zig-Zag algorithm in which a certain point in the global transaction order is marked for checkpoint. All transactions that execute after the point write to new versions of the data. The old versions are checkpointed.</li>
60+
<li>If the underlying storage system supports multiple versions, Calvin can leverage that for checkpointing.</li>
61+
</ol>
62+
</div>
63+
64+
<script type="text/javascript" src="../js/mathjax_config.js"></script>
65+
</body>
66+
</html>

index.html

Lines changed: 1 addition & 0 deletions
@@ -72,6 +72,7 @@
72 72
<li><a href="html/lloyd2011don.html">Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS <span class="year">(2011)</span></a></li>
73 73
<li><a href="html/hindman2011mesos.html">Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center <span class="year">(2011)</span></a></li>
74 74
<li><a href="html/zhou2011tap.html">TAP: Time-aware Provenance for Distributed Systems <span class="year">(2011)</span></a></li>
75+
<li><a href="html/thomson2012calvin.html">Calvin: Fast Distributed Transactions for Partitioned Database Systems <span class="year">(2012)</span></a></li>
75 76
<li><a href="html/kohler2012declarative.html">Declarative Datalog Debugging for Mere Mortals <span class="year">(2012)</span></a></li>
76 77
<li><a href="html/zhou2012distributed.html">Distributed Time-aware Provenance <span class="year">(2012)</span></a></li>
77 78
<li><a href="html/conway2012logic.html">Logic and Lattices for Distributed Programming <span class="year">(2012)</span></a></li>

papers/thomson2012calvin.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
1+
# [Calvin: Fast Distributed Transactions for Partitioned Database Systems (2012)](https://scholar.google.com/scholar?cluster=11098336506858442351)
2+
Some distributed data stores do not support transactions at all (e.g. Dynamo,
3+
MongoDB). Some restrict transactions to a single row (e.g. BigTable). Some
4+
support ACID transactions but only single-partition transactions (e.g.
5+
H-Store). Calvin---when combined with an underlying storage system---is a
6+
distributed database that supports distributed ACID transactions without
7+
incurring the overhead of protocols like Paxos or two-phase commit.
8+
9+
## Deterministic Database Systems
10+
In a traditional distributed database, a node executes a transaction by
11+
acquiring some locks, reading and writing data, and then participating in a
12+
distributed commit protocol like two-phase commit. Because these distributed
13+
commit protocols are slow, the node ends up holding locks for a long time. The
14+
period during which a transaction holds its locks is called its **contention
15+
footprint**. As contention footprints grow, more transactions block and the
16+
throughput of the system drops.
17+
18+
Calvin shrinks contention footprints by having nodes agree to commit a
19+
transaction *before* they acquire locks. Once they agree, they *must* execute
20+
the transaction as planned. They cannot abort.
21+
22+
To understand how to prevent aborts, we first recall why protocols like
23+
two-phase commit abort in the first place. Traditionally, there are two
24+
reasons:
25+
26+
1. **Nondeterministic events** like a node failure.
27+
2. **Deterministic events** like a transaction with an explicit abort.
28+
29+
Traditional commit protocols abort in the face of nondeterministic events, but
30+
fundamentally don't have to. In order to avoid aborting a transaction in the
31+
face of node failure, Calvin runs the same transaction on multiple nodes. If
32+
any one of the nodes fails, the others are still alive to carry the transaction
33+
to fruition. When the failed node recovers, it can simply recover from another
34+
replica.
35+
36+
However, if we execute the same batch of transactions on multiple nodes, it's
37+
possible they may execute in different orders. For example, one node might
38+
serialize a transaction `T1` before another transaction `T2` while some other
39+
node might serialize `T2` before `T1`. To prevent replicas from diverging,
40+
Calvin implements a deterministic concurrency control scheme which ensures that
41+
all replicas serialize all transactions in the same way. In short, Calvin
42+
predetermines a global order in which transactions should commit.
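
As a minimal sketch of why a predetermined order matters, the Python snippet
below applies two hypothetical transactions to a toy single-key store. The
transaction bodies (`t1`, `t2`), the `run` helper, and the store layout are
illustrative assumptions and are not taken from the paper.

```python
def t1(db):
    """Hypothetical transaction T1: double the value of x."""
    db["x"] = db["x"] * 2


def t2(db):
    """Hypothetical transaction T2: add 10 to x."""
    db["x"] = db["x"] + 10


def run(order, initial):
    """Apply the transactions serially, in the given order, to a private copy."""
    db = dict(initial)
    for txn in order:
        txn(db)
    return db


initial = {"x": 1}

# Without an agreed order, two replicas may serialize differently and diverge.
replica_a = run([t1, t2], initial)  # (1 * 2) + 10
replica_b = run([t2, t1], initial)  # (1 + 10) * 2
print(replica_a, replica_b)

# With one predetermined global order, every replica reaches the same state.
global_order = [t1, t2]
replicas = [run(global_order, initial) for _ in range(3)]
print(all(r == replicas[0] for r in replicas))
```

Both orders are valid serial schedules on their own. The replicas only agree
because they all follow the same one.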
43+
44+
<!-- TODO(mwhittaker): Understand this part of the paper. -->
45+
The paper also argues that deterministic events can be handled in a one-phase
46+
protocol, though I don't understand the details.
47+
48+
## System Architecture
49+
Calvin is not a stand-alone database. Rather, it is a piece of software that
50+
you layer onto an existing storage system. Calvin, along with a storage
51+
system, has three main layers:
52+
53+
1. The **sequencing layer** globally orders all transactions. Nodes execute
54+
transactions in a way that is equivalent to this global serial order.
55+
2. The **scheduling layer** executes transactions.
56+
3. The **storage layer** stores data.
57+
58+
## Sequencing and Replication
59+
Clients submit transactions to one of the many sequencing nodes in Calvin.
60+
Calvin windows the transactions into 10 millisecond epochs. At the end of each
61+
epoch, a sequencing node will (asynchronously or synchronously) replicate the
62+
batch of transactions. Then, it will send the relevant transactions to the
63+
other partitions in its replica. Once a sequencing node receives all the
64+
transactions for a given epoch, it orders them by unique sequencing node id.
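
The batching and ordering step can be sketched roughly as follows. This is a
simplified, single-process illustration: the function names, the
`(arrival_ms, txn)` request format, and the integer sequencer ids are
assumptions made for the example, and replication is left out entirely.

```python
from collections import defaultdict

EPOCH_MS = 10  # epoch length mentioned in the text


def batch_by_epoch(requests):
    """Group one sequencer's requests, given as (arrival_ms, txn), by epoch."""
    batches = defaultdict(list)
    for arrival_ms, txn in requests:
        batches[arrival_ms // EPOCH_MS].append(txn)
    return dict(batches)


def global_order(batches_by_sequencer, epoch):
    """Concatenate every sequencer's batch for one epoch in sequencer-id order."""
    ordered = []
    for sequencer_id in sorted(batches_by_sequencer):
        ordered.extend(batches_by_sequencer[sequencer_id].get(epoch, []))
    return ordered


# Two hypothetical sequencers, each with its own stream of client requests.
batches = {
    2: batch_by_epoch([(1, "c"), (12, "d")]),
    1: batch_by_epoch([(3, "a"), (9, "b"), (15, "e")]),
}
print(global_order(batches, 0))  # ['a', 'b', 'c']
print(global_order(batches, 1))  # ['e', 'd']
```

Because the order depends only on the epoch number and the sequencer id, any
node that has all the batches for an epoch derives the same order without
further coordination.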
65+
66+
Sequencing nodes can replicate transactions in one of two ways. First, a
67+
sequencing node can immediately send transactions to other sequencing nodes and
68+
replicate transactions asynchronously. This makes recovery very complex.
69+
Second, sequencing nodes in the same **replication group** can run Paxos.
70+
71+
## Scheduling and Concurrency Control
72+
Calvin transactions are written in C++, and each transaction must provide its
73+
read and write set up front (more on this momentarily). Each scheduling node
74+
acquires locks locally and runs a minor variant of two-phase locking:
75+
76+
- If transaction `A` is scheduled before transaction `B` in the global order,
77+
then `A` must acquire any locks that conflict with `B` before `B` acquires
78+
them.
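
One way to picture this rule is a lock manager that queues lock requests per
key in global order, so an earlier transaction always sits ahead of a later
one. The sketch below is a simplification with assumed names
(`DeterministicLockManager` and its methods are hypothetical). It models only
exclusive locks and assumes requests are issued by a single thread.

```python
from collections import defaultdict, deque


class DeterministicLockManager:
    """Grants exclusive per-key locks strictly in global transaction order."""

    def __init__(self):
        # key -> transaction ids queued for that key, earliest first
        self.queues = defaultdict(deque)

    def request_all(self, txn_id, keys):
        """Must be called once per transaction, in global order."""
        for key in sorted(set(keys)):
            self.queues[key].append(txn_id)

    def holds_all(self, txn_id, keys):
        """A transaction may run once it is at the head of every queue."""
        for key in set(keys):
            queue = self.queues.get(key)
            if not queue or queue[0] != txn_id:
                return False
        return True

    def release_all(self, txn_id, keys):
        """Release after the transaction finishes, unblocking later ones."""
        for key in set(keys):
            assert self.queues[key][0] == txn_id
            self.queues[key].popleft()


locks = DeterministicLockManager()
locks.request_all("A", ["x", "y"])  # A precedes B in the global order
locks.request_all("B", ["y", "z"])
print(locks.holds_all("A", ["x", "y"]))  # True
print(locks.holds_all("B", ["y", "z"]))  # False: A is ahead of B on y
locks.release_all("A", ["x", "y"])
print(locks.holds_all("B", ["y", "z"]))  # True
```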
79+
80+
Transaction execution proceeds as follows.
81+
82+
1. A node analyzes the read and write set of a transaction to determine which
83+
reads and writes are remote.
84+
2. A node performs all local reads.
85+
3. A node sends its local reads to the other nodes that need them.
86+
4. A node collects remote reads sent by other nodes.
87+
5. A node runs the transaction and performs local writes.
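
The five steps above can be simulated in a few lines. The sketch below assumes
an in-memory dictionary per node, an `owner` map from key to node id, and
transaction logic expressed as a pure function from read values to written
values. All of these names are hypothetical. It also assumes that only nodes
owning a written key run the logic, and it delivers messages by writing into a
local `inbox` rather than over a network.

```python
def run_transaction(logic, read_set, write_set, nodes, owner):
    """nodes: {node_id: local data dict}; owner: {key: node_id}."""
    # Nodes that own a written key run the logic, so they need every read.
    active = {owner[key] for key in write_set}
    inbox = {node_id: {} for node_id in active}

    for node_id, data in nodes.items():
        # 1. Analyze the read set to find which keys are local to this node.
        local_keys = [key for key in read_set if owner[key] == node_id]
        # 2. Perform all local reads.
        local_reads = {key: data[key] for key in local_keys}
        # 3. Send local reads to the nodes that need them
        #    (a node also "sends" to itself here, for simplicity).
        for dest in active:
            inbox[dest].update(local_reads)

    for node_id in active:
        # 4. Collect the reads sent by other nodes.
        reads = inbox[node_id]
        # 5. Run the transaction logic and apply only the local writes.
        writes = logic(reads)
        for key, value in writes.items():
            if owner[key] == node_id:
                nodes[node_id][key] = value


def transfer(reads):
    """Hypothetical transaction: move 10 units from alice to bob."""
    return {"alice": reads["alice"] - 10, "bob": reads["bob"] + 10}


nodes = {0: {"alice": 100}, 1: {"bob": 50}}
owner = {"alice": 0, "bob": 1}
run_transaction(transfer, ["alice", "bob"], ["alice", "bob"], nodes, owner)
print(nodes)  # {0: {'alice': 90}, 1: {'bob': 60}}
```

Each participating node runs the same deterministic logic on the same inputs,
so no commit protocol is needed to agree on the outcome.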
88+
89+
Transactions must specify their read and write sets ahead of time, but the read
90+
and write sets of some transactions---dubbed **dependent transactions**---depend
91+
on values read. To support these transactions, Calvin implements **optimistic
92+
lock location prediction** (OLLP). First, the transaction is run unreplicated
93+
and its read and write sets are recorded. Then, the transaction is issued again
94+
with these read and write sets. Once the transaction acquires locks, it checks
95+
that the read set has not changed. If it has, the transaction is restarted.
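
A rough sketch of OLLP for one kind of dependent transaction, a lookup through
a secondary index, is shown below. The key naming scheme (`index:<name>`), the
function names, and the `interfere` hook that simulates a conflicting update
are assumptions made for illustration. Locking itself is not modeled.

```python
def reconnoiter(db, name):
    """Reconnaissance pass: an unreplicated read that predicts the sets."""
    record_key = db["index:" + name]
    read_set = {"index:" + name, record_key}
    write_set = {record_key}
    return read_set, write_set, record_key


def execute_with_check(db, name, predicted_record_key):
    """Runs once locks on the predicted sets are held; rechecks the prediction."""
    if db["index:" + name] != predicted_record_key:
        return False  # the prediction is stale, so the caller must restart
    db[predicted_record_key] += 1
    return True


def run_ollp(db, name, interfere=None):
    """Retry until the predicted sets are still valid; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        _read_set, _write_set, record_key = reconnoiter(db, name)
        if interfere is not None and attempts == 1:
            interfere(db)  # simulates a transaction ordered in between
        if execute_with_check(db, name, record_key):
            return attempts


db = {"index:calvin": "rec1", "rec1": 0, "rec2": 0}
print(run_ollp(db, "calvin"))  # 1


def move_index(store):
    store["index:calvin"] = "rec2"


print(run_ollp(db, "calvin", interfere=move_index))  # 2
```

The check is deterministic because it runs at a fixed point in the global
order, so every replica reaches the same decision to proceed or restart.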
96+
97+
## Calvin with Disk-Based Storage
98+
Deterministic scheduling means that transactions execute less concurrently. If
99+
transaction `A` precedes and conflicts with transaction `B`, then `B` has to
100+
wait for `A` to finish before acquiring locks, fetching data from disk, and
101+
then executing. Fetching data from disks while holding locks increases the
102+
contention footprint of the transaction.
103+
104+
To overcome this, a sequencing node does not immediately send a transaction to
105+
a scheduler if it knows the transaction will end up blocking. Instead, it
106+
delays sending the transaction and notifies the scheduler to fetch all the
107+
needed pages into memory. To do this effectively, Calvin must (a) estimate disk
108+
IO latencies and (b) record which pages have been fetched into memory. The
109+
mechanisms to do this are future work.
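
Since the text leaves the mechanisms as future work, the following is only a
speculative sketch of the idea, not a description of Calvin. The `Sequencer`
class, the fixed `ASSUMED_DISK_LATENCY_MS` estimate, and the `in_memory` set
are all hypothetical stand-ins for requirements (a) and (b).

```python
ASSUMED_DISK_LATENCY_MS = 10  # (a) a fixed, made-up estimate of disk IO latency


class Sequencer:
    def __init__(self):
        self.in_memory = set()  # (b) keys believed to be resident in memory
        self.delayed = []       # (release_time_ms, txn_id, cold_keys)

    def submit(self, txn_id, keys, now_ms):
        """Forward a transaction, or delay it and report the keys to prefetch."""
        cold = sorted(key for key in keys if key not in self.in_memory)
        if not cold:
            return "forward", []
        self.delayed.append((now_ms + ASSUMED_DISK_LATENCY_MS, txn_id, cold))
        return "delay", cold

    def release_ready(self, now_ms):
        """Forward delayed transactions whose prefetch should have finished."""
        ready = [entry for entry in self.delayed if entry[0] <= now_ms]
        self.delayed = [entry for entry in self.delayed if entry[0] > now_ms]
        for _, _, cold in ready:
            self.in_memory.update(cold)
        return [txn_id for _, txn_id, _ in ready]


sequencer = Sequencer()
sequencer.in_memory.add("a")
print(sequencer.submit("T1", ["a"], now_ms=0))       # ('forward', [])
print(sequencer.submit("T2", ["a", "b"], now_ms=0))  # ('delay', ['b'])
print(sequencer.release_ready(now_ms=10))            # ['T2']
```

An estimate that is too low sends transactions that still block on disk while
holding locks. One that is too high adds latency without improving throughput.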
110+
111+
## Checkpointing
112+
Calvin supports three forms of checkpointing for recovery:
113+
114+
1. Naively, Calvin can freeze one replica and snapshot it, allowing the other
115+
replicas to continue processing.
116+
2. Calvin implements a variant of the Zig-Zag algorithm in which a certain
117+
point in the global transaction order is marked for checkpoint. All
118+
transactions that execute after the point write to new versions of the data.
119+
The old versions are checkpointed.
120+
3. If the underlying storage system supports multiple versions, Calvin can
121+
leverage that for checkpointing.
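
The second option can be illustrated with a store that keeps two versions per
key while a checkpoint is in progress. This is a simplified sketch of the idea
described above, not the Zig-Zag algorithm itself or Calvin's implementation,
and the `CheckpointableStore` class and its method names are hypothetical.

```python
class CheckpointableStore:
    def __init__(self):
        self.stable = {}  # versions written before the checkpoint point
        self.after = {}   # versions written after the point, during a checkpoint
        self.checkpointing = False

    def begin_checkpoint(self):
        """Mark the current point in the global order as the checkpoint point."""
        self.checkpointing = True

    def write(self, key, value):
        # After the point, writes go to new versions and leave old ones intact.
        target = self.after if self.checkpointing else self.stable
        target[key] = value

    def read(self, key):
        # Transactions always see the newest version.
        if key in self.after:
            return self.after[key]
        return self.stable[key]

    def snapshot(self):
        """The old versions form a consistent checkpoint."""
        return dict(self.stable)

    def end_checkpoint(self):
        """Fold the new versions back into the stable copy."""
        self.stable.update(self.after)
        self.after.clear()
        self.checkpointing = False


store = CheckpointableStore()
store.write("x", 1)
store.begin_checkpoint()
store.write("x", 2)      # executes after the checkpoint point
print(store.snapshot())  # {'x': 1}
print(store.read("x"))   # 2
store.end_checkpoint()
print(store.stable)      # {'x': 2}
```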
