
Commit 5832883

Added LEAP paper.
1 parent 83622ea commit 5832883

File tree

3 files changed: +173 -0 lines changed


html/lin2016towards.html

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
<!DOCTYPE html>
<html>
<head>
<title>Papers</title>
<link href='../style.css' rel='stylesheet'>
<meta name=viewport content="width=device-width, initial-scale=1">
</head>

<body>
<div id="container">
<h1 id="towards-a-non-2pc-transaction-management-in-distributed-database-systems-2016"><a href="https://scholar.google.com/scholar?cluster=9359440394568724083">Towards a Non-2PC Transaction Management in Distributed Database Systems (2016)</a></h1>
<p>For a traditional single-node database, the data that a transaction reads and writes is all on a single machine. For a distributed OLTP database, there are two types of transactions:</p>
<ol style="list-style-type: decimal">
<li><strong>Local transactions</strong> are transactions that read and write data on a single machine, much like traditional transactions. A distributed database can process a local transaction as efficiently as a traditional single-node database can.</li>
<li><strong>Distributed transactions</strong> are transactions that read and write data that is spread across multiple machines. Typically, distributed databases use two-phase commit (2PC) to commit or abort distributed transactions. Because 2PC requires multiple rounds of communication, distributed databases process distributed transactions less efficiently than local transactions.</li>
</ol>
<p>This paper presents an alternative to 2PC, dubbed <strong>Localizing Executions via Aggressive Placement of data (LEAP)</strong>, which tries to avoid the communication overheads of 2PC by aggressively moving all the data a distributed transaction reads and writes onto a single machine, effectively turning the distributed transaction into a local transaction.</p>
<h2 id="leap">LEAP</h2>
<p>LEAP is based on the following assumptions and observations:</p>
<ul>
<li>Transactions in an OLTP workload don’t read or write many tuples.</li>
<li>Tuples in an OLTP database are typically very small.</li>
<li>Multiple transactions issued one after another may access the same data again and again.</li>
<li>As more advanced network technology becomes available (e.g. RDMA), the cost of moving data becomes smaller and smaller.</li>
</ul>
<p>With LEAP, tuples are horizontally partitioned across a set of nodes, and each tuple is stored exactly once. Each node has two data structures:</p>
<ul>
<li>a <strong>data table</strong> which stores tuples, and</li>
<li>a horizontally partitioned <strong>owner table</strong> key-value store which stores ownership information.</li>
</ul>
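<p>A minimal Python sketch of this per-node state, assuming hash partitioning of the owner table; the field names are illustrative, not taken from the paper.</p>
<pre><code>NUM_NODES = 4

def new_node(node_id):
    return {
        "id": node_id,
        "data": {},    # data table: k -> v, only tuples this node currently owns
        "owners": {},  # owner table partition: k -> id of the node that owns k
    }

def partitioner_of(k):
    """Which node holds the ownership entry (k, o) for key k."""
    return hash(k) % NUM_NODES
</code></pre>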
<p>Consider a tuple <code>d = (k, v)</code> with primary key <code>k</code> and value <code>v</code>. The owner table contains an entry <code>(k, o)</code> indicating that node <code>o</code> owns the tuple with key <code>k</code>. The node <code>o</code> contains a <code>(k, v)</code> entry in its data table. The owner table key-value store is partitioned across nodes using an arbitrary partitioning scheme (e.g. hash-based, range-based).</p>
<p>When a node initiates a transaction, it requests ownership of every tuple it reads and writes. This migrates the tuples to the initiating node and updates the ownership information to reflect the ownership transfer. Here’s how the ownership transfer protocol works (a sketch in code follows the list). For a given tuple <code>d = (k, v)</code>, the <strong>requester</strong> is the node requesting ownership of <code>d</code>, the <strong>partitioner</strong> is the node with ownership information <code>(k, o)</code>, and the <strong>owner</strong> is the node that stores <code>d</code>.</p>
<ul>
<li>First, the requester sends an <strong>owner request</strong> with key <code>k</code> to the partitioner.</li>
<li>Then, the partitioner looks up the owner of the tuple with key <code>k</code> in its owner table and sends a <strong>transfer request</strong> to the owner.</li>
<li>The owner retrieves the value of the tuple and sends it in a <strong>transfer response</strong> back to the requester. It also deletes its copy of the tuple.</li>
<li>Finally, the requester sends an <strong>inform</strong> message to the partitioner informing it that the ownership transfer is complete. The partitioner updates its owner table to reflect the new owner.</li>
</ul>
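<p>Here is a minimal Python sketch of the four-step transfer, reusing the node-dictionary layout above and simulating the messages with direct function calls; real LEAP exchanges asynchronous network messages and must survive message loss (see the appendix of the paper).</p>
<pre><code>def request_ownership(requester, partitioner, owner, k):
    """Each argument is a node dict with "id", "data", and "owners" fields."""
    # 1. Owner request: requester -> partitioner, carrying the key k.
    # 2. Transfer request: partitioner -> owner, looked up in the owner table.
    assert partitioner["owners"][k] == owner["id"]
    # 3. Transfer response: owner -> requester; the owner deletes its copy.
    v = owner["data"].pop(k)
    requester["data"][k] = v
    # 4. Inform: requester -> partitioner, which records the new owner.
    partitioner["owners"][k] = requester["id"]
    return v

# Node 2 owns tuple ("x", 42), node 1 is its partitioner, node 0 requests it.
n0 = {"id": 0, "data": {}, "owners": {}}
n1 = {"id": 1, "data": {}, "owners": {"x": 2}}
n2 = {"id": 2, "data": {"x": 42}, "owners": {}}
request_ownership(n0, n1, n2, "x")
assert n0["data"] == {"x": 42} and n1["owners"]["x"] == 0
</code></pre>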
<p>Also note that</p>
<ul>
<li>if the requester, partitioner, and owner are all different nodes, then this scheme requires <strong>4 messages</strong>,</li>
<li>if the partitioner and owner are the same, then this scheme requires <strong>3 messages</strong>, and</li>
<li>if the requester and partitioner are the same, then this scheme requires <strong>2 messages</strong>.</li>
</ul>
<p>If the transfer response is dropped after the owner has deleted its copy of the tuple, the data is lost. See the appendix for information on how to make this ownership transfer fault tolerant. Also see the paper for a theoretical comparison of 2PC and LEAP.</p>
<h2 id="leap-based-oltp-engine">LEAP-Based OLTP Engine</h2>
<p>L-Store is a distributed OLTP database based on H-Store which uses LEAP to manage transactions. Transactions acquire read/write locks on individual tuples and use strict two-phase locking. Transactions are assigned globally unique identifiers, and deadlock prevention is implemented with a wait-die scheme where lower-timestamped transactions have higher priority. That is, higher-priority transactions wait on lower-priority transactions, but lower-priority transactions abort rather than wait on higher-priority transactions.</p>
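<p>A minimal Python sketch of this wait-die rule, where a lower timestamp means higher priority; how L-Store wires the decision into its lock manager is not shown here, and the function name is made up for illustration.</p>
<pre><code>def on_lock_conflict(requester_ts, holder_ts):
    """Decide what the requesting transaction does when it hits a held lock."""
    if requester_ts &lt; holder_ts:
        return "wait"   # older (higher-priority) transaction waits for the holder
    else:
        return "abort"  # younger (lower-priority) transaction dies and retries later
</code></pre>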
<p>Concurrent local transactions are processed as usual; what’s interesting is how concurrent transfer requests are handled. Imagine a transaction is requesting ownership of a tuple on another node.</p>
<ul>
<li>First, the requester creates a <strong>request lock</strong> locally indicating that it is currently trying to request ownership of the tuple. It then sends an owner request to the partitioner.</li>
<li>The partitioner may receive multiple concurrent owner requests. It processes them serially using the wait-die scheme. As an optimization, it processes requests in decreasing timestamp order to avoid aborts whenever possible. It then forwards a transfer request to the owner.</li>
<li>If the owner is currently accessing the tuple being requested, it again uses a wait-die scheme to access the tuple before sending it back to the requester.</li>
<li>Finally, the requester (now the tuple’s owner) changes the request lock into a normal data lock and continues processing.</li>
</ul>
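<p>A hypothetical sketch of the requester-side lock transitions described above; the lock table layout and the <code>fetch_ownership</code> callback are assumptions for illustration, not L-Store’s actual interfaces.</p>
<pre><code>def access_remote_tuple(lock_table, k, mode, fetch_ownership):
    """mode is "read" or "write"; fetch_ownership(k) runs the transfer protocol."""
    lock_table[k] = "request"  # request lock: the ownership transfer is in flight
    v = fetch_ownership(k)     # may wait or abort under the wait-die scheme
    lock_table[k] = mode       # on success, becomes a normal read/write data lock
    return v
</code></pre>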
<p>If a transaction cannot successfully get ownership of a tuple, it aborts. L-Store also uses logging and checkpointing for fault tolerance (see paper for details).</p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  tex2jax: {
    inlineMath: [['$','$'], ['\\(','\\)']],
    skipTags: ['script', 'noscript', 'style', 'textarea'],
  },
  messageStyle: "none",
});
</script>
</div>
</body>
</html>

index.html

Lines changed: 1 addition & 0 deletions
@@ -102,6 +102,7 @@ <h1 id="indextitle">Papers</h1>
 <li><a href="html/halevy2016goods.html">Goods: Organizing Google's Datasets <span class="year">(2016)</span></a></li>
 <li><a href="html/chen2016realtime.html">Realtime Data Processing at Facebook <span class="year">(2016)</span></a></li>
 <li><a href="html/crooks2016tardis.html">TARDiS: A Branch-and-Merge Approach To Weak Consistency <span class="year">(2016)</span></a></li>
+<li><a href="html/lin2016towards.html">Towards a Non-2PC Transaction Management in Distributed Database Systems <span class="year">(2016)</span></a></li>
 </ol>
 </div>
 </body>

papers/lin2016towards.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
# [Towards a Non-2PC Transaction Management in Distributed Database Systems (2016)](https://scholar.google.com/scholar?cluster=9359440394568724083)
For a traditional single-node database, the data that a transaction reads and
writes is all on a single machine. For a distributed OLTP database, there are
two types of transactions:

1. **Local transactions** are transactions that read and write data on a single
   machine, much like traditional transactions. A distributed database can
   process a local transaction as efficiently as a traditional single-node
   database can.
2. **Distributed transactions** are transactions that read and write data
   that is spread across multiple machines. Typically, distributed databases
   use two-phase commit (2PC) to commit or abort distributed transactions.
   Because 2PC requires multiple rounds of communication, distributed
   databases process distributed transactions less efficiently than local
   transactions.

This paper presents an alternative to 2PC, dubbed **Localizing Executions via
Aggressive Placement of data (LEAP)**, which tries to avoid the communication
overheads of 2PC by aggressively moving all the data a distributed transaction
reads and writes onto a single machine, effectively turning the distributed
transaction into a local transaction.

## LEAP
LEAP is based on the following assumptions and observations:

- Transactions in an OLTP workload don't read or write many tuples.
- Tuples in an OLTP database are typically very small.
- Multiple transactions issued one after another may access the same data again
  and again.
- As more advanced network technology becomes available (e.g. RDMA), the cost
  of moving data becomes smaller and smaller.

With LEAP, tuples are horizontally partitioned across a set of nodes, and each
tuple is stored exactly once. Each node has two data structures:

- a **data table** which stores tuples, and
- a horizontally partitioned **owner table** key-value store which stores
  ownership information.
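
A minimal Python sketch of this per-node state, assuming hash partitioning of
the owner table (the field names are illustrative, not taken from the paper):

```python
NUM_NODES = 4

def new_node(node_id):
    return {
        "id": node_id,
        "data": {},    # data table: k -> v, only tuples this node currently owns
        "owners": {},  # owner table partition: k -> id of the node that owns k
    }

def partitioner_of(k):
    """Which node holds the ownership entry (k, o) for key k."""
    return hash(k) % NUM_NODES
```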

Consider a tuple `d = (k, v)` with primary key `k` and value `v`. The owner
table contains an entry `(k, o)` indicating that node `o` owns the tuple with
key `k`. The node `o` contains a `(k, v)` entry in its data table. The owner
table key-value store is partitioned across nodes using an arbitrary
partitioning scheme (e.g. hash-based, range-based).

When a node initiates a transaction, it requests ownership of every tuple it
reads and writes. This migrates the tuples to the initiating node and updates
the ownership information to reflect the ownership transfer. Here's how the
ownership transfer protocol works (a sketch in code follows the list). For a
given tuple `d = (k, v)`, the **requester** is the node requesting ownership of
`d`, the **partitioner** is the node with ownership information `(k, o)`, and
the **owner** is the node that stores `d`.

- First, the requester sends an **owner request** with key `k` to the
  partitioner.
- Then, the partitioner looks up the owner of the tuple with key `k` in its
  owner table and sends a **transfer request** to the owner.
- The owner retrieves the value of the tuple and sends it in a **transfer
  response** back to the requester. It also deletes its copy of the tuple.
- Finally, the requester sends an **inform** message to the partitioner
  informing it that the ownership transfer is complete. The partitioner
  updates its owner table to reflect the new owner.
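
A minimal Python sketch of the four-step transfer, reusing the node-dictionary
layout sketched above and simulating the messages with direct function calls;
real LEAP exchanges asynchronous network messages and must survive message loss
(see the appendix of the paper):

```python
def request_ownership(requester, partitioner, owner, k):
    """Each argument is a node dict with "id", "data", and "owners" fields."""
    # 1. Owner request: requester -> partitioner, carrying the key k.
    # 2. Transfer request: partitioner -> owner, looked up in the owner table.
    assert partitioner["owners"][k] == owner["id"]
    # 3. Transfer response: owner -> requester; the owner deletes its copy.
    v = owner["data"].pop(k)
    requester["data"][k] = v
    # 4. Inform: requester -> partitioner, which records the new owner.
    partitioner["owners"][k] = requester["id"]
    return v

# Node 2 owns tuple ("x", 42), node 1 is its partitioner, node 0 requests it.
n0 = {"id": 0, "data": {}, "owners": {}}
n1 = {"id": 1, "data": {}, "owners": {"x": 2}}
n2 = {"id": 2, "data": {"x": 42}, "owners": {}}
request_ownership(n0, n1, n2, "x")
assert n0["data"] == {"x": 42} and n1["owners"]["x"] == 0
```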

Also note that

- if the requester, partitioner, and owner are all different nodes, then this
  scheme requires **4 messages**,
- if the partitioner and owner are the same, then this scheme requires **3
  messages**, and
- if the requester and partitioner are the same, then this scheme requires **2
  messages**.

If the transfer response is dropped after the owner has deleted its copy of the
tuple, the data is lost. See the appendix for information on how to make this
ownership transfer fault tolerant. Also see the paper for a theoretical
comparison of 2PC and LEAP.

## LEAP-Based OLTP Engine
L-Store is a distributed OLTP database based on H-Store which uses LEAP to
manage transactions. Transactions acquire read/write locks on individual tuples
and use strict two-phase locking. Transactions are assigned globally unique
identifiers, and deadlock prevention is implemented with a wait-die scheme
where lower-timestamped transactions have higher priority. That is,
higher-priority transactions wait on lower-priority transactions, but
lower-priority transactions abort rather than wait on higher-priority
transactions.
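
A minimal Python sketch of this wait-die rule, where a lower timestamp means
higher priority; how L-Store wires the decision into its lock manager is not
shown here, and the function name is made up for illustration:

```python
def on_lock_conflict(requester_ts, holder_ts):
    """Decide what the requesting transaction does when it hits a held lock."""
    if requester_ts < holder_ts:
        return "wait"   # older (higher-priority) transaction waits for the holder
    else:
        return "abort"  # younger (lower-priority) transaction dies and retries later
```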

Concurrent local transactions are processed as usual; what's interesting is how
concurrent transfer requests are handled. Imagine a transaction is requesting
ownership of a tuple on another node.

- First, the requester creates a **request lock** locally indicating that it is
  currently trying to request ownership of the tuple. It then sends an owner
  request to the partitioner.
- The partitioner may receive multiple concurrent owner requests. It processes
  them serially using the wait-die scheme. As an optimization, it processes
  requests in decreasing timestamp order to avoid aborts whenever possible. It
  then forwards a transfer request to the owner.
- If the owner is currently accessing the tuple being requested, it again uses
  a wait-die scheme to access the tuple before sending it back to the
  requester.
- Finally, the requester (now the tuple's owner) changes the request lock into
  a normal data lock and continues processing.
continues processing.
102+
103+
If a transaction cannot successfully get ownership of a tuple, it aborts.
104+
L-Store also uses logging and checkpointing for fault tolerance (see paper for
105+
details).
