# [Towards a Non-2PC Transaction Management in Distributed Database Systems (2016)](https://scholar.google.com/scholar?cluster=9359440394568724083)
For a traditional single-node database, the data that a transaction reads and
writes is all on a single machine. For a distributed OLTP database, there are
two types of transactions:

1. **Local transactions** are transactions that read and write data on a single
   machine, much like traditional transactions. A distributed database can
   process a local transaction as efficiently as a traditional single-node
   database can.
2. **Distributed transactions** are transactions that read and write data
   that is spread across multiple machines. Typically, distributed databases
   use two-phase commit (2PC) to commit or abort distributed transactions.
   Because 2PC requires multiple rounds of communication, distributed
   databases process distributed transactions less efficiently than local
   transactions.

This paper presents an alternative to 2PC, dubbed **Localizing Executions via
Aggressive Placement of data (LEAP)**, which tries to avoid the communication
overheads of 2PC by aggressively moving all the data a distributed transaction
reads and writes onto a single machine, effectively turning the distributed
transaction into a local transaction.

## LEAP
LEAP is based on the following assumptions and observations:

- Transactions in an OLTP workload don't read or write many tuples.
- Tuples in an OLTP database are typically very small.
- Multiple transactions issued one after another may access the same data again
  and again.
- As more advanced network technology becomes available (e.g. RDMA), the cost
  of moving data becomes smaller and smaller.

With LEAP, tuples are horizontally partitioned across a set of nodes, and each
tuple is stored exactly once. Each node has two data structures:

- a **data table**, which stores tuples, and
- an **owner table**, a horizontally partitioned key-value store which stores
  ownership information.

Consider a tuple `d = (k, v)` with primary key `k` and value `v`. The owner
table contains an entry `(k, o)` indicating that node `o` owns the tuple with
key `k`, and node `o` stores a `(k, v)` entry in its data table. The owner
table is partitioned across nodes using an arbitrary partitioning scheme
(e.g. hash-based, range-based).
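
To make the layout concrete, here is a tiny illustration (not from the paper) of
what the two tables might hold in a three-node cluster, assuming tuple
`("k", "v")` lives on node 2 and node 1 happens to be the partitioner for key
`k`:

```python
# Hypothetical three-node cluster: node 2 owns tuple ("k", "v"), and node 1
# holds the owner-table entry for key "k" (all node assignments made up).
data_tables  = {0: {}, 1: {}, 2: {"k": "v"}}  # per-node data tables
owner_tables = {0: {}, 1: {"k": 2}, 2: {}}    # per-node owner-table partitions
```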

When a node initiates a transaction, it requests ownership of every tuple the
transaction reads and writes. This migrates the tuples to the initiating node
and updates the ownership information to reflect the transfer. Here's how the
ownership transfer protocol works (a code sketch of the message flow follows
the list). For a given tuple `d = (k, v)`, the **requester** is the node
requesting ownership of `d`, the **partitioner** is the node that stores the
ownership information `(k, o)`, and the **owner** is the node that stores `d`.
| 53 | + |
| 54 | +- First, the requester sends an **owner request** with key `k` to the |
| 55 | + partitioner. |
| 56 | +- Then, the partitioner looks up the owner of the tuple with `k` in its owner |
| 57 | + table and sends a **transfer request** to the owner. |
| 58 | +- The owner retrieves the value of the tuple and sends it in a **transfer |
| 59 | + response** back to the requester. It also deletes its copy of the tuple. |
| 60 | +- Finally, the requester sends an **inform** message to the partitioner |
| 61 | + informing it that the ownership transfer was complete. The partitioner |
| 62 | + updates its owner table to reflect the new owner. |
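
Here is a minimal sketch of the four-message flow, with nodes invoking one
another's handlers directly in place of real network messages and a hash-based
owner-table partitioning. The `Node` class, its method names, and the
three-node example are illustrative assumptions, not code from the paper.

```python
class Node:
    def __init__(self, node_id, cluster):
        self.node_id = node_id
        self.cluster = cluster    # node_id -> Node, shared by all nodes
        self.data_table = {}      # key -> value, tuples this node owns
        self.owner_table = {}     # key -> owner node_id (partitioner state)

    def partitioner_of(self, key):
        # One possible scheme: hash-partition the owner table across nodes.
        return hash(key) % len(self.cluster)

    # Requester, message 1: owner request to the partitioner.
    def request_ownership(self, key):
        self.cluster[self.partitioner_of(key)].on_owner_request(key, self.node_id)

    # Partitioner, message 2: look up the current owner, forward a transfer request.
    def on_owner_request(self, key, requester_id):
        owner_id = self.owner_table[key]
        self.cluster[owner_id].on_transfer_request(key, requester_id)

    # Owner, message 3: send the tuple to the requester and delete the local copy.
    def on_transfer_request(self, key, requester_id):
        value = self.data_table.pop(key)
        self.cluster[requester_id].on_transfer_response(key, value)

    # Requester, message 4: install the tuple and inform the partitioner.
    def on_transfer_response(self, key, value):
        self.data_table[key] = value
        self.cluster[self.partitioner_of(key)].on_inform(key, new_owner=self.node_id)

    # Partitioner: record the new owner.
    def on_inform(self, key, new_owner):
        self.owner_table[key] = new_owner

# Example: node 0 pulls the tuple with key "k" away from node 2.
cluster = {}
for i in range(3):
    cluster[i] = Node(i, cluster)
p = cluster[0].partitioner_of("k")  # whichever node the hash picks
cluster[2].data_table["k"] = "v"    # node 2 initially owns the tuple
cluster[p].owner_table["k"] = 2     # its partitioner records that fact
cluster[0].request_ownership("k")
assert cluster[0].data_table["k"] == "v" and "k" not in cluster[2].data_table
assert cluster[p].owner_table["k"] == 0
```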

Also note that

- if the requester, partitioner, and owner are all different nodes, then this
  scheme requires **4 messages**,
- if the partitioner and owner are the same, then this scheme requires **3
  messages**, and
- if the requester and partitioner are the same, then this scheme requires **2
  messages**.
| 72 | + |
| 73 | +If the transfer request is dropped and the owner deletes the tuple, data is |
| 74 | +lost. See the appendix for information on how to make this ownership transfer |
| 75 | +fault tolerant. Also see the paper for a theoretical comparison of 2PC and |
| 76 | +LEAP. |

## LEAP-Based OLTP Engine
L-Store is a distributed OLTP database based on H-Store which uses LEAP to
manage transactions. Transactions acquire read/write locks on individual tuples
and use strict two-phase locking. Transactions are assigned globally unique
identifiers, and deadlock prevention is implemented with a wait-die scheme in
which lower timestamped transactions have higher priority. That is, higher
priority transactions wait on lower priority transactions, but lower priority
transactions abort rather than wait on higher priority transactions.
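
Here is a minimal sketch of the wait-die rule itself, assuming each transaction
carries a globally unique timestamp and that a smaller timestamp means higher
priority (an older transaction). This is the generic policy described above,
not L-Store's actual locking code.

```python
class Abort(Exception):
    """Raised when a lower-priority (younger) transaction must die, not wait."""

def wait_or_die(holder_ts: int, requester_ts: int) -> str:
    """Decide what a lock requester does when the lock is already held."""
    if requester_ts < holder_ts:
        # Older (higher-priority) requester is allowed to wait for the holder.
        return "wait"
    # Younger (lower-priority) requester aborts instead of waiting, which rules
    # out cycles of waiting transactions and hence deadlock.
    raise Abort(f"txn {requester_ts} aborts: lock held by older txn {holder_ts}")

assert wait_or_die(holder_ts=10, requester_ts=3) == "wait"  # older txn waits
try:
    wait_or_die(holder_ts=3, requester_ts=10)               # younger txn dies
except Abort:
    pass
```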

Concurrent local transactions are processed as usual; what's interesting is how
concurrent transfer requests are handled. Imagine a transaction is requesting
ownership of a tuple on another node.

- First, the requester creates a **request lock** locally indicating that it is
  currently trying to request ownership of the tuple. It then sends an owner
  request to the partitioner.
- The partitioner may receive multiple concurrent owner requests. It processes
  them serially using the wait-die scheme. As an optimization, it processes
  requests in decreasing timestamp order to avoid aborts whenever possible (see
  the sketch after this list). It then forwards a transfer request to the
  owner.
- If the owner is currently accessing the tuple being requested, it again uses
  the wait-die scheme to arbitrate access to the tuple before sending it to the
  requester.
- Finally, the requester changes the request lock into a normal data lock and
  continues processing.
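
Here is a rough sketch of how a partitioner might serialize concurrent owner
requests for a single key, serving the youngest (largest timestamp) request
first so that the older, higher-priority requests left behind wait rather than
die. The `PendingRequests` structure is an illustration of that ordering
optimization, not L-Store's implementation.

```python
import heapq

class PendingRequests:
    """Pending owner requests for one key, served in decreasing timestamp order."""

    def __init__(self):
        self._heap = []  # max-heap on timestamp, simulated by negating timestamps

    def add(self, timestamp: int, requester_id: int) -> None:
        heapq.heappush(self._heap, (-timestamp, requester_id))

    def next_to_serve(self):
        # Pop the youngest pending request first; under wait-die, the older
        # requests still queued are allowed to wait, so none of them abort.
        if not self._heap:
            return None
        neg_ts, requester_id = heapq.heappop(self._heap)
        return (-neg_ts, requester_id)

# Example: owner requests with timestamps 5, 9, and 7 arrive concurrently.
pending = PendingRequests()
for ts, rid in [(5, 0), (9, 1), (7, 2)]:
    pending.add(ts, rid)
assert pending.next_to_serve() == (9, 1)  # youngest request is served first
```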

If a transaction cannot successfully get ownership of a tuple, it aborts.
L-Store also uses logging and checkpointing for fault tolerance (see paper for
details).