|
| 1 | +<!DOCTYPE html> |
| 2 | +<html> |
| 3 | +<head> |
| 4 | + <title>Papers</title> |
| 5 | + <link href='../style.css' rel='stylesheet'> |
| 6 | + <meta name=viewport content="width=device-width, initial-scale=1"> |
| 7 | +</head> |
| 8 | + |
| 9 | +<body> |
| 10 | + <div id="container"> |
| 11 | +<style> |
| 12 | + table { |
| 13 | + border-collapse: collapse; |
| 14 | + } |
| 15 | + |
| 16 | + th, td { |
| 17 | + border: 2px solid black; |
| 18 | + min-width: 50px; |
| 19 | + padding: 4pt; |
| 20 | + } |
| 21 | +</style> |
| 22 | + |
| 23 | +<p hidden> |
| 24 | +$\newcommand{\set}[1]{\left\{#1\right\}}$ $\newcommand{\setst}[2]{\left\{#1 \,\middle|\, #2\right\}}$ $\newcommand{\lam}[2]{\lambda #1.\>#2}$ $\newcommand{\typedlam}[3]{\lam{#1\in#2}{#3}}$ $\newcommand{\denote}[1]{[ \! [{#1}] \! ]}$ $\newcommand{\domain}{\textbf{D}}$ $\newcommand{\relations}{\mathcal{R}}$ $\newcommand{\fields}{\mathcal{U}}$ $\newcommand{\getfield}[2]{#1 \cdot #2}$ $\newcommand{\Tuple}{Tuple}$ $\newcommand{\UTuple}{U\text{-}Tuple}$ $\newcommand{\TupleLoc}{TupleLoc}$ $\newcommand{\FieldLoc}{FieldLoc}$ |
| 25 | +</p> |
| 26 | + |
| 27 | +<h1 id="provenance-in-databases-why-how-and-where"><a href="https://scholar.google.com/scholar?cluster=14688264622623487965">Provenance in Databases: Why, How, and Where</a></h1> |
| 28 | +<h2 id="chapter-1-introduction">Chapter 1: Introduction</h2> |
| 29 | +<p><strong>Data provenance</strong>, also known as <strong>data lineage</strong>, describes the origin and history of data as it is moved, copied, transformed, and queried in a data system. In the context of relational databases, provenance will allow us to point at a tuple (or part of a tuple) in the output of a query and ask why or how it got there. In this book, we'll study three forms of provenance known as <em>why-provenance</em>, <em>how-provenance</em>, and <em>where-provenance</em>.</p> |
| 30 | +<h3 id="lineage">Lineage</h3> |
| 31 | +<p>The <strong>lineage</strong> of tuple $t$ in the output of evaluating query $Q$ against database instance $I$ is a subset of the tuples in $I$ (known as a <strong>witness</strong>) that is sufficient for $t$ to appear in the output. Lineage is best explained through an example. Consider the following relation $R$</p> |
| 32 | +<table> |
| 33 | +<thead> |
| 34 | +<tr class="header"> |
| 35 | +<th align="left">id</th> |
| 36 | +<th align="left">A</th> |
| 37 | +</tr> |
| 38 | +</thead> |
| 39 | +<tbody> |
| 40 | +<tr class="odd"> |
| 41 | +<td align="left">$t_1$</td> |
| 42 | +<td align="left">1</td> |
| 43 | +</tr> |
| 44 | +<tr class="even"> |
| 45 | +<td align="left">$t_2$</td> |
| 46 | +<td align="left">2</td> |
| 47 | +</tr> |
| 48 | +</tbody> |
| 49 | +</table> |
| 50 | +<p>and $S$</p> |
| 51 | +<table> |
| 52 | +<thead> |
| 53 | +<tr class="header"> |
| 54 | +<th align="left">id</th> |
| 55 | +<th align="left">A</th> |
| 56 | +<th align="left">B</th> |
| 57 | +</tr> |
| 58 | +</thead> |
| 59 | +<tbody> |
| 60 | +<tr class="odd"> |
| 61 | +<td align="left">$t_3$</td> |
| 62 | +<td align="left">1</td> |
| 63 | +<td align="left">blue</td> |
| 64 | +</tr> |
| 65 | +<tr class="even"> |
| 66 | +<td align="left">$t_4$</td> |
| 67 | +<td align="left">1</td> |
| 68 | +<td align="left">blue</td> |
| 69 | +</tr> |
| 70 | +<tr class="odd"> |
| 71 | +<td align="left">$t_5$</td> |
| 72 | +<td align="left">1</td> |
| 73 | +<td align="left">red</td> |
| 74 | +</tr> |
| 75 | +<tr class="even"> |
| 76 | +<td align="left">$t_6$</td> |
| 77 | +<td align="left">2</td> |
| 78 | +<td align="left">blue</td> |
| 79 | +</tr> |
| 80 | +<tr class="odd"> |
| 81 | +<td align="left">$t_7$</td> |
| 82 | +<td align="left">2</td> |
| 83 | +<td align="left">red</td> |
| 84 | +</tr> |
| 85 | +</tbody> |
| 86 | +</table> |
| 87 | +<p>and consider the query $Q$:</p> |
| 88 | +<pre><code>SELECT R.A |
| 89 | +FROM R, S |
| 90 | +WHERE R.A = S.A AND S.B = 'blue'</code></pre> |
| 91 | +<p>The result of evaluating query $Q$ is:</p> |
| 92 | +<table> |
| 93 | +<thead> |
| 94 | +<tr class="header"> |
| 95 | +<th align="left">id</th> |
| 96 | +<th align="left">A</th> |
| 97 | +</tr> |
| 98 | +</thead> |
| 99 | +<tbody> |
| 100 | +<tr class="odd"> |
| 101 | +<td align="left">$t_8$</td> |
| 102 | +<td align="left">1</td> |
| 103 | +</tr> |
| 104 | +<tr class="even"> |
| 105 | +<td align="left">$t_9$</td> |
| 106 | +<td align="left">2</td> |
| 107 | +</tr> |
| 108 | +</tbody> |
| 109 | +</table> |
| 110 | +<p>The lineage of $t_8$ is $\set{t_1, t_3, t_4}$, and the lineage of $t_9$ is $\set{t_2, t_6}$. While the lineage of a tuple $t$ is sufficient for $t$ to appear in the output, not every tuple in the lineage is necessary. For example, the lineage of $t_8$ does not capture the fact that $t_3$ and $t_4$ do not both have to appear in the input for $t_8$ to appear in the output: either one, together with $t_1$, is enough. In fact, for a given output tuple $t$ there could be an exponential number of sufficient (but not necessary) witnesses for it.</p> |
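<p>The lineage computation for this example can be sketched in a few lines of Python. This is only an illustration: the dictionary encoding of $R$ and $S$ (tuple id mapped to a row) and the function name <code>lineage</code> are assumptions made here, not notation from the book.</p>

```python
# Hypothetical encoding of the example relations: tuple id -> row.
R = {"t1": {"A": 1}, "t2": {"A": 2}}
S = {
    "t3": {"A": 1, "B": "blue"},
    "t4": {"A": 1, "B": "blue"},
    "t5": {"A": 1, "B": "red"},
    "t6": {"A": 2, "B": "blue"},
    "t7": {"A": 2, "B": "red"},
}


def lineage(R, S):
    """Map each output value of Q to the ids of every input tuple that
    takes part in some derivation of that output."""
    result = {}
    for r_id, r in R.items():
        for s_id, s in S.items():
            # Same condition as the WHERE clause of Q.
            if r["A"] == s["A"] and s["B"] == "blue":
                result.setdefault(r["A"], set()).update({r_id, s_id})
    return result


LINEAGE = lineage(R, S)
```

<p>Here <code>LINEAGE[1]</code> plays the role of the lineage of $t_8$ and <code>LINEAGE[2]</code> the lineage of $t_9$.</p>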
| 111 | +<h3 id="why-provenance">Why-Provenance</h3> |
| 112 | +<p><strong>Why-provenance</strong> is similar to lineage but tries to avoid considering an exponential number of potential witnesses. Instead, it focuses on a restricted set of witnesses known as the <strong>witness basis</strong>. For example, the witness basis of $t_8$ is $\set{\set{t_1, t_3}, \set{t_1, t_4}}$. A <strong>minimal witness basis</strong> is a witness basis consisting only of minimal witnesses. That is, it won't include two distinct witnesses $w$ and $w'$ where $w \subseteq w'$. The witness bases of two equivalent queries might differ, but the two queries are guaranteed to share the same minimal witness basis.</p> |
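<p>As a rough sketch of the difference from lineage, the following Python collects one witness per derivation instead of merging everything into a single set, and then filters out non-minimal witnesses. The encoding of the relations and the names <code>witness_basis</code> and <code>minimal</code> are assumptions for illustration only.</p>

```python
# Hypothetical encoding of the example relations: tuple id -> row.
R = {"t1": {"A": 1}, "t2": {"A": 2}}
S = {
    "t3": {"A": 1, "B": "blue"},
    "t4": {"A": 1, "B": "blue"},
    "t5": {"A": 1, "B": "red"},
    "t6": {"A": 2, "B": "blue"},
    "t7": {"A": 2, "B": "red"},
}


def witness_basis(R, S):
    """Map each output value of Q to a set of witnesses, one per derivation."""
    basis = {}
    for r_id, r in R.items():
        for s_id, s in S.items():
            if r["A"] == s["A"] and s["B"] == "blue":
                basis.setdefault(r["A"], set()).add(frozenset({r_id, s_id}))
    return basis


def minimal(witnesses):
    """Keep only witnesses that have no other witness as a subset."""
    return {
        w
        for w in witnesses
        if not any(v != w and v.issubset(w) for v in witnesses)
    }


BASIS = witness_basis(R, S)
```

<p>For this particular query every witness is already minimal, so <code>minimal(BASIS[1])</code> returns the basis unchanged. A basis such as $\set{\set{t_1}, \set{t_1, t_3}}$ would shrink to $\set{\set{t_1}}$.</p>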
| 113 | +<h3 id="how-provenance">How-Provenance</h3> |
| 114 | +<p>Given an output tuple $t$, why-provenance provides witnesses that prove $t$ should appear in the output. However, why-provenance does not tell us <em>how</em> $t$ was formed from a witness. <strong>How-provenance</strong> uses a <strong>provenance semiring</strong> to hint at how a tuple was derived. The semiring consists of polynomials over tuple ids. The polynomial $t^2 + t \cdot t'$ hints at two derivations: one which uses $t$ twice and one which uses $t$ and $t'$.</p> |
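<p>To make the polynomial reading concrete, here is a small sketch for the running example. It assumes the usual convention that a join multiplies the ids of the tuples it combines and that a projection adds the monomials of derivations yielding the same output tuple. Under that assumption, $t_8$ would be annotated with $t_1 \cdot t_3 + t_1 \cdot t_4$. The encoding of a polynomial as a <code>Counter</code> from monomials to coefficients, and the function names, are choices made here for illustration.</p>

```python
from collections import Counter

# Hypothetical encoding of the example relations: tuple id -> row.
R = {"t1": {"A": 1}, "t2": {"A": 2}}
S = {
    "t3": {"A": 1, "B": "blue"},
    "t4": {"A": 1, "B": "blue"},
    "t5": {"A": 1, "B": "red"},
    "t6": {"A": 2, "B": "blue"},
    "t7": {"A": 2, "B": "red"},
}


def how_provenance(R, S):
    """Map each output value of Q to a polynomial over tuple ids.

    A polynomial is a Counter: monomial (sorted tuple of ids) -> coefficient.
    """
    polys = {}
    for r_id, r in R.items():
        for s_id, s in S.items():
            if r["A"] == s["A"] and s["B"] == "blue":
                monomial = tuple(sorted((r_id, s_id)))  # join: multiply ids
                polys.setdefault(r["A"], Counter())[monomial] += 1  # projection: add
    return polys


def show(poly):
    """Render a polynomial as text, e.g. 't1*t3 + t1*t4'."""
    return " + ".join(
        ("" if coeff == 1 else str(coeff) + "*") + "*".join(monomial)
        for monomial, coeff in sorted(poly.items())
    )


POLYS = how_provenance(R, S)
```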
| 115 | +<h3 id="where-provenance">Where-Provenance</h3> |
| 116 | +<p><strong>Where-provenance</strong> is very similar to why-provenance except that we'll now point at a particular entry (or <strong>location</strong>) of an output tuple $t$ and ask which input locations it was copied from. For example, the where-provenance of the $A$ entry of tuple $t_8$ is the $A$ entry of tuple $t_3$ or $t_4$.</p> |
| 117 | +<h3 id="eager-vs-lazy">Eager vs Lazy</h3> |
| 118 | +<p>There are two main ways to implement data lineage:</p> |
| 119 | +<ol style="list-style-type: decimal"> |
| 120 | +<li>an <strong>eager</strong> (or <strong>bookkeeping</strong> or <strong>annotating</strong>) approach, and</li> |
| 121 | +<li>a <strong>lazy</strong> (or <strong>non-annotating</strong>) approach.</li> |
| 122 | +</ol> |
| 123 | +<p>In the eager approach, tuples are annotated and their annotations are propagated through the evaluation of a query. The lineage of an output tuple can then be directly determined using its annotations. In the lazy approach, tuples are not annotated. Instead, the lineage of a tuple must be derived by inspecting the query and input database.</p> |
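<p>The eager approach can be sketched as follows: every row carries a set of tuple ids, and each operator propagates those sets to its output. The operator functions below (<code>annotate</code>, <code>select</code>, <code>join</code>, <code>project</code>) are hypothetical helpers written for this example, not part of any particular system.</p>

```python
# Hypothetical encoding of the example relations: tuple id -> row.
R = {"t1": {"A": 1}, "t2": {"A": 2}}
S = {
    "t3": {"A": 1, "B": "blue"},
    "t4": {"A": 1, "B": "blue"},
    "t5": {"A": 1, "B": "red"},
    "t6": {"A": 2, "B": "blue"},
    "t7": {"A": 2, "B": "red"},
}


def annotate(relation):
    """Attach to each input row an annotation containing just its own id."""
    return [(row, frozenset({tid})) for tid, row in relation.items()]


def select(rows, pred):
    """Selection keeps annotations unchanged."""
    return [(row, ann) for row, ann in rows if pred(row)]


def join(left, right, field):
    """Equijoin on one field; the output annotation is the union of both inputs."""
    return [
        ({**l, **r}, l_ann | r_ann)
        for l, l_ann in left
        for r, r_ann in right
        if l[field] == r[field]
    ]


def project(rows, fields):
    """Projection merges rows that become equal and unions their annotations."""
    merged = {}
    for row, ann in rows:
        key = tuple((f, row[f]) for f in fields)
        merged[key] = merged.get(key, frozenset()) | ann
    return [(dict(key), ann) for key, ann in merged.items()]


joined = join(annotate(R), annotate(S), "A")
RESULT = project(select(joined, lambda row: row["B"] == "blue"), ["A"])
```

<p>After evaluation, the lineage of each output row can be read directly off its annotation. A lazy implementation would instead keep the rows unannotated and recompute the contributing tuples from the query and the input when asked.</p>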
| 124 | +<h3 id="notational-preliminaries">Notational Preliminaries</h3> |
| 125 | +<ul> |
| 126 | +<li>Let $\domain = \set{d_1, \ldots, d_n}$ be a finite domain of data values.</li> |
| 127 | +<li>Let $\fields$ be a collection of <strong>field names</strong> (or <strong>attribute names</strong>). We use $U$ and $V$ to range over finite subsets of $\fields$.</li> |
| 128 | +<li>A <strong>record</strong> (or <strong>tuple</strong>), usually written $t$ or $u$, is a function $U \to \domain$ written $(A_1:d_1, \ldots, A_n:d_n)$.</li> |
| 129 | +<li>A tuple whose domain is $U$ is said to be a <strong>$U$-tuple</strong>.</li> |
| 130 | +<li>$\Tuple$ is the set of all tuples and $\UTuple$ is the set of all $U$-tuples.</li> |
| 131 | +<li>We write $\getfield{t}{A}$ as a shorthand for $t(A)$.</li> |
| 132 | +<li>We write $t[U]$ as a shorthand for the restriction of $t$ to $U$: $\typedlam{A}{U}{\getfield{t}{A}}$.</li> |
| 133 | +<li>We write $t[A \mapsto B]$ for the renaming of field $A$ to $B$.</li> |
| 134 | +<li>We write $(A: e(A))$ as a shorthand for $\typedlam{A}{U}{e(A)}$.</li> |
| 135 | +<li>A <strong>relation</strong> (or <strong>table</strong>) $r: U$ is a finite set of tuples over $U$.</li> |
| 136 | +<li>$\relations$ is a finite collection of <strong>relation names</strong>.</li> |
| 137 | +<li>A schema $\textbf{R}$ is a function $(R_1:U_1, \ldots, R_n:U_n)$ from $\relations$ to $2^{\fields}$.</li> |
| 138 | +<li>A <strong>database</strong> (or <strong>instance</strong>) $I: \textbf{R}$ is a function mapping each $R_i:U_i \in \textbf{R}$ to a relation $r_i$ over $U_i$.</li> |
| 139 | +<li>A <strong>tuple location</strong> is a tuple tagged with a relation name and is written $(R, t)$. We write $\TupleLoc = \relations \times \Tuple$ for the set of all tagged tuples.</li> |
| 140 | +<li>We can view a database $I$ as $\setst{(R, t)}{t \in I(R)} \subseteq \TupleLoc$.</li> |
| 141 | +<li>A <strong>field location</strong> is a triple $(R, t, A)$ which refers to a particular field of a particular tuple. We let $\FieldLoc$ be the set of all field locations.</li> |
| 142 | +<li>Letting $Y_{\bot} = Y \cup \set{\bot}$, we'll view a partial function $f: X \rightharpoonup Y$ as a total function $f: X \to Y_{\bot}$.</li> |
| 143 | +</ul> |
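<p>A few of these definitions can be mirrored directly in Python, which may help when reading the semantics. In this sketch a tuple is a <code>dict</code> from field names to values and a database is a <code>dict</code> from relation names to lists of tuples. The function names are made up for illustration.</p>

```python
def restrict(t, U):
    """t[U]: the restriction of tuple t to the field names in U."""
    return {A: t[A] for A in U}


def rename(t, A, B):
    """t[A -> B]: rename field A to B and leave every other field unchanged."""
    return {(B if field == A else field): value for field, value in t.items()}


def tuple_locations(I):
    """View a database I as a set of tagged tuples (R, t).

    Each tuple is frozen into a hashable set of (field, value) pairs.
    """
    return {(R, frozenset(t.items())) for R, rows in I.items() for t in rows}


t = {"A": 1, "B": "blue"}
I = {"R": [{"A": 1}], "S": [t]}
```

<p>For example, <code>restrict(t, {"A"})</code> gives <code>{"A": 1}</code> and <code>rename(t, "B", "C")</code> gives <code>{"A": 1, "C": "blue"}</code>.</p>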
| 144 | +<p>Finally, this is the syntax of <strong>monotone relational algebra</strong>:</p> |
| 145 | +<pre><code>$$ |
| 146 | +\begin{array}{rrl} |
| 147 | + Q & ::= & R \\ |
| 148 | + & | & \set{t} \\ |
| 149 | + & | & \sigma_{\theta}(Q) \\ |
| 150 | + & | & \pi_{U}(Q) \\ |
| 151 | + & | & Q_1 \bowtie Q_2 \\ |
| 152 | + & | & Q_1 \cup Q_2 \\ |
| 153 | + & | & \rho_{A \mapsto B}(Q) \\ |
| 154 | +\end{array} |
| 155 | +$$</code></pre> |
| 156 | +<p>This is the semantics:</p> |
| 157 | +<pre><code>$$ |
| 158 | +\begin{array}{rrl} |
| 159 | + \denote{R}(I) & = & |
| 160 | + I(R) \\ |
| 161 | + \denote{\set{t}}(I) & = & |
| 162 | + \set{t} \\ |
| 163 | + \denote{\sigma_{\theta}(Q)}(I) & = & |
| 164 | + \setst{t \in \denote{Q}(I)}{\theta(t)} \\ |
| 165 | + \denote{\pi_{U}(Q)}(I) & = & |
| 166 | + \setst{t[U]}{t \in \denote{Q}(I)} \\ |
| 167 | + \denote{Q_1 \bowtie Q_2}(I) & = & |
| 168 | + \setst{t}{t[U_1] \in \denote{Q_1}(I), t[U_2] \in \denote{Q_2}(I)} \\ |
| 169 | + \denote{Q_1 \cup Q_2}(I) & = & |
| 170 | + \denote{Q_1}(I) \cup \denote{Q_2}(I) \\ |
| 171 | + \denote{\rho_{A \mapsto B}(Q)}(I) & = & |
| 172 | + \setst{t[A \mapsto B]}{t \in \denote{Q}(I)} \\ |
| 173 | +\end{array} |
| 174 | +$$</code></pre> |
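<p>The semantics can be read as a recursive evaluator with one case per operator: a relation name is looked up in the instance, a singleton constant evaluates to itself, and the remaining operators work on the results of their subqueries. The following is a minimal sketch. The encoding of queries as nested Python tuples such as <code>("select", theta, Q)</code> is an assumption made here, as are the helper names. Relations are lists of <code>dict</code> tuples with duplicates removed to mimic set semantics.</p>

```python
def dedup(tuples):
    """Remove duplicate tuples while keeping first-seen order (set semantics)."""
    seen, out = set(), []
    for t in tuples:
        key = frozenset(t.items())
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out


def evaluate(q, I):
    """Evaluate query q against database I (relation name -> list of tuples)."""
    op = q[0]
    if op == "rel":      # R
        return dedup(I[q[1]])
    if op == "const":    # {t}
        return [q[1]]
    if op == "select":   # sigma_theta(Q)
        return [t for t in evaluate(q[2], I) if q[1](t)]
    if op == "project":  # pi_U(Q)
        return dedup({A: t[A] for A in q[1]} for t in evaluate(q[2], I))
    if op == "join":     # Q1 join Q2: tuples must agree on shared fields
        left, right = evaluate(q[1], I), evaluate(q[2], I)
        return dedup(
            {**l, **r}
            for l in left
            for r in right
            if all(l[A] == r[A] for A in set(l).intersection(r))
        )
    if op == "union":    # Q1 union Q2
        return dedup(evaluate(q[1], I) + evaluate(q[2], I))
    if op == "rename":   # rho_{A -> B}(Q)
        A, B = q[1], q[2]
        return dedup(
            {(B if f == A else f): v for f, v in t.items()}
            for t in evaluate(q[3], I)
        )
    raise ValueError("unknown operator: " + str(op))


I = {
    "R": [{"A": 1}, {"A": 2}],
    "S": [
        {"A": 1, "B": "blue"},
        {"A": 1, "B": "red"},
        {"A": 2, "B": "blue"},
        {"A": 2, "B": "red"},
    ],
}
Q = (
    "project",
    ["A"],
    ("select", lambda t: t["B"] == "blue", ("join", ("rel", "R"), ("rel", "S"))),
)
OUTPUT = evaluate(Q, I)
```

<p>Because relations are sets here, the two identical rows $t_3$ and $t_4$ of the earlier example collapse into one. Keeping them apart requires tuple ids or annotations, which is what the provenance models add.</p>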
| 175 | +<h2 id="chapter-2-why-provenance">Chapter 2: Why-Provenance</h2> |
| 176 | +<p>TODO</p> |
| 177 | +<h2 id="chapter-3-how-provenance">Chapter 3: How-Provenance</h2> |
| 178 | +<p>TODO</p> |
| 179 | +<h2 id="chapter-4-where-provenance">Chapter 4: Where-Provenance</h2> |
| 180 | +<p>TODO</p> |
| 181 | +<h2 id="chapter-5-comparing-models-of-provenance">Chapter 5: Comparing Models of Provenance</h2> |
| 182 | +<p>TODO</p> |
| 183 | +<script type="text/javascript" async |
| 184 | + src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML"> |
| 185 | +</script> |
| 186 | + |
| 187 | + |
| 188 | + <script type="text/x-mathjax-config"> |
| 189 | + MathJax.Hub.Config({ |
| 190 | + tex2jax: { |
| 191 | + inlineMath: [['$','$'], ['\\(','\\)']], |
| 192 | + skipTags: ['script', 'noscript', 'style', 'textarea'], |
| 193 | + }, |
| 194 | + messageStyle: "none", |
| 195 | + }); |
| 196 | + </script> |
| 197 | + </div> |
| 198 | +</body> |
| 199 | +</html> |