Commit cb1ebcf

Added intro section of lineage book.
1 parent 5d9b7b6 commit cb1ebcf

3 files changed

+401
-0
lines changed

html/cheney2009provenance.html

Lines changed: 199 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,199 @@
1+
<!DOCTYPE html>
2+
<html>
3+
<head>
4+
<title>Papers</title>
5+
<link href='../style.css' rel='stylesheet'>
6+
<meta name=viewport content="width=device-width, initial-scale=1">
7+
</head>
8+
9+
<body>
10+
<div id="container">
11+
<style>
12+
table {
13+
border-collapse: collapse;
14+
}
15+
16+
th, td {
17+
border: 2px solid black;
18+
min-width: 50px;
19+
padding: 4pt;
20+
}
21+
</style>
22+
23+
<p hidden>
24+
$\newcommand{\set}[1]{\left\{#1\right\}}$ $\newcommand{\setst}[2]{\left\{#1 \,\middle|\, #2\right\}}$ $\newcommand{\lam}[2]{\lambda #1.\&gt;#2}$ $\newcommand{\typedlam}[3]{\lam{#1\in#2}{#3}}$ $\newcommand{\denote}[1]{[ \! [{#1}] \! ]}$ $\newcommand{\domain}{\textbf{D}}$ $\newcommand{\relations}{\mathcal{R}}$ $\newcommand{\fields}{\mathcal{U}}$ $\newcommand{\getfield}[2]{#1 \cdot #2}$ $\newcommand{\Tuple}{Tuple}$ $\newcommand{\UTuple}{U\text{-}Tuple}$ $\newcommand{\TupleLoc}{TupleLoc}$ $\newcommand{\FieldLoc}{FieldLoc}$
25+
</p>
26+
27+
<h1 id="provenance-in-databases-why-how-and-where"><a href="https://scholar.google.com/scholar?cluster=14688264622623487965">Provenance in Databases: Why, How, and Where</a></h1>
28+
<h2 id="chapter-1-introduction">Chapter 1: Introduction</h2>
29+
<p><strong>Data provenance</strong>, also known as <strong>data lineage</strong>, describes the origin and history of data as it is moved, copied, transformed, and queried in a data system. In the context of relational databases, provenance will allow us to point at a tuple (or part of a tuple) in the output of a query and ask why or how it got there. In this book, we'll study three forms of provenance known as <em>why-provenance</em>, <em>how-provenance</em>, and <em>where-provenance</em>.</p>
30+
<h3 id="lineage">Lineage</h3>
31+
<p>The <strong>lineage</strong> of a tuple $t$ in the output of evaluating query $Q$ against database instance $I$ is a subset of the tuples in $I$ (known as a <strong>witness</strong>) that is sufficient for $t$ to appear in the output. Lineage is best explained through an example. Consider the following relation $R$</p>
32+
<table>
33+
<thead>
34+
<tr class="header">
35+
<th align="left">id</th>
36+
<th align="left">A</th>
37+
</tr>
38+
</thead>
39+
<tbody>
40+
<tr class="odd">
41+
<td align="left">$t_1$</td>
42+
<td align="left">1</td>
43+
</tr>
44+
<tr class="even">
45+
<td align="left">$t_2$</td>
46+
<td align="left">2</td>
47+
</tr>
48+
</tbody>
49+
</table>
50+
<p>and $S$</p>
51+
<table>
52+
<thead>
53+
<tr class="header">
54+
<th align="left">id</th>
55+
<th align="left">A</th>
56+
<th align="left">B</th>
57+
</tr>
58+
</thead>
59+
<tbody>
60+
<tr class="odd">
61+
<td align="left">$t_3$</td>
62+
<td align="left">1</td>
63+
<td align="left">blue</td>
64+
</tr>
65+
<tr class="even">
66+
<td align="left">$t_4$</td>
67+
<td align="left">1</td>
68+
<td align="left">blue</td>
69+
</tr>
70+
<tr class="odd">
71+
<td align="left">$t_5$</td>
72+
<td align="left">1</td>
73+
<td align="left">red</td>
74+
</tr>
75+
<tr class="even">
76+
<td align="left">$t_6$</td>
77+
<td align="left">2</td>
78+
<td align="left">blue</td>
79+
</tr>
80+
<tr class="odd">
81+
<td align="left">$t_7$</td>
82+
<td align="left">2</td>
83+
<td align="left">red</td>
84+
</tr>
85+
</tbody>
86+
</table>
87+
<p>and consider the query $Q$:</p>
88+
<pre><code>SELECT R.A
89+
FROM R, S
90+
WHERE R.A = S.A AND S.B = 'blue'</code></pre>
91+
<p>The result of evaluating query $Q$ is:</p>
92+
<table>
93+
<thead>
94+
<tr class="header">
95+
<th align="left">id</th>
96+
<th align="left">A</th>
97+
</tr>
98+
</thead>
99+
<tbody>
100+
<tr class="odd">
101+
<td align="left">$t_8$</td>
102+
<td align="left">1</td>
103+
</tr>
104+
<tr class="even">
105+
<td align="left">$t_9$</td>
106+
<td align="left">2</td>
107+
</tr>
108+
</tbody>
109+
</table>
110+
<p>The lineage of $t_8$ is $\set{t_1, t_3, t_4}$, and the lineage of $t_9$ is $\set{t_2, t_6}$. While the lineage of a tuple $t$ is sufficient for $t$ to appear in the output, not every tuple in the lineage is necessary. For example, the lineage of $t_8$ does not capture the fact that only one of $t_3$ and $t_4$ has to appear in the input for $t_8$ to appear in the output. In fact, a given output tuple $t$ can have an exponential number of sufficient (but not necessary) witnesses.</p>
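<p>As a rough illustration of the example above, the following Python sketch evaluates $Q$ over $R$ and $S$ and collects, for each output value of $A$, every input tuple that took part in some derivation of it. The dictionary encoding and the name <code>lineage_of_q</code> are assumptions made for this sketch, not part of the book.</p>

```python
# Relations R and S from the example, keyed by tuple id.
R = {"t1": {"A": 1}, "t2": {"A": 2}}
S = {
    "t3": {"A": 1, "B": "blue"},
    "t4": {"A": 1, "B": "blue"},
    "t5": {"A": 1, "B": "red"},
    "t6": {"A": 2, "B": "blue"},
    "t7": {"A": 2, "B": "red"},
}


def lineage_of_q(r_rel, s_rel):
    """Evaluate Q, mapping each output value of A to the ids of all contributing inputs."""
    lineage = {}
    for r_id, r in r_rel.items():
        for s_id, s in s_rel.items():
            # WHERE R.A = S.A AND S.B = 'blue'
            if r["A"] == s["A"] and s["B"] == "blue":
                lineage.setdefault(r["A"], set()).update({r_id, s_id})
    return lineage


lineage = lineage_of_q(R, S)
for a_value, ids in sorted(lineage.items()):
    print(a_value, sorted(ids))
```

<p>Under this encoding, the output row with $A = 1$ collects $t_1$, $t_3$, and $t_4$, and the row with $A = 2$ collects $t_2$ and $t_6$.</p>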
111+
<h3 id="why-provenance">Why-Provenance</h3>
112+
<p><strong>Why-provenance</strong> is similar to lineage but tries to avoid considering an exponential number of potential witnesses. Instead, it focuses on a restricted set of witnesses known as the <strong>witness basis</strong>. For example, the witness basis of $t_8$ is $\set{\set{t_1, t_3}, \set{t_1, t_4}}$. A <strong>minimal witness basis</strong> is a witness basis consisting only of minimal witnesses. That is, it won't include two witnesses $w$ and $w'$ where $w \subsetneq w'$. The witness basis of two equivalent queries might differ, but the two queries are guaranteed to share the same minimal witness basis.</p>
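<p>A minimal sketch of the idea, assuming the same example relations and query as in the lineage section: each joining pair of input tuples is recorded as its own witness, and a helper filters out witnesses that strictly contain another witness. The function names <code>witness_basis</code> and <code>minimal_witnesses</code> are hypothetical.</p>

```python
R = {"t1": {"A": 1}, "t2": {"A": 2}}
S = {
    "t3": {"A": 1, "B": "blue"},
    "t4": {"A": 1, "B": "blue"},
    "t5": {"A": 1, "B": "red"},
    "t6": {"A": 2, "B": "blue"},
    "t7": {"A": 2, "B": "red"},
}


def witness_basis(r_rel, s_rel):
    """Map each output value of A to a set of witnesses, one per joining pair."""
    basis = {}
    for r_id, r in r_rel.items():
        for s_id, s in s_rel.items():
            if r["A"] == s["A"] and s["B"] == "blue":
                basis.setdefault(r["A"], set()).add(frozenset({r_id, s_id}))
    return basis


def minimal_witnesses(witnesses):
    """Drop every witness that has a strict subset among the other witnesses."""
    return {w for w in witnesses if not any(v < w for v in witnesses)}


basis = witness_basis(R, S)
print(sorted(sorted(w) for w in basis[1]))
```

<p>For this query every witness in the basis is already minimal, so <code>minimal_witnesses</code> leaves it unchanged. It only removes something when a basis contains, say, both $\set{t_1, t_3}$ and $\set{t_1, t_3, t_4}$.</p>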
113+
<h3 id="how-provenance">How-Provenance</h3>
114+
<p>Given an output tuple $t$, why-provenance provides witnesses that prove $t$ should appear in the output. However, why-provenance does not tell us <em>how</em> $t$ was formed from a witness. <strong>How-provenance</strong> uses a <strong>provenance semiring</strong> to hint at how a tuple was derived. The semiring consists of polynomials over tuple ids, where addition separates alternative derivations and multiplication combines the tuples used together in one derivation. The polynomial $t^2 + t \cdot t'$ hints at two derivations: one which uses $t$ twice and one which uses $t$ and $t'$.</p>
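<p>To make the polynomial reading concrete, here is a small sketch that represents a provenance polynomial as a <code>Counter</code> from monomials (sorted tuples of ids) to coefficients. This representation and the names <code>var</code>, <code>plus</code>, and <code>times</code> are assumptions chosen for illustration. They are not the book's formal definition.</p>

```python
from collections import Counter


def var(tuple_id):
    """The polynomial consisting of a single tuple id."""
    return Counter({(tuple_id,): 1})


def plus(p, q):
    """Alternative derivations: add coefficients of matching monomials."""
    return p + q


def times(p, q):
    """Joint use: multiply every monomial of p with every monomial of q."""
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out


# t * (t + t') expands to t^2 + t * t', the polynomial from the text.
example = times(var("t"), plus(var("t"), var("t'")))

# How-provenance of t_8 in the running example: t_1 joins with t_3 or with t_4.
t8 = plus(times(var("t1"), var("t3")), times(var("t1"), var("t4")))
print(dict(example), dict(t8))
```

<p>The monomial <code>("t", "t")</code> stands for $t^2$, so repeated use of a tuple is preserved, which is exactly the information that a witness (a plain set) discards.</p>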
115+
<h3 id="where-provenance">Where-Provenance</h3>
116+
<p><strong>Where-provenance</strong> is very similar to why-provenance except that we'll now point at a particular entry (or <strong>location</strong>) of an output tuple $t$ and ask which input locations it was copied from. For example, the where-provenance of the $A$ entry of tuple $t_8$ is the $A$ entry of tuple $t_3$ or $t_4$.</p>
117+
<h3 id="eager-vs-lazy">Eager vs Lazy</h3>
118+
<p>There are two main ways to implement data lineage:</p>
119+
<ol style="list-style-type: decimal">
120+
<li>an <strong>eager</strong> (or <strong>bookkeeping</strong> or <strong>annotating</strong>) approach, and</li>
121+
<li>a <strong>lazy</strong> (or <strong>non-annotating</strong>) approach.</li>
122+
</ol>
123+
<p>In the eager approach, tuples are annotated and their annotations are propagated through the evaluation of a query. The lineage of an output tuple can then be directly determined using its annotations. In the lazy approach, tuples are not annotated. Instead, the lineage of a tuple must be derived by inspecting the query and input database.</p>
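<p>The contrast can be sketched on the running example. In the eager half below, each row carries a set of input ids that the operators propagate. In the lazy half, nothing is annotated and the lineage of one output value is recomputed from the query and the inputs. The operator names (<code>scan</code>, <code>select</code>, <code>join</code>, <code>project</code>) and the function <code>lazy_lineage</code> are hypothetical and specialized to this one query.</p>

```python
R = {"t1": {"A": 1}, "t2": {"A": 2}}
S = {
    "t3": {"A": 1, "B": "blue"},
    "t4": {"A": 1, "B": "blue"},
    "t5": {"A": 1, "B": "red"},
    "t6": {"A": 2, "B": "blue"},
    "t7": {"A": 2, "B": "red"},
}


# Eager: annotate rows at the leaves and propagate annotations through operators.
def scan(rel):
    return [(row, frozenset({tid})) for tid, row in rel.items()]


def select(rows, pred):
    return [(row, ann) for row, ann in rows if pred(row)]


def join(left, right, field):
    return [({**l, **r}, la | ra)
            for l, la in left for r, ra in right if l[field] == r[field]]


def project(rows, fields):
    out = {}
    for row, ann in rows:
        key = tuple(row[f] for f in fields)
        out[key] = out.get(key, frozenset()) | ann  # merge duplicates
    return out


blue = select(scan(S), lambda s: s["B"] == "blue")
eager = project(join(scan(R), blue, "A"), ["A"])


# Lazy: no annotations; inspect the query and the inputs when asked.
def lazy_lineage(r_rel, s_rel, a_value):
    ids = set()
    for r_id, r in r_rel.items():
        for s_id, s in s_rel.items():
            if r["A"] == a_value and s["A"] == a_value and s["B"] == "blue":
                ids |= {r_id, s_id}
    return ids


print(sorted(eager[(1,)]), sorted(lazy_lineage(R, S, 1)))
```

<p>Both halves arrive at the same lineage. The eager version pays for annotations on every row during evaluation, while the lazy version pays only when a specific output tuple is questioned.</p>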
124+
<h3 id="notational-preliminaries">Notational Preliminaries</h3>
125+
<ul>
126+
<li>Let $\domain = \set{d_1, \ldots, d_n}$ be a finite domain of data values.</li>
127+
<li>Let $\fields$ be a collection of <strong>field names</strong> (or <strong>attribute names</strong>), and let $U, V \subseteq \fields$ range over sets of field names.</li>
128+
<li>A <strong>record</strong> (or <strong>tuple</strong>) $t, u$ is a function $U \to \domain$ written $(A_1:d_1, \ldots, A_n:d_n)$.</li>
129+
<li>A tuple whose domain is $U$ is said to be a <strong>$U$-tuple</strong>.</li>
130+
<li>$\Tuple$ is the set of all tuples and $\UTuple$ is the set of all $U$-tuples.</li>
131+
<li>We write $\getfield{t}{A}$ as a shorthand for $t(A)$.</li>
132+
<li>We write $t[U]$ as a shorthand for the restriction of $t$ to $U$: $\typedlam{A}{U}{\getfield{t}{A}}$.</li>
133+
<li>We write $t[A \mapsto B]$ for the renaming of field $A$ to $B$.</li>
134+
<li>We write $(A: e(A))$ as a shorthand for $\typedlam{A}{U}{e(A)}$.</li>
135+
<li>A <strong>relation</strong> (or <strong>table</strong>) $r: U$ is a finite set of tuples over $U$.</li>
136+
<li>$\relations$ is a finite collection of <strong>relation names</strong>.</li>
137+
<li>A schema $\textbf{R}$ is a function $(R_1:U_1, \ldots, R_n:U_n)$ from $\relations$ to $2^{\fields}$.</li>
138+
<li>A <strong>database</strong> (or <strong>instance</strong>) $I: \textbf{R}$ is a function mapping each $R_i:U_i \in \textbf{R}$ to a relation $r_i$ over $U_i$.</li>
139+
<li>A <strong>tuple location</strong> is a tuple tagged with a relation name and is written $(R, t)$. We write $\TupleLoc = \relations \times \Tuple$ for the set of all tagged tuples.</li>
140+
<li>We can view a database $I$ as $\setst{(R, t)}{t \in I(R)} \subseteq \TupleLoc$.</li>
141+
<li>A <strong>field location</strong> is a triple $(R, t, A)$ which refers to a particular field of a particular tuple. We let $\FieldLoc$ be the set of all field locations.</li>
142+
<li>Letting $Y_{\bot} = Y \cup \set{\bot}$, we'll view a partial function $f: X \rightharpoonup Y$ as a total function $f: X \to Y_{\bot}$.</li>
143+
</ul>
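<p>The notation above maps fairly directly onto ordinary data structures. The sketch below, which assumes tuples are Python dictionaries from field names to values and a database is a dictionary from relation names to lists of tuples, shows restriction $t[U]$, renaming $t[A \mapsto B]$, and the views of a database as sets of tuple locations and field locations. All function names are hypothetical.</p>

```python
def restrict(t, U):
    """t[U]: keep only the fields of t named in U."""
    return {A: t[A] for A in U}


def rename(t, A, B):
    """t[A -> B]: rename field A to B, leaving the other fields unchanged."""
    return {(B if name == A else name): value for name, value in t.items()}


def tuple_locs(I):
    """View database I as a set of (R, t) pairs; tuples are frozen to be hashable."""
    return {(R, frozenset(t.items())) for R, rel in I.items() for t in rel}


def field_locs(I):
    """All (R, t, A) triples: one per field of each tagged tuple."""
    return {(R, t, A) for R, t in tuple_locs(I) for A, _ in t}


t = {"A": 1, "B": "blue"}
I = {"R": [{"A": 1}, {"A": 2}], "S": [t]}

print(restrict(t, {"A"}), rename(t, "B", "C"))
print(len(tuple_locs(I)), len(field_locs(I)))
```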
144+
<p>Finally, this is the syntax of <strong>monotone relational algebra</strong>:</p>
145+
<pre><code>$$
146+
\begin{array}{rrl}
147+
Q &amp; ::= &amp; R \\
148+
&amp; | &amp; \set{t} \\
149+
&amp; | &amp; \sigma_{\theta}(Q) \\
150+
&amp; | &amp; \pi_{U}(Q) \\
151+
&amp; | &amp; Q_1 \bowtie Q_2 \\
152+
&amp; | &amp; Q_1 \cup Q_2 \\
153+
&amp; | &amp; \rho_{A \mapsto B}(Q) \\
154+
\end{array}
155+
$$</code></pre>
156+
<p>This is the semantics:</p>
157+
<pre><code>$$
158+
\begin{array}{rrl}
159+
\denote{R}(I) &amp; = &amp;
160+
I(R) \\
161+
\denote{\set{t}}(I) &amp; = &amp;
162+
\set{t} \\
163+
\denote{\sigma_{\theta}(Q)}(I) &amp; = &amp;
164+
\setst{t \in \denote{Q}(I)}{\theta(t)} \\
165+
\denote{\pi_{U}(Q)}(I) &amp; = &amp;
166+
\setst{t[U]}{t \in \denote{Q}(I)} \\
167+
\denote{Q_1 \bowtie Q_2}(I) &amp; = &amp;
168+
\setst{t}{t[U_1] \in \denote{Q_1}(I), t[U_2] \in \denote{Q_2}(I)} \\
169+
\denote{Q_1 \cup Q_2}(I) &amp; = &amp;
170+
\denote{Q_1}(I) \cup \denote{Q_2}(I) \\
171+
\denote{\rho_{A \mapsto B}(Q)}(I) &amp; = &amp;
172+
\setst{t[A \mapsto B]}{t \in \denote{Q}(I)} \\
173+
\end{array}
174+
$$</code></pre>
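<p>One way to read the semantics is as a recursive evaluator. The sketch below assumes queries are encoded as nested Python tuples tagged with an operator name, rows are frozensets of (field, value) pairs so that relations can be ordinary sets, and the join is a natural join on shared fields. The encoding and the names <code>row</code> and <code>evaluate</code> are assumptions for illustration only.</p>

```python
def row(**fields):
    """Build a hashable tuple from keyword arguments."""
    return frozenset(fields.items())


def evaluate(q, I):
    """Evaluate a monotone relational algebra query q against database I."""
    op = q[0]
    if op == "rel":      # relation name R: look it up in the instance
        return set(I[q[1]])
    if op == "const":    # singleton {t}: evaluates to itself
        return {q[1]}
    if op == "select":   # sigma_theta(Q)
        _, theta, sub = q
        return {t for t in evaluate(sub, I) if theta(dict(t))}
    if op == "project":  # pi_U(Q)
        _, U, sub = q
        return {frozenset((A, v) for A, v in t if A in U) for t in evaluate(sub, I)}
    if op == "join":     # Q1 join Q2: rows must agree on all shared fields
        out = set()
        for l in evaluate(q[1], I):
            for r in evaluate(q[2], I):
                dl, dr = dict(l), dict(r)
                if all(dl[A] == dr[A] for A in dl.keys() & dr.keys()):
                    out.add(frozenset({**dl, **dr}.items()))
        return out
    if op == "union":    # Q1 union Q2
        return evaluate(q[1], I) | evaluate(q[2], I)
    if op == "rename":   # rho_{A -> B}(Q)
        _, A, B, sub = q
        return {frozenset((B if k == A else k, v) for k, v in t) for t in evaluate(sub, I)}
    raise ValueError(f"unknown operator: {op}")


I = {
    "R": {row(A=1), row(A=2)},
    "S": {row(A=1, B="blue"), row(A=1, B="red"), row(A=2, B="blue"), row(A=2, B="red")},
}
blue = ("select", lambda t: t["B"] == "blue", ("rel", "S"))
Q = ("project", {"A"}, ("join", ("rel", "R"), blue))
print(sorted(sorted(t) for t in evaluate(Q, I)))
```

<p>Because relations are sets here, duplicate rows of $S$ collapse, so this sketch computes only the query result and carries no provenance.</p>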
175+
<h2 id="chapter-2-why-provenance">Chapter 2: Why-Provenance</h2>
176+
<p>TODO</p>
177+
<h2 id="chapter-3-how-provenance">Chapter 3: How-Provenance</h2>
178+
<p>TODO</p>
179+
<h2 id="chapter-4-where-provenance">Chapter 4: Where-Provenance</h2>
180+
<p>TODO</p>
181+
<h2 id="chapter-5-comparing-models-of-provenance">Chapter 5: Comparing Models of Provenance</h2>
182+
<p>TODO</p>
183+
<script type="text/javascript" async
184+
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
185+
</script>
186+
187+
188+
<script type="text/x-mathjax-config">
189+
MathJax.Hub.Config({
190+
tex2jax: {
191+
inlineMath: [['$','$'], ['\\(','\\)']],
192+
skipTags: ['script', 'noscript', 'style', 'textarea'],
193+
},
194+
messageStyle: "none",
195+
});
196+
</script>
197+
</div>
198+
</body>
199+
</html>

index.html

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -55,6 +55,7 @@ <h1 id="indextitle">Papers</h1>
5555
<li><a href="html/yu2008dryadlinq.html">DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language <span class="year">(2008)</span></a></li>
5656
<li><a href="html/letia2009crdts.html">CRDTs: Consistency without concurrency control <span class="year">(2009)</span></a></li>
5757
<li><a href="html/graefe2009five.html">The Five-Minute Rule 20 Years Later <span class="year">(2009)</span></a></li>
58+
<li><a href="html/cheney2009provenance.html">Provenance in Databases: Why, How, and Where <span class="year">(2009)</span></a></li>
5859
<li><a href="html/lagar2009snowflock.html">SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing <span class="year">(2009)</span></a></li>
5960
<li><a href="html/alvaro2010boom.html">BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud <span class="year">(2010)</span></a></li>
6061
<li><a href="html/sigelman2010dapper.html">Dapper, a Large-Scale Distributed Systems Tracing Infrastructure<span class="year">(2010)</span></a></li>
