Commit b75635c

committed
Added (most of) index paper.
1 parent 38f711e commit b75635c

File tree

3 files changed

+257
-0
lines changed

html/o1997improved.html

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
1+
<!DOCTYPE html>
2+
<html>
3+
<head>
4+
<title>Papers</title>
5+
<link href='../css/style.css' rel='stylesheet'>
6+
<meta name=viewport content="width=device-width, initial-scale=1">
7+
</head>
8+
9+
<body>
10+
<div id=header>
11+
<a href="../">Papers</a>
12+
</div>
13+
<div id="container">
14+
<h1 id="improved-query-performance-with-variant-indexes-1997"><a href="https://scholar.google.com/scholar?cluster=3279297021955127822">Improved Query Performance with Variant Indexes (1997)</a></h1>
15+
<p>This paper surveys three types of indexes: value-list indexes (old), bit-sliced indexes (new), and projection indexes (new). It then shows how to compute aggregates, range predicates, and OLAP queries using these three types of indexes.</p>
16+
<h2 id="value-list-indexes">Value-List Indexes</h2>
17+
<p>A <strong>Value-List index</strong> is a B+ tree index. Each leaf of a Value-List index either stores a list of record ids (RIDs) or a bitmap.</p>
18+
<p>A <strong>bitmap</strong> on a set $T$ of $n$ tuples compactly represents a subset of $T$. It is implemented as an $M$-length bitstring and a mapping $m: T \to [0, M-1]$. If $t_i$ is present in the subset, then the $m(t_i)$th bit in the bitstring is set. Note that $m(t_i)$ does not have to be $i$. Often, if a tuple $t$ is the $i$th tuple on page $p$, then $m(t)$ is a number $j$ where the high order bits of $j$ are $p$ and the low order bits of $j$ are $i$.</p>
19+
<p>The leaf entry for key $k$ in a bitmap Value-List index is a bitmap indicating which tuples have key $k$. If the index key of the B+ tree has only a few values, then a bitmap B+ tree can take up less space than an RID B+ tree.</p>
20+
<p>Moreover, bitwise operations over a bitmap can be computed very efficiently. This comes in handy. For example, imagine we have the query <code>SELECT * FROM R WHERE a and b</code>. If we compute two bitmaps $f_a$ and $f_b$ indicating which tuples of <code>R</code> satisfy <code>a</code> and <code>b</code>, then we can quickly compute the bitwise AND of $f_a$ and $f_b$.</p>
21+
<p>Imagine that we can fit 1000 bits on a single page. We can segment the rows of a table into sets of 1000. This lets us compress RID lists and also avoid some bitstring operations (see paper for details).</p>
22+
<h2 id="projection-indexes">Projection Indexes</h2>
23+
<p>A <strong>projection index</strong> on a column is just that column stored contiguously. For example, if we had the following table <code>R(a, b, c)</code>:</p>
24+
<pre><code>+---+---+---+
25+
| a | b | c |
26+
+---+---+---+
27+
| 1 | 2 | 3 |
28+
| 2 | 3 | 4 |
29+
| 3 | 4 | 5 |
30+
| 4 | 5 | 6 |
31+
| 5 | 6 | 7 |
32+
+---+---+---+</code></pre>
33+
<p>then a projection index on <code>b</code> would be</p>
34+
<pre><code>+---+
35+
| b |
36+
+---+
37+
| 2 |
38+
| 3 |
39+
| 4 |
40+
| 5 |
41+
| 6 |
42+
+---+</code></pre>
43+
<h2 id="bit-sliced-indexes">Bit-Sliced Indexes</h2>
44+
<p>Imagine a column of integers that looks something like this:</p>
45+
<pre><code>+---+
46+
| 0 |
47+
| 1 |
48+
| 2 |
49+
| 3 |
50+
| 4 |
51+
+---+</code></pre>
52+
<p>We can view each integer as a bitstring:</p>
53+
<pre><code>+-----+
54+
| 000 |
55+
| 001 |
56+
| 010 |
57+
| 011 |
58+
| 100 |
59+
+-----+</code></pre>
60+
<p>A <strong>bit-sliced index</strong> stores a bitstring for every column of bits. For example, a bit-sliced index on the column above would store <code>00001</code> (first column), <code>00110</code> (second column), and <code>01010</code> (third column).</p>
61+
<h2 id="computing-aggregates-with-indexes">Computing Aggregates with Indexes</h2>
62+
<p>Imagine we want to compute the query <code>SELECT SUM(c) FROM R WHERE p</code> for some predicate <code>p</code>. Imagine we have already computed a bitmap $f_p$ indicating which tuples satisfy <code>p</code>. Here's how to compute the query with the various indexes:</p>
63+
<ol style="list-style-type: decimal">
64+
<li><strong>No index.</strong> Without any index, we're forced to read through <code>R</code>. Assuming that only a fraction of the tuples in <code>R</code> satisfy <code>p</code>, some pages of <code>R</code> end up not having any satisfying tuples, so we don't have to read those.</li>
65+
<li><strong>Value-List bitmap index.</strong> We iterate over every key $k$ to retrieve a bitmap $f_k$ and compute the bitwise AND of $f_k$ and $f_p$. We compute the popcount of this AND, multiply it by $k$, and add it to our running sum.</li>
66+
<li><strong>Projection index.</strong> We iterate through the projection index and add any value with a bit set in $f_p$.</li>
67+
<li><strong>Bit-sliced index.</strong> For each bit slice $c_i$ (the slice holding the bit of weight $2^i$), we add $\text{popcount}(c_i \land f_p) * 2^i$ to our sum, where $\land$ is bitwise AND.</li>
68+
</ol>
69+
<p>There are other algorithms to compute other aggregate functions as well (see paper). Here is a summary of the best index for each aggregate:</p>
70+
<table>
71+
<thead>
72+
<tr class="header">
73+
<th align="left">Aggregate</th>
74+
<th align="left">Best Index</th>
75+
</tr>
76+
</thead>
77+
<tbody>
78+
<tr class="odd">
79+
<td align="left">sum</td>
80+
<td align="left">bit-sliced</td>
81+
</tr>
82+
<tr class="even">
83+
<td align="left">count</td>
84+
<td align="left">no index needed</td>
85+
</tr>
86+
<tr class="odd">
87+
<td align="left">average</td>
88+
<td align="left">bit-sliced</td>
89+
</tr>
90+
<tr class="even">
91+
<td align="left">max/min</td>
92+
<td align="left">value-list</td>
93+
</tr>
94+
<tr class="odd">
95+
<td align="left">median</td>
96+
<td align="left">value-list</td>
97+
</tr>
98+
</tbody>
99+
</table>
100+
<h2 id="computing-range-predicates-with-indexes">Computing Range Predicates with Indexes</h2>
101+
<p>Imagine we want to compute the query <code>SELECT * FROM R WHERE c &gt; 100 AND p</code> for some arbitrary predicate <code>p</code>. Given a bitmap $f_p$ indicating which tuples satisfy <code>p</code>, we want to compute a bitmap $f$ indicating which tuples satisfy <code>p</code> and the range predicate <code>c &gt; 100</code>.</p>
102+
<ol style="list-style-type: decimal">
103+
<li><strong>Value-List bitmap index.</strong> We OR together every bitmap $b$ for every key $k$ that satisfies the range predicate and then AND it with $f_p$.</li>
104+
<li><strong>Projection index.</strong> We iterate through the values indicated by $f_p$ and see which satisfy the range predicate.</li>
105+
<li><strong>Bit-sliced index.</strong> We perform some intense bit tricks (see paper).</li>
106+
</ol>
107+
<p>In summary, Value-List indexes are best for narrow ranges and bit-sliced indexes are best for wide ranges.</p>
108+
<p>TODO(mwhittaker): Read and summarize the last three sections of this paper. They are pretty dense and a little boring.</p>
109+
<script type="text/javascript" async
110+
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML">
111+
</script>
112+
113+
114+
</div>
115+
116+
<script type="text/javascript" src="../js/mathjax_config.js"></script>
117+
</body>
118+
</html>

index.html

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@
3636
<li><a href="html/bershad1995spin.html">SPIN -- An Extensible Microkernel for Application-specific Operating System Services <span class="year">(1995)</span></a></li>
3737
<li><a href="html/wilkes1996hp.html">The HP AutoRAID hierarchical storage system <span class="year">(1996)</span></a></li>
3838
<li><a href="html/bugnion1997disco.html">Disco: Running Commodity Operating Systems on Scalable Multiprocessors <span class="year">(1997)</span></a></li>
39+
<li><a href="html/o1997improved.html">Improved Query Performance with Variant Indexes <span class="year">(1997)</span></a></li>
3940
<li><a href="html/lehman1999t.html">T Spaces: The Next Wave <span class="year">(1999)</span></a></li>
4041
<li><a href="html/avnur2000eddies.html">Eddies: Continuously Adaptive Query Processing <span class="year">(1999)</span></a></li>
4142
<li><a href="html/adya2000generalized.html">Generalized Isolation Level Definitions <span class="year">(2000)</span></a></li>

papers/o1997improved.md

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
1+
# [Improved Query Performance with Variant Indexes (1997)](https://scholar.google.com/scholar?cluster=3279297021955127822)
2+
This paper surveys three types of indexes: value-list indexes (old), bit-sliced
3+
indexes (new), and projection indexes (new). It then shows how to compute
4+
aggregates, range predicates, and OLAP queries using these three types of
5+
indexes.
6+
7+
## Value-List Indexes
8+
A **Value-List index** is a B+ tree index. Each leaf of a Value-List index
9+
either stores a list of record ids (RIDs) or a bitmap.
10+
11+
A **bitmap** on a set $T$ of $n$ tuples compactly represents a subset of $T$.
12+
It is implemented as an $M$-length bitstring and a mapping $m: T \to [0, M-1]$.
13+
If $t_i$ is present in the subset, then the $m(t_i)$th bit in the bitstring is
14+
set. Note that $m(t_i)$ does not have to be $i$. Often, if a tuple $t$ is
15+
the $i$th tuple on page $p$, then $m(t)$ is a number $j$ where the high order
16+
bits of $j$ are $p$ and the low order bits of $j$ are $i$.
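The page/slot mapping can be sketched in a few lines of Python. This is only an illustration: `SLOT_BITS`, `bit_position`, and `make_bitmap` are hypothetical names, the bitmap is a plain Python integer, and the assumption that a page holds at most 16 tuples is made up for the example.

```python
# Hypothetical layout: each page holds at most 2**SLOT_BITS tuples.
SLOT_BITS = 4


def bit_position(page: int, slot: int) -> int:
    """Compute m(t) for the slot-th tuple on a page.

    The high-order bits of the result are the page number and the
    low-order bits are the slot number within the page.
    """
    assert 0 <= slot < (1 << SLOT_BITS)
    return (page << SLOT_BITS) | slot


def make_bitmap(tuples) -> int:
    """Build a bitmap (a Python int) for a subset of (page, slot) tuples."""
    bitmap = 0
    for page, slot in tuples:
        bitmap |= 1 << bit_position(page, slot)
    return bitmap


# The subset contains tuple 1 on page 0 and tuple 3 on page 2.
bitmap = make_bitmap([(0, 1), (2, 3)])
```

With this layout some bit positions may never correspond to a real tuple, which is why $M$ can be larger than $n$.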
17+
18+
The leaf entry for key $k$ in a bitmap Value-List index is a bitmap indicating
19+
which tuples have key $k$. If the index key of the B+ tree has only a few
20+
values, then a bitmap B+ tree can take up less space than an RID B+ tree.
21+
22+
Moreover, bitwise operations over a bitmap can be computed very efficiently.
23+
This comes in handy. For example, imagine we have the query `SELECT * FROM R
24+
WHERE a and b`. If we compute two bitmaps $f_a$ and $f_b$ indicating which
25+
tuples of `R` satisfy `a` and `b`, then we can quickly compute the bitwise AND
26+
of $f_a$ and $f_b$.
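Here is a small sketch of that idea, using Python integers as bitmaps (bit $i$ stands for the $i$th row). The two predicates and the helper names are made up for illustration; a real system would get $f_a$ and $f_b$ from indexes rather than by scanning `R`.

```python
def bitmap_from_predicate(rows, pred) -> int:
    """Return a bitmap with bit i set iff rows[i] satisfies pred."""
    bitmap = 0
    for i, row in enumerate(rows):
        if pred(row):
            bitmap |= 1 << i
    return bitmap


def matching_rows(bitmap: int, rows):
    """Return the rows whose bit is set in the bitmap."""
    return [row for i, row in enumerate(rows) if (bitmap >> i) & 1]


R = [(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7)]

# Hypothetical predicates: a is "first column >= 2", b is "second column is even".
f_a = bitmap_from_predicate(R, lambda r: r[0] >= 2)
f_b = bitmap_from_predicate(R, lambda r: r[1] % 2 == 0)

# A single bitwise AND yields the rows that satisfy both predicates.
f = f_a & f_b
result = matching_rows(f, R)
```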
27+
28+
Imagine that we can fit 1000 bits on a single page. We can segment the rows of
29+
a table into sets of 1000. This lets us compress RID lists and also avoid some
30+
bitstring operations (see paper for details).
31+
32+
## Projection Indexes
33+
A **projection index** on a column is just that column stored contiguously. For
34+
example, if we had the following table `R(a, b, c)`:
35+
36+
```
37+
+---+---+---+
38+
| a | b | c |
39+
+---+---+---+
40+
| 1 | 2 | 3 |
41+
| 2 | 3 | 4 |
42+
| 3 | 4 | 5 |
43+
| 4 | 5 | 6 |
44+
| 5 | 6 | 7 |
45+
+---+---+---+
46+
```
47+
48+
then a projection index on `b` would be
49+
50+
```
51+
+---+
52+
| b |
53+
+---+
54+
| 2 |
55+
| 3 |
56+
| 4 |
57+
| 5 |
58+
| 6 |
59+
+---+
60+
```
61+
62+
## Bit-Sliced Indexes
63+
Imagine a column of integers that looks something like this:
64+
65+
```
66+
+---+
67+
| 0 |
68+
| 1 |
69+
| 2 |
70+
| 3 |
71+
| 4 |
72+
+---+
73+
```
74+
75+
We can view each integer as a bitstring:
76+
77+
```
78+
+-----+
79+
| 000 |
80+
| 001 |
81+
| 010 |
82+
| 011 |
83+
| 100 |
84+
+-----+
85+
```
86+
87+
A **bit-sliced index** stores a bitstring for every column of bits. For
88+
example, a bit-sliced index on the column above would store `00001` (first
89+
column), `00110` (second column), and `01010` (third column).
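As a rough sketch, the slices for the example column can be built like this. The function names are hypothetical, each slice is stored as a Python integer, and `slices[i]` holds the bit of weight $2^i$, so the "first column" in the text is the last slice here.

```python
def build_bit_sliced_index(values, num_bits):
    """Return slices where bit j of slices[i] is bit i (weight 2**i) of values[j]."""
    slices = []
    for i in range(num_bits):
        s = 0
        for j, v in enumerate(values):
            if (v >> i) & 1:
                s |= 1 << j
        slices.append(s)
    return slices


def as_bitstring(bitmap, n):
    """Render a bitmap with row 0 leftmost, matching the strings in the text."""
    return "".join("1" if (bitmap >> j) & 1 else "0" for j in range(n))


column = [0, 1, 2, 3, 4]
slices = build_bit_sliced_index(column, num_bits=3)

# Most significant slice first, as in the prose above.
strings = [as_bitstring(s, len(column)) for s in reversed(slices)]
```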
90+
91+
## Computing Aggregates with Indexes
92+
Imagine we want to compute the query `SELECT SUM(c) FROM R WHERE p` for some
93+
predicate `p`. Imagine we have already computed a bitmap $f_p$ indicating which
94+
tuples satisfy `p`. Here's how to compute the query with the various indexes:
95+
96+
1. **No index.** Without any index, we're forced to read through `R`. Assuming
97+
that only a fraction of the tuples in `R` satisfy `p`, some pages of `R` end
98+
up not having any satisfying tuples, so we don't have to read those.
99+
2. **Value-List bitmap index.** We iterate over every key $k$ to retrieve a
100+
bitmap $f_k$ and compute the bitwise AND of $f_k$ and $f_p$. We compute the
101+
popcount of this AND, multiply it by $k$, and add it to our running sum.
102+
3. **Projection index.** We iterate through the projection index and add any
103+
value with a bit set in $f_p$.
104+
4. **Bit-sliced index.** For each bit slice $c_i$ (the slice holding the bit of
105+
weight $2^i$), we add $\text{popcount}(c_i \land f_p) * 2^i$ to our sum.
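The three index-based strategies can be sketched as follows. This is an in-memory toy, not the paper's implementation: bitmaps are Python integers, the value-list index is a dictionary from key to bitmap, and $f_p$ is an arbitrary bitmap chosen for the example. For the bit-sliced index, each slice is ANDed with $f_p$ before taking the popcount, and the slice for weight $2^i$ contributes that popcount times $2^i$.

```python
from collections import defaultdict


def popcount(x: int) -> int:
    """Count the set bits of a non-negative integer."""
    return bin(x).count("1")


c = [3, 4, 5, 6, 7]  # column c of the example table R
f_p = 0b10110        # hypothetical: rows 1, 2, and 4 satisfy p

# Projection index: scan the column and add values whose bit is set in f_p.
sum_projection = sum(v for j, v in enumerate(c) if (f_p >> j) & 1)

# Value-List bitmap index: one bitmap f_k per key k.
value_list = defaultdict(int)
for j, v in enumerate(c):
    value_list[v] |= 1 << j
sum_value_list = sum(k * popcount(f_k & f_p) for k, f_k in value_list.items())

# Bit-sliced index: slices[i] holds the bit of weight 2**i for every row.
num_bits = max(c).bit_length()
slices = [
    sum(((v >> i) & 1) << j for j, v in enumerate(c))
    for i in range(num_bits)
]
sum_bit_sliced = sum(popcount(s & f_p) << i for i, s in enumerate(slices))
```

All three strategies compute the same sum; they differ in how many bitmaps or values they have to touch.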
106+
107+
There are other algorithms to compute other aggregate functions as well (see
108+
paper). Here is a summary of the best index for each aggregate:
109+
110+
| Aggregate | Best Index |
111+
| --------- | --------------- |
112+
| sum | bit-sliced |
113+
| count | no index needed |
114+
| average | bit-sliced |
115+
| max/min | value-list |
116+
| median | value-list |
117+
118+
## Computing Range Predicates with Indexes
119+
Imagine we want to compute the query `SELECT * FROM R WHERE c > 100 AND p` for
120+
some arbitrary predicate `p`. Given a bitmap $f_p$ indicating which tuples
121+
satisfy `p`, we want to compute a bitmap $f$ indicating which tuples satisfy
122+
`p` and the range predicate `c > 100`.
123+
124+
1. **Value-List bitmap index.** We OR together every bitmap $b$ for every key
125+
$k$ that satisfies the range predicate and then AND it with $f_p$.
126+
2. **Projection index.** We iterate through the values indicated by $f_p$ and
127+
see which satisfy the range predicate.
128+
3. **Bit-sliced index.** We perform some intense bit tricks (see paper).
129+
130+
In summary, Value-List indexes are best for narrow ranges and bit-sliced
131+
indexes are best for wide ranges.
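Here is a sketch of all three approaches for a predicate of the form `c > threshold`. The value-list and projection versions follow the descriptions above directly. The bit-sliced version is one common way to do the comparison (walk the slices from most to least significant, tracking which rows are still equal to the threshold so far); it is an assumption that this matches the spirit of the paper's algorithm, and the details there may differ. All names and the sample data are hypothetical, and values are assumed to be non-negative integers.

```python
def range_gt_value_list(value_list, f_p, threshold):
    """OR together the bitmaps of every key > threshold, then AND with f_p."""
    f = 0
    for k, f_k in value_list.items():
        if k > threshold:
            f |= f_k
    return f & f_p


def range_gt_projection(column, f_p, threshold):
    """Check only the values whose bit is set in f_p."""
    f = 0
    for j, v in enumerate(column):
        if (f_p >> j) & 1 and v > threshold:
            f |= 1 << j
    return f


def range_gt_bit_sliced(slices, f_p, threshold, n):
    """Compare every row against threshold using only bitwise operations."""
    if threshold >> len(slices):
        return 0  # threshold exceeds every representable value
    all_rows = (1 << n) - 1
    gt, eq = 0, all_rows  # rows known greater / rows equal on the bits seen so far
    for i in reversed(range(len(slices))):
        if (threshold >> i) & 1:
            eq &= slices[i]
        else:
            gt |= eq & slices[i]
            eq &= ~slices[i] & all_rows
    return gt & f_p


column = [50, 100, 101, 200, 150]
f_p = 0b11011  # hypothetical: every row except row 2 satisfies p

value_list = {}
for j, v in enumerate(column):
    value_list[v] = value_list.get(v, 0) | (1 << j)

num_bits = max(column).bit_length()
slices = [
    sum(((v >> i) & 1) << j for j, v in enumerate(column))
    for i in range(num_bits)
]

f_vl = range_gt_value_list(value_list, f_p, 100)
f_proj = range_gt_projection(column, f_p, 100)
f_bs = range_gt_bit_sliced(slices, f_p, 100, len(column))
```

The value-list version touches one bitmap per qualifying key, so its cost grows with the width of the range, while the bit-sliced version always touches one bitmap per bit of the column.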
132+
133+
TODO(mwhittaker): Read and summarize the last three sections of this paper.
134+
They are pretty dense and a little boring.
135+
136+
<script type="text/javascript" async
137+
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML">
138+
</script>
