
Commit 3a090ad

tby authored and Chromium LUCI CQ committed
[Categorical search] Add documentation for score normalizer
The forthcoming score normalization algorithm, which is a tweak of adafang@'s
previous work, is complicated enough that it warrants some documentation in
the codebase. This CL adds a .md, and follow-up CLs will implement the
algorithm itself.

Bug: 1199206
Change-Id: I87a77e1579c65422c00d0f908916b56a7e35cb4a
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3189836
Commit-Queue: Tony Yeoman <[email protected]>
Reviewed-by: Rachel Wong <[email protected]>
Cr-Commit-Position: refs/heads/main@{#925647}
1 parent 88e218e commit 3a090ad

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@

# Score normalizer

The score normalizer is a heuristic algorithm that attempts to map an incoming
stream of numbers drawn from some unknown distribution to a uniform
distribution. This allows us to directly compare scores produced by different
underlying distributions, namely, by different search providers.

## Overview and terminology

The input to the normalizer is a stream of scores, `x`, drawn from an
underlying continuous distribution, `P(x)`. Its goal is to learn a function
`f(x)` such that `f(x ~ P)` is a uniform distribution over `[0,1]`.
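
For intuition, here is a minimal sketch (an illustration only, not part of this
CL): the ideal `f` is the CDF of `P`, since applying a distribution's CDF to
its own samples yields `Uniform(0,1)`. The exponential distribution below is
just an arbitrary stand-in for `P`.

```
# Illustration only: if we knew the true CDF of P, it would be exactly the f
# we want. Here P is Exponential(1), whose CDF is 1 - exp(-x).
import math
import random

samples = [random.expovariate(1.0) for _ in range(100000)]  # x ~ P
normalized = [1.0 - math.exp(-x) for x in samples]          # f(x) = CDF_P(x)

# Each tenth of [0, 1] should now hold roughly 10% of the scores.
counts = [0] * 10
for y in normalized:
    counts[min(int(y * 10), 9)] += 1
print([round(c / len(normalized), 3) for c in counts])
```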

## Reservoir sampling

Reservoir sampling is a method for computing `f`. Its simplest form is:

- Take the first `N` (e.g. 100) scores as a 'reservoir'.
- Define `f(x) = sum(y < x for y in reservoir) / N`

In other words, we take the first `N` scores as a reservoir, and then
normalize a new score by computing its quantile within the reservoir.
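
A minimal sketch of this simple form (illustrative Python; the names are
invented here, and the real implementation is the C++ one this document
accompanies):

```
# Simple reservoir normalizer: keep the first N scores, then normalize a new
# score as its quantile within that reservoir.
class SimpleReservoirNormalizer:
    def __init__(self, n=100):
        self.n = n
        self.reservoir = []

    def update(self, score):
        # Only the first N scores are kept as dividers.
        if len(self.reservoir) < self.n:
            self.reservoir.append(score)

    def normalize(self, score):
        # f(x) = sum(y < x for y in reservoir) / N.
        if not self.reservoir:
            return 0.0
        return sum(y < score for y in self.reservoir) / len(self.reservoir)
```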

We've constructed `f` by choosing a series of dividers (the reservoir scores)
which we hope are a representative sample of `P`. Consequently, a new `x ~ P`
will fall into any 'bucket' between two dividers with equal probability. So,
the index of the bucket `x` falls into is the quantile of `x` within `P`. Thus,
with some allowance for the fact that `f` is discrete, `f(x ~ P) ~= U(0,1)`.

The drawback of this is that the reservoir needs to be quite large before
this is effective.

## Bin-entropy reservoir sampling

We're going to trade off some accuracy for efficiency, and build on the core
idea of reservoir sampling. We will store 'bins' that cover the real line, and
try to keep the number of elements we've seen in each bin approximately equal.
Then we can compute `f(x)` by using the bin index of `x` as a quantile.

__Data structure__. We store a vector of `N` 'bins', each one with a count and
a lower divider. The bottom bin always has a lower divider of negative
infinity. `bins[i].count` roughly tracks the number of scores we've seen
between `bins[i].lower_divider` and `bins[i+1].lower_divider`.
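
A sketch of this structure and of the resulting `f(x)` lookup (illustrative
Python; the field names simply follow the description above):

```
import bisect
from dataclasses import dataclass

@dataclass
class Bin:
    lower_divider: float  # Lower bound of the bin; -inf for the bottom bin.
    count: float          # Approximate number of scores seen in this bin.

def bin_index(bins, score):
    # Index of the last bin whose lower divider is <= score.
    dividers = [b.lower_divider for b in bins]
    return bisect.bisect_right(dividers, score) - 1

def normalize(bins, score):
    # f(x): the bin index of x, read as a quantile.
    return bin_index(bins, score) / len(bins) if bins else 0.0
```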

__Algorithm overview.__ Each time a score is recorded, we increment the count
of the appropriate bin. Then, we propose a repartition of the bins as follows:

- Split the bin corresponding to the new score into two bins, divided by the
  new score. Give each new bin half of the old bin's count.
- Merge together the two smallest contiguous bins, so that the total number of
  bins stays the same.

We accept this repartition if it increases the evenness of the counts across
all bins, which we measure using entropy. For simplicity, we prevent the
merged bins from containing either of the split bins.

__Entropy.__ Given a collection of bins, its entropy is:

```
H(p) = -sum( bin.count/total * log(bin.count/total) for bin in bins )
```

Entropy has two key properties:
1. The discrete distribution with maximum entropy is the uniform distribution.
2. Entropy is concave.

As such, we can make our bins as uniform as possible by seeking to maximize
entropy. This is implemented by only accepting a repartition if it increases
entropy.

__Fast entropy calculations.__ Suppose we are considering a repartition that:
- splits bin `s` into `s_l` and `s_r`,
- merges bins `m_l` and `m_r` into `m`.

We don't need to compute the full entropy over the old and new sets of bins,
because the entropy contributions of bins not involved in the split or merge
cancel out. As such, we can calculate the change with only six terms as
follows, where `p(bin) = bin.count / total`.

```
H(new) - H(old) = [ -p(s_l) log p(s_l) - p(s_r) log p(s_r) - p(m) log p(m) ]
                - [ -p(s) log p(s) - p(m_l) log p(m_l) - p(m_r) log p(m_r) ]
```
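
A sketch of that delta computation (illustrative Python; the arguments are the
six bin counts, and `total` is the sum of all bin counts, which the
repartition leaves unchanged):

```
import math

def neg_p_log_p(count, total):
    # One -p * log(p) entropy term; an empty bin contributes nothing.
    if count == 0:
        return 0.0
    p = count / total
    return -p * math.log(p)

def entropy_delta(s, s_l, s_r, m_l, m_r, total):
    # H(new) - H(old), using only the bins touched by the repartition.
    m = m_l + m_r
    new_terms = neg_p_log_p(s_l, total) + neg_p_log_p(s_r, total) + neg_p_log_p(m, total)
    old_terms = neg_p_log_p(s, total) + neg_p_log_p(m_l, total) + neg_p_log_p(m_r, total)
    return new_terms - old_terms
```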

__Algorithm pseudocode.__ This leaves us with the following algorithm for
an update.

```
def update(score):
  if there aren't N bins yet:
    insert Bin(lower_divider=score, count=1)
    return

  s = score's bin
  s.count++

  m_l, m_r = smallest contiguous pair of bins separate from s
  m = Bin(lower_divider=m_l.lower_divider, count=m_l.count + m_r.count)

  s_l = Bin(lower_divider=s.lower_divider, count=s.count/2)
  s_r = Bin(lower_divider=score, count=s.count/2)

  if H(new) > H(old):
    replace s with s_l and s_r
    replace m_l and m_r with m
```
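
For reference, here is a runnable sketch of the whole update in Python
(illustration only: the class and helper names are invented, the bootstrap
details are simplified, and the real implementation is the C++ one this
document accompanies):

```
import bisect
import math
from dataclasses import dataclass

@dataclass
class Bin:
    lower_divider: float  # -inf for the bottom bin.
    count: float

class BinEntropyNormalizer:
    def __init__(self, num_bins=5):
        self.num_bins = num_bins
        self.bins = [Bin(float("-inf"), 0.0)]

    def _index_of(self, score):
        # Last bin whose lower divider is <= score.
        dividers = [b.lower_divider for b in self.bins]
        return bisect.bisect_right(dividers, score) - 1

    def normalize(self, score):
        # f(x): the bin index of the score, read as a quantile.
        return self._index_of(score) / len(self.bins)

    def update(self, score):
        # Bootstrap: use early scores as dividers until there are N bins.
        if len(self.bins) < self.num_bins:
            self.bins.insert(self._index_of(score) + 1, Bin(score, 1.0))
            return

        # Record the score in its bin.
        i = self._index_of(score)
        s = self.bins[i]
        s.count += 1
        total = sum(b.count for b in self.bins)

        # Propose splitting s at the new score...
        s_l = Bin(s.lower_divider, s.count / 2)
        s_r = Bin(score, s.count / 2)

        # ...and merging the smallest contiguous pair of bins separate from s.
        best_j, best_pair = None, math.inf
        for j in range(len(self.bins) - 1):
            if j == i or j + 1 == i:
                continue
            pair = self.bins[j].count + self.bins[j + 1].count
            if pair < best_pair:
                best_j, best_pair = j, pair
        if best_j is None:
            return  # Too few bins to merge away from s.
        m_l, m_r = self.bins[best_j], self.bins[best_j + 1]
        m = Bin(m_l.lower_divider, m_l.count + m_r.count)

        # Accept the repartition only if it increases entropy.
        def h(count):
            p = count / total
            return -p * math.log(p) if p > 0 else 0.0

        delta = (h(s_l.count) + h(s_r.count) + h(m.count)
                 - h(s.count) - h(m_l.count) - h(m_r.count))
        if delta <= 0:
            return
        new_bins = []
        for k, b in enumerate(self.bins):
            if k == i:
                new_bins.extend([s_l, s_r])
            elif k == best_j:
                new_bins.append(m)
            elif k == best_j + 1:
                continue
            else:
                new_bins.append(b)
        self.bins = new_bins
```

A possible usage pattern, following the intro above, is one normalizer per
search provider: call `update()` on every score the provider emits, and
`normalize()` when comparing scores across providers.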

## Performance

We did some benchmarking under these conditions:
- Two tests: samples from `Beta(2,5)`, and a power-law distribution.
- The bin-entropy algorithm targeting 5 bins.
- The 'simple' algorithm from above storing `N` samples.
- The 'better-simple' algorithm that stores the last `N` samples and uses
  them to decide the dividers for 5 bins (sketched below).
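
A sketch of that better-simple baseline (illustrative Python, not from the CL):

```
from collections import deque

class BetterSimpleNormalizer:
    # Keeps only the most recent n samples and derives the dividers for
    # num_bins bins from their quantiles.
    def __init__(self, n=150, num_bins=5):
        self.samples = deque(maxlen=n)
        self.num_bins = num_bins

    def update(self, score):
        self.samples.append(score)

    def normalize(self, score):
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        step = len(ordered) / self.num_bins
        dividers = [ordered[int(i * step)] for i in range(1, self.num_bins)]
        # Bin index as a quantile, as in the other algorithms.
        return sum(d <= score for d in dividers) / self.num_bins
```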

Results showed that the bin-entropy algorithm performs about as well as the
better-simple algorithm when `N = 150`, and actually slightly outperforms it
for `N < 150`. This may be due to the split/merge making it less sensitive to
noise. It's difficult to compare the results of the simple algorithm because
its number of bins is different, but its results were significantly noisier
than those of the better-simple algorithm.

## Ideas for improvement

Here are some things that anecdotally improve performance:

- The algorithm loses information when a bin is split, because splitting the
  counts evenly is uninformed. Tracking moments of the distribution within
  each bin can help make this a more informed choice.

- We only perform one split/merge operation per insert, but it's possible
  several split-merges should be done after one insert.

- Allowing the merged bins to contain one of the split bins provides a small
  improvement in performance.
