# Score normalizer

The score normalizer is a heuristic algorithm that attempts to map an incoming
stream of numbers drawn from some unknown distribution to a uniform
distribution. This allows us to directly compare scores produced by different
underlying distributions, namely, by different search providers.

## Overview and terminology

The input to the normalizer is a stream of scores, `x`, drawn from an
underlying continuous distribution, `P(x)`. Its goal is to learn a function
`f(x)` such that `f(x ~ P)` is a uniform distribution over `[0,1]`.

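The ideal such `f` is the cumulative distribution function of `P` itself (the
probability integral transform). As a quick, minimal sanity check of that fact
with a known distribution:

```
import math
import random

# Draw scores from an exponential distribution (rate 1), whose CDF is
# F(x) = 1 - exp(-x). Applying F to the samples should give ~U(0, 1).
samples = [random.expovariate(1.0) for _ in range(10_000)]
transformed = [1.0 - math.exp(-x) for x in samples]

# Bucket the transformed values; each of the ten buckets should hold
# roughly 1,000 of them.
buckets = [0] * 10
for u in transformed:
    buckets[min(int(u * 10), 9)] += 1
print(buckets)
```
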
## Reservoir sampling

Reservoir sampling is a method for computing `f`. Its simplest form is:

- Take the first `N` (e.g. 100) scores as a 'reservoir'.
- Define `f(x) = sum(y < x for y in reservoir) / N`

In other words, we take the first `N` scores as a reservoir, and then
normalize a new score by computing its quantile within the reservoir.

We've constructed `f` by choosing a series of dividers (the reservoir scores)
which we hope are a representative sample of `P`. Consequently, a new `x ~ P`
will fall into any 'bucket' between two dividers with equal probability, so
the index of the bucket `x` falls into, divided by `N`, approximates the
quantile of `x` within `P`. Thus, with some allowance for the fact that `f` is
discrete, `f(x ~ P) ~= U(0,1)`.

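A minimal sketch of this simple form (the class and method names here are
illustrative, not from the codebase):

```
class SimpleNormalizer:
    """The simple form above: a fixed reservoir of the first `n` scores."""

    def __init__(self, n: int = 100):
        self.n = n
        self.reservoir: list[float] = []

    def update(self, x: float) -> None:
        # Only the first `n` scores are kept; later scores don't change `f`.
        if len(self.reservoir) < self.n:
            self.reservoir.append(x)

    def normalize(self, x: float) -> float:
        # Quantile of `x` within the reservoir.
        return sum(y < x for y in self.reservoir) / max(len(self.reservoir), 1)
```
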
The drawback of this approach is that the reservoir needs to be quite large
before it is effective.

## Bin-entropy reservoir sampling

We're going to trade off some accuracy for efficiency, building on the core
idea of reservoir sampling. We will store 'bins' that cover the real line, and
try to keep the number of elements we've seen in each bin approximately equal.
Then we can compute `f(x)` by using the bin index of `x` as a quantile.

__Data structure.__ We store a vector of `N` 'bins', each one with a count and
a lower divider. The bottom bin always has a lower divider of negative
infinity. `bins[i].count` roughly tracks the number of scores we've seen
between `bins[i].lower_divider` and `bins[i+1].lower_divider`.

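In code, the data structure might look like this (a minimal sketch):

```
import math
from dataclasses import dataclass

@dataclass
class Bin:
    lower_divider: float  # this bin's lower edge; the next bin's edge bounds it above
    count: float          # approximate number of scores seen in this bin

# The bottom bin is unbounded below, so every incoming score lands in some bin.
bins = [Bin(lower_divider=-math.inf, count=0.0)]
```
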
__Algorithm overview.__ Each time a score is recorded, we increment the count
of the appropriate bin. Then we propose a repartition of the bins as follows:

- Split the bin corresponding to the new score into two bins, divided by the
  new score. Halve the old bin's count between the two new bins.

- Merge the adjacent pair of bins with the smallest combined count, so that
  the total number of bins stays the same.

We accept this repartition if it increases the evenness of the counts across
all bins, which we measure using entropy. For simplicity, we prevent the
merged pair from containing either of the split bins. For example, if the
counts are `[9, 2, 2, 3, 4]` after an insert into the first bin, the proposal
splits that bin into counts `4.5, 4.5` and merges the smallest adjacent pair
`(2, 2)`, giving `[4.5, 4.5, 4, 3, 4]`; the counts are more even, so the
repartition is accepted.

__Entropy.__ Given a collection of bins, its entropy is:

```
H(p) = -sum( bin.count/total * log(bin.count/total) for bin in bins )
```

Entropy has two key properties:
1. Among discrete distributions over a fixed set of outcomes, the uniform
   distribution has maximum entropy.
2. Entropy is concave.

As such, we can make our bins as uniform as possible by seeking to maximize
entropy. This is implemented by only accepting a repartition if it increases
entropy.

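For intuition, a quick standalone check of property 1 (natural log, matching
the formula above):

```
import math

def entropy(counts: list[float]) -> float:
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

print(entropy([20, 20, 20, 20, 20]))  # ~1.609 = log(5), the maximum for 5 bins
print(entropy([60, 10, 10, 10, 10]))  # ~1.227, lower because counts are uneven
```
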
__Fast entropy calculations.__ Suppose we are considering a repartition that:
 - splits bin `s` into `s_l` and `s_r`,
 - merges bins `m_l` and `m_r` into `m`.

We don't need to compute the full entropy over the old and new sets of bins,
because the entropy contributions of bins not involved in the split or merge
cancel out. As such, we can calculate the change with only six terms as
follows, where `p(bin) = bin.count / total` (the total is unchanged by a
repartition):

| 80 | + |
| 81 | +``` |
| 82 | +H(new) - H(old) = [ -p(s_l) log p(s_l) - p(s_r) log p(s_r) - p(m) log p(m) ] |
| 83 | + - [ -p(s) log p(s) - p(m_l) log p(m_l) - p(m_r) log p(m_r) ] |
| 84 | +``` |
| 85 | + |
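A sketch of that computation as a hypothetical helper operating on raw counts
(`total` can be shared because a repartition doesn't change it):

```
import math

def term(count: float, total: float) -> float:
    # One entropy contribution, -p * log(p), with the convention 0 * log(0) = 0.
    if count <= 0:
        return 0.0
    p = count / total
    return -p * math.log(p)

def entropy_delta(s, s_l, s_r, m_l, m_r, total: float) -> float:
    """H(new) - H(old) for splitting s into (s_l, s_r) and merging (m_l, m_r).

    All arguments except `total` are bin counts; the merged bin's count is
    m_l + m_r, so it needn't be passed separately.
    """
    new = term(s_l, total) + term(s_r, total) + term(m_l + m_r, total)
    old = term(s, total) + term(m_l, total) + term(m_r, total)
    return new - old
```
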
__Algorithm pseudocode.__ This leaves us with the following algorithm for
an update:

```
def update(score):
    if there aren't N bins yet:
        insert Bin(lower_divider=score, count=1)
        return

    s = score's bin
    s.count++

    m_l, m_r = smallest contiguous pair of bins separate from s
    m = Bin(lower_divider=m_l.lower_divider, count=m_l.count + m_r.count)

    s_l = Bin(lower_divider=s.lower_divider, count=s.count/2)
    s_r = Bin(lower_divider=score, count=s.count/2)

    if H(new) > H(old):
        replace s with s_l and s_r
        replace m_l and m_r with m
```

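Putting the pieces together, here is a runnable sketch under the assumptions
above. It's illustrative rather than production code: for instance, a score
that ties an existing divider can create a zero-width bin, which real code
would guard against.

```
import bisect
import math

class BinEntropyNormalizer:
    def __init__(self, n_bins: int = 5):
        self.n_bins = n_bins
        self.dividers = [-math.inf]  # dividers[i] is bin i's lower divider
        self.counts = [0.0]          # counts[i] is bin i's approximate count

    def _bin_index(self, score: float) -> int:
        # Rightmost bin whose lower divider is <= score.
        return bisect.bisect_right(self.dividers, score) - 1

    def normalize(self, score: float) -> float:
        # One simple discretization: use the bin index as the quantile.
        return self._bin_index(score) / len(self.counts)

    def _term(self, count: float, total: float) -> float:
        p = count / total
        return -p * math.log(p) if count > 0 else 0.0

    def update(self, score: float) -> None:
        i = self._bin_index(score)
        if len(self.counts) < self.n_bins:
            # Still filling up: insert a new bin divided at the score.
            self.dividers.insert(i + 1, score)
            self.counts.insert(i + 1, 1.0)
            return
        self.counts[i] += 1.0
        total = sum(self.counts)

        # Smallest adjacent pair (j, j + 1) that doesn't contain bin i.
        pairs = [j for j in range(len(self.counts) - 1) if j != i and j + 1 != i]
        if not pairs:
            return
        j = min(pairs, key=lambda k: self.counts[k] + self.counts[k + 1])

        # Six-term entropy difference for the proposed repartition.
        half = self.counts[i] / 2.0
        merged = self.counts[j] + self.counts[j + 1]
        delta = (
            2.0 * self._term(half, total) + self._term(merged, total)
            - self._term(self.counts[i], total)
            - self._term(self.counts[j], total)
            - self._term(self.counts[j + 1], total)
        )
        if delta <= 0.0:
            return

        def merge(k: int) -> None:
            self.counts[k] += self.counts.pop(k + 1)
            self.dividers.pop(k + 1)

        def split(k: int) -> None:
            h = self.counts[k] / 2.0
            self.counts[k] = h
            self.counts.insert(k + 1, h)
            self.dividers.insert(k + 1, score)

        # Order the operations so the first doesn't invalidate the
        # second's index.
        if j > i:
            split(i)       # shifts the pair up by one
            merge(j + 1)
        else:
            merge(j)       # shifts bin i down by one
            split(i - 1)
```

Usage is then one `update(x)` per incoming score, with `normalize(x)`
available at any point in the stream.
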
## Performance

We did some benchmarking under these conditions:
 - Two tests: samples from `Beta(2,5)`, and samples from a power-law
   distribution.
 - The bin-entropy algorithm targeting 5 bins.
 - The 'simple' algorithm from above, storing `N` samples.
 - The 'better-simple' algorithm (sketched below), which stores the last `N`
   samples and uses them to decide the dividers for 5 bins.

Results showed that the bin-entropy algorithm performs about as well as the
better-simple algorithm when `N = 150`, and actually slightly outperforms it
for `N < 150`. This may be due to the split/merge making it less sensitive to
noise. It's difficult to compare directly against the simple algorithm because
its number of bins is different, but its results were significantly noisier
than the better-simple algorithm's.

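For reference, a minimal sketch of the 'better-simple' baseline (illustrative;
it recomputes the dividers on every call, where real code would cache them):

```
from collections import deque

class BetterSimpleNormalizer:
    """Keeps a sliding window of recent scores and derives bin dividers."""

    def __init__(self, window: int = 150, n_bins: int = 5):
        self.n_bins = n_bins
        self.samples: deque[float] = deque(maxlen=window)

    def update(self, score: float) -> None:
        self.samples.append(score)

    def normalize(self, score: float) -> float:
        # Dividers are evenly spaced empirical quantiles of the window;
        # the score's bin index is then used as its quantile.
        ordered = sorted(self.samples)
        step = max(len(ordered) // self.n_bins, 1)
        dividers = ordered[step::step][: self.n_bins - 1]
        return sum(d <= score for d in dividers) / self.n_bins
```
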
## Ideas for improvement

Here are some things that anecdotally improve performance:

 - The algorithm loses information when a bin is split, because splitting the
   counts evenly is uninformed. Tracking moments of the distribution within
   each bin can help make this a more informed choice.

 - We only perform one split/merge operation per insert, but it's possible
   that several split/merges should be done after one insert.

 - Allowing the merged bins to contain one of the split bins provides a small
   improvement in performance.