Add initial count-min-sketch error-vs-width characterization profile by c-dickens · Pull Request #2 · c-dickens/datasketches-characterization

c-dickens · 2026-01-20T23:21:20Z

Add error_vs_width profile C++ implementation
Add Python visualization script for error vs width analysis
Register profile in build system and main.cpp

This profile characterizes how Count-Min Sketch estimation error decreases as sketch width increases.

Count-Min Sketch Error vs Width Profile Parameters

Sketch Configuration

Widths tested: lg_width = {8, 10, 12, 14}
- Actual widths: 256, 1024, 4096, 16384
Depth (hash functions): 5 (fixed across all widths)
Theoretical epsilon: e/width (varies by width)
- Width 256: ε ≈ 0.0106
- Width 1024: ε ≈ 0.0027
- Width 4096: ε ≈ 0.00066
- Width 16384: ε ≈ 0.00017
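
The ε values above follow directly from ε = e/width. A minimal sketch of that arithmetic is below; the helper names `width_from_lg` and `theoretical_epsilon` are hypothetical and are not taken from the profile's source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: width implied by lg_width (e.g. 8 -> 256).
std::size_t width_from_lg(unsigned lg_width) {
    assert(lg_width < 8 * sizeof(std::size_t));
    return std::size_t{1} << lg_width;
}

// Hypothetical helper: theoretical relative error using the e/width
// convention quoted in this description.
double theoretical_epsilon(std::size_t width) {
    assert(width > 0);
    return std::exp(1.0) / static_cast<double>(width);
}
```

Calling `theoretical_epsilon(width_from_lg(8))` gives roughly 0.0106, matching the first entry in the list.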

Data Stream Parameters

Stream length: 131,072 (2^17 items per trial)
Number of trials: 1,024 (2^10 independent trials)
Total items processed: ~134 million (1,024 trials × 131,072 items)

Data Distribution (Zipf)

Distribution type: Zipfian
Range: 8,192 distinct values (2^13)
Zipf exponent: 1.1 (realistic skew for real-world data)
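
For readers who want to reproduce a comparable stream, one simple way to draw Zipfian items is to weight rank r by 1/(r+1)^s and sample with `std::discrete_distribution`. This is only a sketch under that assumption; the profile's actual generator may differ, and `make_zipf` and `zipf_stream` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sampler over ranks 0..range-1, where rank r has weight
// 1 / (r + 1)^exponent. Not necessarily the generator used by the profile.
std::discrete_distribution<unsigned> make_zipf(std::size_t range, double exponent) {
    assert(range > 0);
    std::vector<double> weights(range);
    for (std::size_t r = 0; r < range; ++r) {
        weights[r] = 1.0 / std::pow(static_cast<double>(r + 1), exponent);
    }
    return std::discrete_distribution<unsigned>(weights.begin(), weights.end());
}

// Draws `length` items from the Zipf sampler with a fixed seed, so the same
// seed always reproduces the same stream on a given standard library.
std::vector<unsigned> zipf_stream(std::size_t length, std::size_t range,
                                  double exponent, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    auto zipf = make_zipf(range, exponent);
    std::vector<unsigned> stream(length);
    for (auto& item : stream) {
        item = zipf(rng);
    }
    return stream;
}
```

With the parameters listed above, a stream would be drawn as `zipf_stream(1 << 17, 1 << 13, 1.1, seed)`.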

Trial Configuration

Random seed: Trial-specific (base seed 42 + trial × 1000)
Same stream: All width configurations tested on identical stream per trial
Output: Per-item frequency estimates for all distinct items across all trials
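
The trial structure described above (one seed per trial, one stream per trial, every width run against that same stream) could be organized roughly as follows. The function names and the callback-based shape are assumptions made for illustration, not the profile's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Seed rule quoted above: base seed 42 plus trial * 1000.
std::uint64_t trial_seed(std::size_t trial, std::uint64_t base_seed = 42) {
    return base_seed + static_cast<std::uint64_t>(trial) * 1000;
}

// Hypothetical driver: builds one stream per trial, then hands the SAME
// stream to every width configuration so results are directly comparable.
void run_trials(
    std::size_t num_trials,
    const std::vector<unsigned>& lg_widths,
    const std::function<std::vector<std::uint32_t>(std::uint64_t)>& make_stream,
    const std::function<void(std::size_t, unsigned,
                             const std::vector<std::uint32_t>&)>& run) {
    assert(make_stream && run);
    for (std::size_t trial = 0; trial < num_trials; ++trial) {
        const std::vector<std::uint32_t> stream = make_stream(trial_seed(trial));
        for (unsigned lg_width : lg_widths) {
            run(trial, lg_width, stream);
        }
    }
}
```

Generating the stream outside the width loop is what keeps the comparison across widths free of stream-to-stream variance.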

Theoretical Guarantees Tested

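Error bound: estimate ≤ true_freq + ε×N, with probability at least 1 − δ
Where:
- ε = e/width (theoretical relative error, expressed as a fraction of N)
- N = total stream weight (131,072)
- δ = e^(−depth), about 0.0067 for depth 5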
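
To make the bound check concrete, here is a deliberately small Count-Min Sketch with a `within_bound` test. It is an illustrative stand-in and not the datasketches-cpp implementation: the class name `TinyCountMin`, the splitmix64-style hash, and the helper `within_bound` are all assumptions introduced for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Illustrative Count-Min Sketch (hypothetical; not the library class).
class TinyCountMin {
public:
    TinyCountMin(std::size_t width, std::size_t depth)
        : width_(width), depth_(depth), table_(width * depth, 0), total_(0) {
        assert(width > 0 && depth > 0);
    }

    // Adds `weight` to one bucket in every row.
    void update(std::uint64_t item, std::uint64_t weight = 1) {
        for (std::size_t row = 0; row < depth_; ++row) {
            table_[row * width_ + bucket(item, row)] += weight;
        }
        total_ += weight;
    }

    // Minimum over rows; for non-negative weights this never underestimates.
    std::uint64_t estimate(std::uint64_t item) const {
        std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
        for (std::size_t row = 0; row < depth_; ++row) {
            best = std::min(best, table_[row * width_ + bucket(item, row)]);
        }
        return best;
    }

    std::uint64_t total_weight() const { return total_; }
    double epsilon() const { return std::exp(1.0) / static_cast<double>(width_); }

private:
    // splitmix64-style mixer with a per-row offset (a stand-in hash).
    std::size_t bucket(std::uint64_t item, std::size_t row) const {
        std::uint64_t z = item + 0x9E3779B97F4A7C15ULL * (row + 1);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        z ^= z >> 31;
        return static_cast<std::size_t>(z % width_);
    }

    std::size_t width_;
    std::size_t depth_;
    std::vector<std::uint64_t> table_;
    std::uint64_t total_;
};

// True when estimate <= true_freq + epsilon * N for this item.
bool within_bound(const TinyCountMin& sketch, std::uint64_t item,
                  std::uint64_t true_freq) {
    const double bound = static_cast<double>(true_freq) +
                         sketch.epsilon() * static_cast<double>(sketch.total_weight());
    return static_cast<double>(sketch.estimate(item)) <= bound;
}
```

Because the estimate is a minimum over counters that each include the item's own weight, it can only overshoot the true frequency; the check above measures how often that overshoot stays inside ε×N.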

Expected Output Size

~1.7GB TSV file with columns:
- Trial, LgWidth, Width, Depth, Epsilon, TheoreticalEpsilon
- TrueFreq, Estimate, AbsError, RelError
- TotalWeight, ErrorBound, WithinBound
- FreqRatio, NumDistinct
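
As a rough sanity check on the file size, the row count can be estimated by assuming one row per (trial, width, distinct value). That is an upper bound, since rare values may not appear in every trial. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on rows, assuming one row per (trial, width, distinct value).
std::uint64_t expected_rows(std::uint64_t trials, std::uint64_t widths,
                            std::uint64_t distinct) {
    return trials * widths * distinct;
}

// Average bytes per row for a file of `file_bytes` bytes.
double bytes_per_row(double file_bytes, std::uint64_t rows) {
    assert(rows > 0);
    return file_bytes / static_cast<double>(rows);
}
```

With 1,024 trials, 4 widths, and 8,192 values this gives 33,554,432 rows, so a 1.7 GB file (taking 1 GB as 10^9 bytes) works out to roughly 50 bytes per row.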

Note on Experimental Design: Stream Length Scaling Issue

Current Issue

The current count-min-sketch-error-vs-width profile uses fixed parameters across all width configurations:

Stream length N = 131,072 (2^17)
Distinct items ≈ 8,192 (2^13 via Zipf distribution)
Widths tested: 256, 1,024, 4,096, 16,384

Problem: At the largest width (16,384), each row of the sketch has twice as many buckets as there are distinct items (8,192), making it nearly collision-free. This causes the median error to approach zero, not because of the width scaling property, but simply because the sketch behaves like a nearly perfect hash table.
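
The "nearly collision-free" claim can be quantified with a back-of-the-envelope model. Assuming ideal uniform, independent hashing and that all distinct values actually appear, the chance that an item has a bucket to itself in one row is (1 − 1/width)^(distinct − 1), and the estimate is exact whenever at least one row is collision-free. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// P(an item shares its bucket with no other distinct item in a single row),
// under an idealized uniform-hashing assumption.
double row_collision_free(std::size_t width, std::size_t distinct) {
    assert(width > 0 && distinct > 0);
    return std::pow(1.0 - 1.0 / static_cast<double>(width),
                    static_cast<double>(distinct - 1));
}

// P(at least one of `depth` independent rows is collision-free for the item),
// in which case the min-over-rows estimate equals the true frequency.
double exact_estimate_probability(std::size_t width, std::size_t distinct,
                                  std::size_t depth) {
    const double p = row_collision_free(width, distinct);
    return 1.0 - std::pow(1.0 - p, static_cast<double>(depth));
}
```

For width 16,384 with 8,192 distinct items and depth 5, this model gives about 0.61 per row and about 0.99 overall, while for width 256 the same probability is effectively zero. That gap is the saturation effect described above.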

Why This Matters

The theoretical bound is:
Absolute Error ≤ εN = (e/width) × N

While this bound correctly halves as width doubles, the observed error decreases faster than the bound and approaches zero once width is comparable to or larger than the number of distinct items, masking the true space-accuracy tradeoff.

Proposed Fix

To properly characterize the width vs error relationship, scale stream length proportionally with width:

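// Option A: instead of a fixed stream_length = 1 << 17, scale with width:
const size_t stream_length_base = 1 << 10; // 1,024 items per bucket of width
const size_t stream_length = stream_length_base * width;

// Option B (use instead of A): keep εN constant across widths.
// Solving εN = target_epsilon_n for N gives N = target_epsilon_n / ε:
const size_t stream_length = static_cast<size_t>(target_epsilon_n / epsilon);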

This ensures that:

The collision probability stays meaningful across all widths
The bound εN remains a relevant predictor of actual error
The experiment demonstrates true scaling behavior, not saturation effects
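
A short calculation shows what proportional scaling does to the bound. With N = base × width and ε = e/width, the product εN equals e × base for every width. The helpers below are hypothetical and use the base of 1,024 from the snippet above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Proposed scaling: stream length proportional to width (base defaults to 2^10).
std::size_t scaled_stream_length(std::size_t width,
                                 std::size_t base = std::size_t{1} << 10) {
    assert(width > 0);
    return base * width;
}

// Theoretical absolute error bound eps * N with eps = e / width.
double error_bound(std::size_t width, std::size_t stream_length) {
    assert(width > 0);
    return std::exp(1.0) / static_cast<double>(width) *
           static_cast<double>(stream_length);
}
```

Under the fixed N = 131,072 the bound falls from about 1,392 at width 256 to about 22 at width 16,384, whereas with the scaled length it stays near 2,784 at every width. This only addresses the bound: with the Zipf range fixed at 8,192 the number of distinct items cannot exceed 8,192 however long the stream is, so collision behavior still depends on the ratio of distinct items to width.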

Alternative Approach

Keep fixed N but scale the number of distinct items with width:
const unsigned zipf_lg_range = lg_width; // Range grows with width

This maintains a constant ratio of width to distinct items, avoiding the collision-free regime.


Consolidates the Count-Min Sketch frequency estimation characterization from PRs #1, #2, and #3 into a single clean profile. The profile sweeps across sketch widths (256-4096) with constant load factor (distinct/width ≈ 4), runs adaptive trials per width, and uses KLL sketches to track error distribution quantiles (median, p75, p90, p95, max) for both absolute and relative error metrics against theoretical bounds. https://claude.ai/code/session_01RmEdWmm6vYXY3XevAAsWVe

c-dickens wants to merge 1 commit into master from add-count-min-error-vs-width-profile
