Add initial count-min-sketch error-vs-width characterization profile#2
Draft
c-dickens wants to merge 1 commit into
- Add error_vs_width profile C++ implementation
- Add Python visualization script for error vs width analysis
- Register profile in build system and main.cpp

This profile characterizes how Count-Min Sketch error decreases as sketch width increases.
c-dickens pushed a commit that referenced this pull request Feb 24, 2026
Consolidates the Count-Min Sketch frequency estimation characterization from PRs #1, #2, and #3 into a single clean profile. The profile sweeps across sketch widths (256-4096) with constant load factor (distinct/width ≈ 4), runs adaptive trials per width, and uses KLL sketches to track error distribution quantiles (median, p75, p90, p95, max) for both absolute and relative error metrics against theoretical bounds. https://claude.ai/code/session_01RmEdWmm6vYXY3XevAAsWVe
c-dickens pushed a commit that referenced this pull request Feb 25, 2026 (same commit message)
Count-Min Sketch Error vs Width Profile Parameters
Sketch Configuration
Data Stream Parameters
Data Distribution (Zipf)
Trial Configuration
Theoretical Guarantees Tested
Expected Output Size
Note on Experimental Design: Stream Length Scaling Issue
Current Issue
The current count-min-sketch-error-vs-width profile uses fixed parameters across all width configurations:
Problem: At the largest width (16,384), the sketch has twice as many buckets as distinct items, so it is nearly collision-free. The median error then approaches zero, not because of the width-scaling property but simply because the sketch behaves like a nearly perfect hash table.
Why This Matters
The theoretical bound is:
Absolute Error ≤ εN = (e/width) × N
While this bound correctly halves as width doubles, the actual error decreases faster and approaches zero when width >> distinct_items, masking the true space-accuracy tradeoff.
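To make the bound concrete, here is a minimal sketch that evaluates εN for a given width and stream length. The function name `cms_error_bound` is hypothetical and not part of the profile; the only thing taken from the text above is the formula ε = e/width.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the PR): theoretical Count-Min Sketch
// absolute-error bound, eps * N, with eps = e / width as stated above.
double cms_error_bound(std::size_t width, std::size_t stream_length) {
    assert(width > 0);  // eps is undefined for an empty sketch
    const double epsilon = std::exp(1.0) / static_cast<double>(width);
    return epsilon * static_cast<double>(stream_length);
}
```

With a fixed N = 1 << 17, `cms_error_bound(256, 1 << 17)` is about 1,392 and halves with each doubling of width. If N instead grows as 1,024 × width, the bound stays at about 2,784 for every width, which is the regime the proposed fix below aims for.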
Proposed Fix
To properly characterize the width vs error relationship, scale stream length proportionally with width:
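// Instead of a fixed stream_length = 1 << 17, scale N with width:
const size_t stream_length_base = 1 << 10; // 1,024 updates per bucket of width
const size_t stream_length = stream_length_base * width;
// Equivalent alternative (use in place of the line above): hold εN constant across widths.
// With epsilon = e / width, solving epsilon * N = target_epsilon_n gives N = target_epsilon_n / epsilon.
const size_t stream_length = static_cast<size_t>(target_epsilon_n / epsilon);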
This ensures that:
Alternative Approach
Keep fixed N but scale the number of distinct items with width:
const unsigned zipf_lg_range = lg_width; // Range grows with width
This maintains a constant ratio of width to distinct items, avoiding the collision-free regime.
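The effect of tying the Zipf range to the width can be checked with a small calculation. The sketch below is illustrative only: `distinct_per_bucket` is a hypothetical helper, and it assumes the number of distinct items is at most 2^zipf_lg_range, so it computes an upper bound on the distinct-to-width ratio.

```cpp
#include <cmath>

// Hypothetical helper (not from the PR): upper bound on distinct items per
// bucket, assuming at most 2^zipf_lg_range distinct items and a sketch of
// width 2^lg_width. Computes 2^(zipf_lg_range - lg_width).
double distinct_per_bucket(int zipf_lg_range, int lg_width) {
    return std::ldexp(1.0, zipf_lg_range - lg_width);
}
```

With `zipf_lg_range = lg_width` the ratio is 1.0 at every width. With a fixed range it falls as width grows: for example, a range of 2^13 (the value implied by the 2× figure at width 16,384 above) gives 32 items per bucket at width 256 but only 0.5 at width 16,384.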