Add initial count-min-sketch error-vs-width characterization profile #2

Draft
c-dickens wants to merge 1 commit into master from add-count-min-error-vs-width-profile
  • Add error_vs_width profile C++ implementation
  • Add Python visualization script for error vs width analysis
  • Register profile in build system and main.cpp

This profile characterizes how Count-Min Sketch estimation error decreases as the sketch width increases.

cms_error_vs_width

Count-Min Sketch Error vs Width Profile Parameters

Sketch Configuration

  • Widths tested: lg_width = {8, 10, 12, 14}
    • Actual widths: 256, 1024, 4096, 16384
  • Depth (hash functions): 5 (fixed across all widths)
  • Theoretical epsilon: e/width (varies by width)
    • Width 256: ε ≈ 0.0106
    • Width 1024: ε ≈ 0.0027
    • Width 4096: ε ≈ 0.00066
    • Width 16384: ε ≈ 0.00017
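The epsilon values listed above follow directly from ε = e/width. A minimal sketch of that arithmetic is shown here; the helper names `width_from_lg` and `theoretical_epsilon` are hypothetical and are not taken from the profile's source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Width implied by lg_width (width = 2^lg_width). Hypothetical helper.
inline std::size_t width_from_lg(unsigned lg_width) {
    assert(lg_width < 8 * sizeof(std::size_t));
    return std::size_t{1} << lg_width;
}

// Theoretical epsilon for a sketch of the given width, using the
// e/width relation quoted in the parameter list. Hypothetical helper.
inline double theoretical_epsilon(std::size_t width) {
    assert(width > 0);
    return std::exp(1.0) / static_cast<double>(width);
}
```

Calling `theoretical_epsilon(width_from_lg(8))` gives roughly 0.0106, matching the first entry in the list.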

Data Stream Parameters

  • Stream length: 131,072 (2^17 items per trial)
  • Number of trials: 1,024 (2^10 independent trials)
  • Total items processed: ~134 million (1,024 trials × 131,072 items)

Data Distribution (Zipf)

  • Distribution type: Zipfian
  • Range: 8,192 distinct values (2^13)
  • Zipf exponent: 1.1 (realistic skew for real-world data)
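For readers unfamiliar with the distribution, one possible way to draw a Zipfian stream with these parameters is sketched below. This is an assumption for illustration only: the profile's actual generator is not shown in this PR, and `zipf_weights` and `zipf_stream` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Unnormalized Zipf weights: weight(rank) = 1 / rank^exponent for rank = 1..range.
inline std::vector<double> zipf_weights(std::size_t range, double exponent) {
    assert(range > 0);
    std::vector<double> w(range);
    for (std::size_t i = 0; i < range; ++i) {
        w[i] = 1.0 / std::pow(static_cast<double>(i + 1), exponent);
    }
    return w;
}

// Draws n items (0-based ranks in [0, range)) from the Zipf distribution.
// std::discrete_distribution normalizes the weights internally.
inline std::vector<std::size_t> zipf_stream(std::size_t n, std::size_t range,
                                            double exponent, std::uint64_t seed) {
    const std::vector<double> w = zipf_weights(range, exponent);
    std::discrete_distribution<std::size_t> dist(w.begin(), w.end());
    std::mt19937_64 rng(seed);
    std::vector<std::size_t> stream(n);
    for (auto& item : stream) {
        item = dist(rng);
    }
    return stream;
}
```

With an exponent of 1.1, the most frequent item is about 2^1.1 ≈ 2.14 times as likely as the second most frequent one, and `zipf_stream(131072, 8192, 1.1, seed)` would correspond to one trial's stream under these assumptions.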

Trial Configuration

  • Random seed: Trial-specific (base seed 42 + trial × 1000)
  • Same stream: All width configurations tested on identical stream per trial
  • Output: Per-item frequency estimates for all distinct items across all trials
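The seeding rule above can be written as a small helper. This is a sketch under the stated rule (base seed 42, stride 1000); the names `trial_seed` and `make_trial_rng` and the choice of `std::mt19937_64` are assumptions, not the profile's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Seed for a given trial: base seed 42 plus trial * 1000.
inline std::uint64_t trial_seed(std::uint64_t trial) {
    const std::uint64_t base_seed = 42;
    const std::uint64_t stride = 1000;
    return base_seed + trial * stride;
}

// Hypothetical helper: a fresh generator for one trial. Building the stream
// once from this generator and feeding it to every width keeps the input
// identical across width configurations within the trial.
inline std::mt19937_64 make_trial_rng(std::uint64_t trial) {
    return std::mt19937_64(trial_seed(trial));
}
```

Because the generator depends only on the trial index, rerunning a single trial reproduces the same stream regardless of which widths are being tested.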

Theoretical Guarantees Tested

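  • Error bound: true_freq ≤ estimate ≤ true_freq + ε×N, where the upper bound holds with probability at least 1 − δ
  • Where:
    • ε = e/width (theoretical error relative to the total stream weight N)
    • δ = e^(−depth) ≈ 0.0067 for depth 5 (from the standard Count-Min Sketch analysis)
    • N = total stream weight (131,072)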
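To make the bound check concrete, here is a minimal toy Count-Min Sketch with a `within_bound` test. It is an illustration only and is not the sketch implementation under test: the class `ToyCountMin`, the splitmix64-style row hash, and `within_bound` are all assumptions introduced for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Toy Count-Min Sketch: depth rows of width counters each.
class ToyCountMin {
public:
    ToyCountMin(std::size_t width, std::size_t depth)
        : width_(width), depth_(depth), table_(width * depth, 0), total_(0) {
        assert(width > 0 && depth > 0);
    }

    // Adds weight to one counter per row.
    void update(std::uint64_t item, std::uint64_t weight = 1) {
        for (std::size_t row = 0; row < depth_; ++row) {
            table_[row * width_ + bucket(item, row)] += weight;
        }
        total_ += weight;
    }

    // Minimum counter over all rows; never below the true frequency.
    std::uint64_t estimate(std::uint64_t item) const {
        std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
        for (std::size_t row = 0; row < depth_; ++row) {
            best = std::min(best, table_[row * width_ + bucket(item, row)]);
        }
        return best;
    }

    double epsilon() const { return std::exp(1.0) / static_cast<double>(width_); }
    std::uint64_t total_weight() const { return total_; }

private:
    // splitmix64-style mixer salted per row (an assumption for illustration).
    std::size_t bucket(std::uint64_t item, std::size_t row) const {
        std::uint64_t z = item + 0x9E3779B97F4A7C15ULL * (row + 1);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        z ^= z >> 31;
        return static_cast<std::size_t>(z % width_);
    }

    std::size_t width_;
    std::size_t depth_;
    std::vector<std::uint64_t> table_;
    std::uint64_t total_;
};

// True when estimate <= true_freq + epsilon * N for this item.
inline bool within_bound(const ToyCountMin& s, std::uint64_t item,
                         std::uint64_t true_freq) {
    const double bound = static_cast<double>(true_freq) +
                         s.epsilon() * static_cast<double>(s.total_weight());
    return static_cast<double>(s.estimate(item)) <= bound;
}
```

The lower side of the bound (no underestimation) holds deterministically for non-negative weights, while `within_bound` can fail for a small fraction of items, which is what a WithinBound column would record.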

Expected Output Size

  • ~1.7 GB TSV file with columns:
    • Trial, LgWidth, Width, Depth, Epsilon, TheoreticalEpsilon
    • TrueFreq, Estimate, AbsError, RelError
    • TotalWeight, ErrorBound, WithinBound
    • FreqRatio, NumDistinct

Note on Experimental Design: Stream Length Scaling Issue

Current Issue

The current count-min-sketch-error-vs-width profile uses fixed parameters across all width configurations:

  • Stream length N = 131,072 (2^17)
  • Distinct items ≈ 8,192 (2^13 via Zipf distribution)
  • Widths tested: 256, 1,024, 4,096, 16,384

Problem: At the largest width (16,384), each sketch row has twice as many buckets as there are distinct items, so most items land in at least one row with no collision and are estimated exactly. This drives the median error toward zero, not because of the width scaling property, but simply because the sketch behaves like a nearly perfect hash table.
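A rough back-of-the-envelope check of this saturation effect is sketched below. It assumes idealized uniform, independent hashing per row and that all 8,192 values actually occur in the stream; both are simplifying assumptions, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Probability that a given item shares its bucket with no other distinct
// item in one row, under uniform hashing: (1 - 1/width)^(distinct - 1).
inline double prob_clean_row(std::size_t width, std::size_t distinct) {
    assert(width > 0 && distinct > 0);
    return std::pow(1.0 - 1.0 / static_cast<double>(width),
                    static_cast<double>(distinct - 1));
}

// Probability that at least one of `depth` independent rows is clean,
// in which case the minimum over rows equals the true frequency.
inline double prob_exact_estimate(std::size_t width, std::size_t distinct,
                                  std::size_t depth) {
    const double collide = 1.0 - prob_clean_row(width, distinct);
    return 1.0 - std::pow(collide, static_cast<double>(depth));
}
```

Under these assumptions, `prob_exact_estimate(16384, 8192, 5)` comes out at about 0.99, while the same call with width 256 is effectively zero, which is consistent with the median error collapsing only at the largest width.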

Why This Matters

The theoretical bound is:
Absolute Error ≤ εN = (e/width) × N

While this bound correctly halves each time the width doubles, the observed error falls faster and approaches zero once the width is comparable to or larger than the number of distinct items, masking the true space-accuracy tradeoff.
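For reference, the absolute bound εN at the fixed stream length can be computed as follows. This is a small sketch of the formula above; `error_bound` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Absolute error bound epsilon * N with epsilon = e / width.
inline double error_bound(std::size_t width, std::size_t stream_length) {
    assert(width > 0);
    return std::exp(1.0) / static_cast<double>(width) *
           static_cast<double>(stream_length);
}
```

With N = 131,072 this gives roughly 1,392 counts at width 256 and roughly 22 counts at width 16,384, so the bound itself stays well above zero even where the observed median error does not.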

Proposed Fix

To properly characterize the width vs error relationship, scale stream length proportionally with width:

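// Instead of a fixed stream_length = 1 << 17, scale it with width:
const size_t stream_length_base = size_t{1} << 10; // 1,024 items per bucket
const size_t stream_length = stream_length_base * width;

// Or, equivalently, pick a target value for εN and solve for N.
// With epsilon = e / width: N = target_epsilon_n / epsilon.
const size_t stream_length_alt = static_cast<size_t>(target_epsilon_n / epsilon);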

This ensures that:

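  1. The collision probability stays meaningful across all widths (provided the number of distinct items also grows with the stream; with the Zipf range fixed at 2^13 it stays capped at 8,192)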
  2. The bound εN remains a relevant predictor of actual error
  3. The experiment demonstrates true scaling behavior, not saturation effects
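The effect of proportional scaling on the bound can be checked with a short calculation. This is a sketch assuming a base of 1,024 items per bucket as in the proposal; `scaled_stream_length` and `epsilon_n` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Stream length that grows linearly with width: N = base * width.
inline std::size_t scaled_stream_length(std::size_t width,
                                        std::size_t base = std::size_t{1} << 10) {
    return base * width;
}

// epsilon * N with epsilon = e / width.
inline double epsilon_n(std::size_t width, std::size_t stream_length) {
    assert(width > 0);
    return std::exp(1.0) / static_cast<double>(width) *
           static_cast<double>(stream_length);
}
```

With N = 1,024 × width, εN is e × 1,024 ≈ 2,783.5 for every width, so scaling N with width and holding εN constant describe the same design. The cost is stream size: at width 16,384 each trial would process 16,777,216 items.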

Alternative Approach

Keep fixed N but scale the number of distinct items with width:
const unsigned zipf_lg_range = lg_width; // Range grows with width

This maintains a constant ratio of width to distinct items, avoiding the collision-free regime.
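The ratio this alternative holds constant can be expressed as a load factor (Zipf range divided by width). The sketch below is illustrative: `zipf_lg_range_for`, `load_factor`, and the `lg_load_factor` parameter are hypothetical names, with `lg_load_factor = 0` corresponding to the one-line change above.

```cpp
#include <cassert>
#include <cmath>

// lg of the Zipf range for a given lg_width. lg_load_factor = 0 reproduces
// zipf_lg_range = lg_width; larger values give more distinct items per bucket.
inline unsigned zipf_lg_range_for(unsigned lg_width, unsigned lg_load_factor = 0) {
    return lg_width + lg_load_factor;
}

// Ratio of Zipf range to sketch width: 2^(zipf_lg_range - lg_width).
inline double load_factor(unsigned lg_width, unsigned zipf_lg_range) {
    return std::ldexp(1.0, static_cast<int>(zipf_lg_range) -
                               static_cast<int>(lg_width));
}
```

Note that the Zipf range is only an upper bound on the number of distinct items that actually appear in a finite stream, so the realized load factor can be somewhat lower than this value.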

c-dickens pushed a commit that referenced this pull request Feb 24, 2026
Consolidates the Count-Min Sketch frequency estimation characterization
from PRs #1, #2, and #3 into a single clean profile. The profile sweeps
across sketch widths (256-4096) with constant load factor (distinct/width
≈ 4), runs adaptive trials per width, and uses KLL sketches to track
error distribution quantiles (median, p75, p90, p95, max) for both
absolute and relative error metrics against theoretical bounds.

https://claude.ai/code/session_01RmEdWmm6vYXY3XevAAsWVe
c-dickens pushed a commit that referenced this pull request Feb 25, 2026
(Same commit message as the Feb 24, 2026 commit above.)