Skip to content

Fix out-of-order sample loss by making remote write synchronous#109

Open
cnewkirk wants to merge 1 commit intoOpenNMS:masterfrom
cnewkirk:bugfix/improve-prom-writespec-compliance
Open

Fix out-of-order sample loss by making remote write synchronous#109
cnewkirk wants to merge 1 commit intoOpenNMS:masterfrom
cnewkirk:bugfix/improve-prom-writespec-compliance

Conversation

@cnewkirk
Copy link
Copy Markdown
Contributor

The store() method previously fired HTTP writes asynchronously via executeAsync() and returned immediately. When the RingBuffer's multiple worker threads dispatched consecutive batches containing samples for the same series, the async HTTP requests could arrive at the remote write endpoint out of timestamp order, causing the backend to reject the stale samples as out-of-order.

This change makes store() block until the HTTP write completes, ensuring the ring buffer worker thread does not process the next batch until the current write has landed. This preserves per-series timestamp ordering across consecutive WriteRequests as required by the Prometheus Remote Write spec.

Additionally fixes a bug where samplesLost incorrectly counted unfiltered samples (including NaN) instead of the actual samples that were attempted.

Validated via A/B E2E testing against Thanos Receive:

  • Baseline (async): 14 out-of-order / 5,262 appended (0.27% loss)
  • Fix (sync): 0 out-of-order / 5,264 appended (0.00% loss)
  • Throughput: identical (~5,260 samples over equal soak periods)
  • All 45 smoke tests passing on both runs

The store() method previously fired HTTP writes asynchronously via
executeAsync() and returned immediately. When the RingBuffer's
multiple worker threads dispatched consecutive batches containing
samples for the same series, the async HTTP requests could arrive
at the remote write endpoint out of timestamp order, causing the
backend to reject the stale samples as out-of-order.

This change makes store() block until the HTTP write completes,
ensuring the ring buffer worker thread does not process the next
batch until the current write has landed. This preserves per-series
timestamp ordering across consecutive WriteRequests as required by
the Prometheus Remote Write spec.

Additionally fixes a bug where samplesLost incorrectly counted
unfiltered samples (including NaN) instead of the actual samples
that were attempted.

Validated via A/B E2E testing against Thanos Receive:
  - Baseline (async): 14 out-of-order / 5,262 appended (0.27% loss)
  - Fix (sync):        0 out-of-order / 5,264 appended (0.00% loss)
  - Throughput: identical (~5,260 samples over equal soak periods)
  - All 45 smoke tests passing on both runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@marshallmassengill
Copy link
Copy Markdown

Created https://opennms.atlassian.net/browse/NMS-19647 for this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants