Fix out-of-order sample loss by making remote write synchronous by cnewkirk · Pull Request #109 · OpenNMS/opennms-cortex-tss-plugin

cnewkirk · 2026-03-16T03:46:19Z

The store() method previously fired HTTP writes asynchronously via executeAsync() and returned immediately. When the RingBuffer's multiple worker threads dispatched consecutive batches containing samples for the same series, the async HTTP requests could arrive at the remote write endpoint out of timestamp order, causing the backend to reject the stale samples as out-of-order.

This change makes store() block until the HTTP write completes, ensuring the ring buffer worker thread does not process the next batch until the current write has landed. This preserves per-series timestamp ordering across consecutive WriteRequests as required by the Prometheus Remote Write spec.
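The effect of blocking can be sketched in isolation. The example below is a minimal illustration, not the plugin's code: `SyncWriteSketch`, `RecordingWriter`, `writeAsync`, and `storeBlocking` are hypothetical names, and the HTTP call is simulated with a delayed `CompletableFuture` in which earlier timestamps take longer to land. It covers a single caller thread only; because the caller waits for each write, the next one cannot be issued until the previous one has arrived.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SyncWriteSketch {

    /** Hypothetical stand-in for the remote write endpoint; records arrival order. */
    static final class RecordingWriter {
        final List<Long> arrivals = new CopyOnWriteArrayList<>();

        /**
         * Simulates an asynchronous HTTP write. Earlier timestamps are delayed
         * longer, so requests that are not awaited would tend to land out of order.
         */
        CompletableFuture<Void> writeAsync(long timestamp) {
            long delayMs = Math.max(0, (4 - timestamp) * 20);
            Executor delayed = CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS);
            return CompletableFuture.runAsync(() -> arrivals.add(timestamp), delayed);
        }
    }

    /** Blocks the calling thread until the simulated write has completed. */
    static void storeBlocking(RecordingWriter writer, long timestamp) {
        try {
            writer.writeAsync(timestamp).get(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for write", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("remote write failed", e);
        }
    }

    public static void main(String[] args) {
        RecordingWriter writer = new RecordingWriter();
        for (long ts = 1; ts <= 3; ts++) {
            storeBlocking(writer, ts);
        }
        System.out.println(writer.arrivals); // [1, 2, 3]
    }
}
```

Calling `writeAsync` without waiting on the returned future would let the shorter-delay writes overtake the longer ones, which is the reordering described above.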

Additionally fixes a bug where samplesLost incorrectly counted unfiltered samples (including NaN) instead of the actual samples that were attempted.
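The counting fix can be illustrated with a small sketch. It assumes NaN samples are filtered out before the write is attempted; `LostSampleCountSketch`, `dropNaN`, and `lostOnFailedWrite` are hypothetical names and do not come from the plugin. The point is that a failed write should add the size of the filtered batch to a samplesLost-style counter, not the size of the unfiltered input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class LostSampleCountSketch {

    /** Drops NaN values, which in this sketch are never sent to the backend. */
    static List<Double> dropNaN(List<Double> samples) {
        return samples.stream()
                .filter(v -> !v.isNaN())
                .collect(Collectors.toList());
    }

    /**
     * Returns how many samples to count as lost when a write fails:
     * only the samples that were actually attempted, not the unfiltered input.
     */
    static long lostOnFailedWrite(List<Double> unfiltered) {
        List<Double> attempted = dropNaN(unfiltered);
        return attempted.size();
    }

    public static void main(String[] args) {
        List<Double> batch = Arrays.asList(1.0, Double.NaN, 2.5, Double.NaN);
        System.out.println(lostOnFailedWrite(batch)); // 2
    }
}
```

Counting `batch.size()` here would report 4 lost samples even though only 2 were ever sent.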

Validated via A/B E2E testing against Thanos Receive:

- Baseline (async): 14 out-of-order / 5,262 appended (0.27% loss)
- Fix (sync): 0 out-of-order / 5,264 appended (0.00% loss)
- Throughput: identical (~5,260 samples over equal soak periods)
- All 45 smoke tests passing on both runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

marshallmassengill · 2026-03-30T12:35:39Z

Created https://opennms.atlassian.net/browse/NMS-19647 for this one.

cnewkirk wants to merge 1 commit into OpenNMS:master from cnewkirk:bugfix/improve-prom-writespec-compliance