fix(aggregators): Prevent duplicate push on early timer fire#18780
Draft
skartikey wants to merge 3 commits intoinfluxdata:masterfrom
Draft
fix(aggregators): Prevent duplicate push on early timer fire#18780skartikey wants to merge 3 commits intoinfluxdata:masterfrom
skartikey wants to merge 3 commits intoinfluxdata:masterfrom
Conversation
Go's runtime timers may fire slightly before their deadline on platforms with coarse timer granularity such as Windows and WSL2. When that happens, Agent.push() calls aggregator.Push() with a wall-clock time still inside the previous window, which trips the clock-drift safeguard in RunningAggregator.Push() and keeps periodEnd unchanged. The agent loop then immediately issues a second Push for the same period, producing duplicate metrics. Re-check time.Until(EndPeriod()) after the timer returns and sleep any remaining time so Push always runs at or after the window end, never before. The inner select preserves context cancellation. Part of fix for influxdata#18066
The safeguard added in PR influxdata#16375 resets the aggregation window whenever the current wall-clock time is not inside the next window. That condition triggers for any backward skew, including sub-ms jitter from a Go timer firing slightly early, in which case the window is recomputed to the same slot and Push runs twice for the same period. Widen the threshold so the window is only reset when the wall clock is outside the expected window by at least one full period. Hibernation and NTP step adjustments, which are seconds to minutes of skew, still trigger a reset as intended. Part of fix for influxdata#18066
… reset Add two regression tests for RunningAggregator.Push() window transitions: 1. When Push runs with time.Now() still inside the current window (simulating a Go timer firing slightly before the deadline), the window must advance by exactly one period. Before loosening the drift safeguard threshold, the condition nowWall.Before(since) tripped for any sub-period skew and rewound the window to the current slot, leaving periodEnd unchanged. 2. When the wall clock is more than one period beyond the expected window (simulating hibernation or an NTP step adjustment), the safeguard from PR influxdata#16375 must still reset the window near the current time rather than leaving periodEnd stuck in the past. Part of fix for influxdata#18066
4057bb1 to
f7f411d
Compare
Contributor
Author
|
@srebhan I pushed two fixes (model threshold + agent clamp) and tests, but honestly I think the agent clamp is redundant. Once the model fix lets Push advance the window correctly on an early fire, the duplicate chain breaks on its own. Either fix works in isolation; keeping both feels like belt-and-suspenders in a hot loop. Leaning toward dropping the |
Contributor
|
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Aggregator plugins with a configured
periodemit two flushed batches at every period boundary instead of one. This is reported on Docker Desktop for Windows and WSL2 but can happen on any platform where Go's runtime timer fires slightly before its deadline.Root cause
Two pieces of code interact badly at the period boundary:
agent/agent.go(*Agent).push()schedules the next flush withtime.After(time.Until(aggregator.EndPeriod())). Go's timers can fire a few ms early, especially with coarse OS timer granularity (Windows default is ~15.6 ms).models/running_aggregator.goRunningAggregator.Push()contains a safeguard added in fix(aggregators): Handle time drift when calculating aggregation windows #16375 to reset the window on clock jumps / hibernation. The condition isnowWall.Before(since)with no threshold, so sub-millisecond skew from an early timer fire trips it. The window is recomputed to the same slot,periodEnddoes not advance, and the agent loop (seeingtime.Until(EndPeriod())is now negative) immediately fires a secondPushfor the same period.Trace from the reported log:
08:12:59.998- timer fires 2 ms early, safeguard rewinds window to[08:12:00, 08:13:00], Push runs.08:13:00.054- agent loop fires again (EndPeriod unchanged), Push runs a second time.Checklist
Related issues
resolves #18066