Skip to content

fix(aggregators): Prevent duplicate push on early timer fire#18780

Draft
skartikey wants to merge 3 commits intoinfluxdata:masterfrom
skartikey:fix/aggregator-duplicate-push
Draft

fix(aggregators): Prevent duplicate push on early timer fire#18780
skartikey wants to merge 3 commits intoinfluxdata:masterfrom
skartikey:fix/aggregator-duplicate-push

Conversation

@skartikey
Copy link
Copy Markdown
Contributor

Summary

Aggregator plugins with a configured period emit two flushed batches at every period boundary instead of one. This is reported on Docker Desktop for Windows and WSL2 but can happen on any platform where Go's runtime timer fires slightly before its deadline.

Root cause

Two pieces of code interact badly at the period boundary:

  1. agent/agent.go (*Agent).push() schedules the next flush with time.After(time.Until(aggregator.EndPeriod())). Go's timers can fire a few ms early, especially with coarse OS timer granularity (Windows default is ~15.6 ms).
  2. models/running_aggregator.go RunningAggregator.Push() contains a safeguard added in fix(aggregators): Handle time drift when calculating aggregation windows #16375 to reset the window on clock jumps / hibernation. The condition is nowWall.Before(since) with no threshold, so sub-millisecond skew from an early timer fire trips it. The window is recomputed to the same slot, periodEnd does not advance, and the agent loop (seeing time.Until(EndPeriod()) is now negative) immediately fires a second Push for the same period.

Trace from the reported log:

  • 08:12:59.998 - timer fires 2 ms early, safeguard rewinds window to [08:12:00, 08:13:00], Push runs.
  • 08:13:00.054 - agent loop fires again (EndPeriod unchanged), Push runs a second time.

Checklist

Related issues

resolves #18066

@skartikey skartikey marked this pull request as draft April 22, 2026 19:48
@telegraf-tiger telegraf-tiger Bot added the fix pr to fix corresponding bug label Apr 22, 2026
Go's runtime timers may fire slightly before their deadline on
platforms with coarse timer granularity such as Windows and WSL2.
When that happens, Agent.push() calls aggregator.Push() with a
wall-clock time still inside the previous window, which trips the
clock-drift safeguard in RunningAggregator.Push() and keeps
periodEnd unchanged. The agent loop then immediately issues a
second Push for the same period, producing duplicate metrics.

Re-check time.Until(EndPeriod()) after the timer returns and sleep
any remaining time so Push always runs at or after the window end,
never before. The inner select preserves context cancellation.

Part of fix for influxdata#18066
The safeguard added in PR influxdata#16375 resets the aggregation window
whenever the current wall-clock time is not inside the next window.
That condition triggers for any backward skew, including sub-ms
jitter from a Go timer firing slightly early, in which case the
window is recomputed to the same slot and Push runs twice for the
same period.

Widen the threshold so the window is only reset when the wall
clock is outside the expected window by at least one full period.
Hibernation and NTP step adjustments, which are seconds to minutes
of skew, still trigger a reset as intended.

Part of fix for influxdata#18066
… reset

Add two regression tests for RunningAggregator.Push() window
transitions:

1. When Push runs with time.Now() still inside the current window
   (simulating a Go timer firing slightly before the deadline), the
   window must advance by exactly one period. Before loosening the
   drift safeguard threshold, the condition nowWall.Before(since)
   tripped for any sub-period skew and rewound the window to the
   current slot, leaving periodEnd unchanged.

2. When the wall clock is more than one period beyond the expected
   window (simulating hibernation or an NTP step adjustment), the
   safeguard from PR influxdata#16375 must still reset the window near the
   current time rather than leaving periodEnd stuck in the past.

Part of fix for influxdata#18066
@skartikey skartikey force-pushed the fix/aggregator-duplicate-push branch from 4057bb1 to f7f411d Compare April 23, 2026 12:45
@skartikey
Copy link
Copy Markdown
Contributor Author

@srebhan I pushed two fixes (model threshold + agent clamp) and tests, but honestly I think the agent clamp is redundant. Once the model fix lets Push advance the window correctly on an early fire, the duplicate chain breaks on its own. Either fix works in isolation; keeping both feels like belt-and-suspenders in a hot loop. Leaning toward dropping the agent.go change and keeping just the model threshold loosening, but wanted your take before I trim it.

@telegraf-tiger
Copy link
Copy Markdown
Contributor

@srebhan srebhan self-assigned this Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix pr to fix corresponding bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Aggregator plugin executes twice per period on Docker Desktop for Windows

2 participants