Skip to content

VPAAMP-188 intermittently failing l2 test 6002#1347

Open
pstroffolino wants to merge 2 commits intodev_sprint_25_2from
feature/VPAAMP-188
Open

VPAAMP-188 intermittently failing l2 test 6002#1347
pstroffolino wants to merge 2 commits intodev_sprint_25_2from
feature/VPAAMP-188

Conversation

@pstroffolino
Copy link
Copy Markdown
Contributor

Summary

Fixes intermittent failure of test_6002[UnderflowMonitor_pause_resume_on_delayed_fragments_(EWMA_estimator)_1] when running on slower/variable CI environments.

Root Cause

When a stalled segment download is aborted (curl lowBWTimeout, error 18), a small initial burst of bytes (~16 KB) is received almost instantly before the server stall begins. This inflates m_progressBytesPerSecond to an unrealistically high value (100s of Mbit/s).

If no completed-sample history exists yet (first segment after tune), GetThroughputBytesPerSecond() returns that raw burst figure as the sole bandwidth estimate. ABR sees 300–560 Mbit/s and immediately ramps up to the highest profile — while the download is still stalling. Each attempt then stalls, cycles through all profiles (~9s × 3 profiles ≈ 27s per cycle), the buffer never fills above the underflow resume threshold, and the test times out at 80s.

The RMO estimator is unaffected because it has no in-flight progress concept and returns -1 (unknown) until a completed sample is available.

Fix

In HarmonicEwmaEstimator::UpdateDownloadMetrics(), invalidate the stale in-flight progress state after recording a completed (or aborted) sample:

m_progressBytesPerSecond = 0.0;
m_progressHasSample = false;
m_progressContextValid = false;

The completed sample in EWMA history now provides the authoritative low-bandwidth figure. The progress context is cleanly re-established by the first xferinfo() callback of the next download.

Testing

  • Existing HarmonicEwmaEstimatorTests unit tests: unaffected
  • L2 test test_6002 EWMA variant: expected to pass consistently with this fix

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses an intermittent L2 test failure (test_6002 EWMA variant) by preventing stale in-flight progress measurements from being reused as the sole bandwidth estimate after a download completes or aborts, stabilizing ABR behavior on slow/variable CI environments.

Changes:

  • Invalidate EWMA “in-flight progress” state at the end of HarmonicEwmaEstimator::UpdateDownloadMetrics().
  • Ensure subsequent bandwidth queries rely on completed-sample history until new progress callbacks re-establish context.

Comment on lines +148 to +159
// Invalidate the stale in-flight progress estimate. The completed (or
// aborted) sample now provides the authoritative throughput figure.
// Without this, the initial burst of bytes at the start of a stalled
// download inflates m_progressBytesPerSecond to an unrealistically high
// value which — when no completed-sample history exists yet — is returned
// as the sole bandwidth estimate by GetThroughputBytesPerSecond(), causing
// ABR to ramp up to the highest profile mid-stall and enter a thrash loop.
// The progress context will be re-established by the first xferinfo()
// callback of the next download.
m_progressBytesPerSecond = 0.0;
m_progressHasSample = false;
m_progressContextValid = false;
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change fixes a subtle state transition, but there’s no unit test covering the regression scenario (in-flight progress sample present, then an aborted/completed UpdateDownloadMetrics call, and subsequent GetThroughputBytesPerSecond()/GetBandwidthBitsPerSecond() must not use the stale progress estimate before the next xferinfo callback). Consider adding a test that simulates: (1) UpdateDownloadProgress creates a high progress estimate, (2) UpdateDownloadMetrics is called with an aborted sample (e.g., 0 downloaded bytes), and (3) bandwidth/throughput no longer reflects the prior progress burst.

Copilot generated this review using guidance from repository custom instructions.
HarmonicEwmaEstimator: clear stale in-flight progress on download complete

When a stalled download is aborted (curl error 18 / lowBWTimeout), the
initial burst of bytes received before the server stall inflates
m_progressBytesPerSecond to an unrealistically high value. If no
completed-sample history exists yet (first segment after tune),
GetThroughputBytesPerSecond() returns that raw burst figure as the sole
bandwidth estimate, causing ABR to ramp up to the highest profile while
the download is still stalling. This triggers a thrash loop where every
profile cycles through its stall timeout (~9s each), the buffer never
accumulates enough to satisfy the underflow resume threshold, and
test_6002 (EWMA estimator variant) times out at 80s.

Fix: at the end of UpdateDownloadMetrics(), invalidate the stale
in-flight progress state (m_progressBytesPerSecond, m_progressHasSample,
m_progressContextValid). The completed/aborted sample in the EWMA history
now provides the authoritative low-bandwidth figure. The progress context
is re-established by the first xferinfo() callback of the next download,
so the min(blended, progress) capping continues to work correctly for
genuinely in-flight segments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants