VPAAMP-188 intermittently failing l2 test 6002#1347
pstroffolino wants to merge 2 commits into dev_sprint_25_2
Pull request overview
This PR addresses an intermittent L2 test failure (test_6002 EWMA variant) by preventing stale in-flight progress measurements from being reused as the sole bandwidth estimate after a download completes or aborts, stabilizing ABR behavior on slow/variable CI environments.
Changes:
- Invalidate EWMA “in-flight progress” state at the end of HarmonicEwmaEstimator::UpdateDownloadMetrics().
- Ensure subsequent bandwidth queries rely on completed-sample history until new progress callbacks re-establish context.
    // Invalidate the stale in-flight progress estimate. The completed (or
    // aborted) sample now provides the authoritative throughput figure.
    // Without this, the initial burst of bytes at the start of a stalled
    // download inflates m_progressBytesPerSecond to an unrealistically high
    // value which, when no completed-sample history exists yet, is returned
    // as the sole bandwidth estimate by GetThroughputBytesPerSecond(), causing
    // ABR to ramp up to the highest profile mid-stall and enter a thrash loop.
    // The progress context will be re-established by the first xferinfo()
    // callback of the next download.
    m_progressBytesPerSecond = 0.0;
    m_progressHasSample = false;
    m_progressContextValid = false;
This change fixes a subtle state transition, but there’s no unit test covering the regression scenario (in-flight progress sample present, then an aborted/completed UpdateDownloadMetrics call, and subsequent GetThroughputBytesPerSecond()/GetBandwidthBitsPerSecond() must not use the stale progress estimate before the next xferinfo callback). Consider adding a test that simulates: (1) UpdateDownloadProgress creates a high progress estimate, (2) UpdateDownloadMetrics is called with an aborted sample (e.g., 0 downloaded bytes), and (3) bandwidth/throughput no longer reflects the prior progress burst.
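The three-step scenario in this comment can be sketched against a toy stand-in. This is a minimal illustration, not the real HarmonicEwmaEstimator: the class ToyEstimator, its method signatures, the plain (non-harmonic) EWMA with a 0.5 weight, the rule that zero-byte samples are not added to history, and the clearProgressOnComplete switch are all assumptions made for the example. Only the three member names and the method names are taken from the PR.

```cpp
#include <algorithm>
#include <cassert>

// Toy stand-in for the estimator (hypothetical; NOT the real HarmonicEwmaEstimator).
class ToyEstimator {
public:
    // clearProgressOnComplete = true models the behaviour proposed in this PR.
    explicit ToyEstimator(bool clearProgressOnComplete)
        : m_clearProgressOnComplete(clearProgressOnComplete) {}

    // In-flight update, as would be driven by a transfer-progress callback.
    void UpdateDownloadProgress(double bytesSoFar, double elapsedSeconds) {
        if (elapsedSeconds <= 0.0) return;
        m_progressBytesPerSecond = bytesSoFar / elapsedSeconds;
        m_progressHasSample = true;
        m_progressContextValid = true;
    }

    // Called once when a download completes or is aborted.
    void UpdateDownloadMetrics(double totalBytes, double elapsedSeconds) {
        // Assumption: a zero-byte (aborted) sample is not added to history.
        if (totalBytes > 0.0 && elapsedSeconds > 0.0) {
            constexpr double alpha = 0.5;  // assumed EWMA weight
            const double sample = totalBytes / elapsedSeconds;
            m_blended = m_hasHistory ? alpha * sample + (1.0 - alpha) * m_blended
                                     : sample;
            m_hasHistory = true;
        }
        if (m_clearProgressOnComplete) {
            // The three assignments from the diff.
            m_progressBytesPerSecond = 0.0;
            m_progressHasSample = false;
            m_progressContextValid = false;
        }
    }

    // Returns -1.0 when nothing is known yet.
    double GetThroughputBytesPerSecond() const {
        const bool progressUsable = m_progressHasSample && m_progressContextValid;
        if (m_hasHistory && progressUsable)
            return std::min(m_blended, m_progressBytesPerSecond);
        if (m_hasHistory) return m_blended;
        if (progressUsable) return m_progressBytesPerSecond;
        return -1.0;
    }

private:
    bool m_clearProgressOnComplete;
    double m_blended = 0.0;
    bool m_hasHistory = false;
    double m_progressBytesPerSecond = 0.0;
    bool m_progressHasSample = false;
    bool m_progressContextValid = false;
};

// Steps (1) to (3) from the review comment, with illustrative numbers.
double throughputAfterAbort(bool clearProgressOnComplete) {
    ToyEstimator est(clearProgressOnComplete);
    est.UpdateDownloadProgress(16384.0, 0.0003);  // (1) initial burst
    est.UpdateDownloadMetrics(0.0, 9.0);          // (2) aborted, zero bytes
    return est.GetThroughputBytesPerSecond();     // (3) query before next progress
}
```

In this toy model, throughputAfterAbort(false) still reports the burst rate, while throughputAfterAbort(true) reports "unknown" (-1.0). A real test would use the project's own test fixture and the estimator's actual API in place of these stand-ins.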
HarmonicEwmaEstimator: clear stale in-flight progress on download complete
When a stalled download is aborted (curl error 18 / lowBWTimeout), the initial burst of bytes received before the server stall inflates m_progressBytesPerSecond to an unrealistically high value. If no completed-sample history exists yet (first segment after tune), GetThroughputBytesPerSecond() returns that raw burst figure as the sole bandwidth estimate, causing ABR to ramp up to the highest profile while the download is still stalling.
This triggers a thrash loop where every profile cycles through its stall timeout (~9s each), the buffer never accumulates enough to satisfy the underflow resume threshold, and test_6002 (EWMA estimator variant) times out at 80s.
Fix: at the end of UpdateDownloadMetrics(), invalidate the stale in-flight progress state (m_progressBytesPerSecond, m_progressHasSample, m_progressContextValid). The completed/aborted sample in the EWMA history now provides the authoritative low-bandwidth figure. The progress context is re-established by the first xferinfo() callback of the next download, so the min(blended, progress) capping continues to work correctly for genuinely in-flight segments.
8bcde49 to 208a311
Summary
Fixes intermittent failure of test_6002[UnderflowMonitor_pause_resume_on_delayed_fragments_(EWMA_estimator)_1] when running on slower/variable CI environments.
Root Cause
When a stalled segment download is aborted (curl lowBWTimeout, error 18), a small initial burst of bytes (~16 KB) is received almost instantly before the server stall begins. This inflates m_progressBytesPerSecond to an unrealistically high value (hundreds of Mbit/s).
If no completed-sample history exists yet (first segment after tune), GetThroughputBytesPerSecond() returns that raw burst figure as the sole bandwidth estimate. ABR sees 300–560 Mbit/s and immediately ramps up to the highest profile while the download is still stalling. Each attempt then stalls, ABR cycles through all profiles (~9s × 3 profiles ≈ 27s per cycle), the buffer never fills above the underflow resume threshold, and the test times out at 80s.
The RMO estimator is unaffected because it has no in-flight progress concept and returns -1 (unknown) until a completed sample is available.
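The figures in this section can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes "~16 KB" means 16384 bytes.

```cpp
#include <cassert>
#include <cmath>

// Time (seconds) in which `bytes` must arrive for the measured rate to
// read as `bitsPerSecond`.
constexpr double burstWindowSeconds(double bytes, double bitsPerSecond) {
    return bytes * 8.0 / bitsPerSecond;
}

// Duration of one pass through every profile when each one hits its
// stall timeout.
constexpr double cycleSeconds(double stallTimeoutSeconds, int profileCount) {
    return stallTimeoutSeconds * profileCount;
}

// Number of complete passes that fit inside the test timeout.
inline int fullCyclesBeforeTimeout(double timeoutSeconds, double cycle) {
    return static_cast<int>(std::floor(timeoutSeconds / cycle));
}
```

Under these assumptions, a 16384-byte burst has to arrive within roughly 0.23 to 0.44 ms to be measured as 300 to 560 Mbit/s, which is consistent with bytes being delivered "almost instantly". With a 9s stall timeout and 3 profiles, one cycle takes 27s, so only two full cycles fit inside the 80s test timeout.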
Fix
In HarmonicEwmaEstimator::UpdateDownloadMetrics(), invalidate the stale in-flight progress state (m_progressBytesPerSecond, m_progressHasSample, m_progressContextValid) after recording a completed (or aborted) sample.
The completed sample in EWMA history now provides the authoritative low-bandwidth figure. The progress context is cleanly re-established by the first xferinfo() callback of the next download.
Testing
- HarmonicEwmaEstimatorTests unit tests: unaffected
- test_6002 EWMA variant: expected to pass consistently with this fix