fix(ci_visibility): handle rate limiting errors [backport #17170 to 4.5]#17215
Merged
## Description

HTTP 429 (Too Many Requests) responses from the Datadog backend were falling through to the `>= 400` branch in the backend connector, classifying them as non-retriable `CODE_4XX` errors. This caused CI Visibility data to be silently dropped whenever the backend applied rate limiting.

This fix:

- Adds `RATE_LIMITED` to `ErrorType` with a distinct internal value (`"rate_limited"`)
- Handles 429 responses before the generic 4xx check and marks them as `RATE_LIMITED`
- Adds `RATE_LIMITED` to `RETRIABLE_ERRORS` so these responses are retried up to the existing retry limit
- Parses the `X-RateLimit-Reset` response header to use as the retry delay when present (supporting both Unix timestamps and durations in seconds), falling back to exponential backoff otherwise
- Maps `RATE_LIMITED` to `status_code_4xx_response` in telemetry metrics for cross-language consistency

## Testing

Unit tests added in `tests/testing/internal/test_http.py` covering:

- A 429 that is retried and then succeeds
- A 429 that hits the retry limit
- `X-RateLimit-Reset` as a Unix timestamp → the correct delay is computed
- `X-RateLimit-Reset` as a duration in seconds → the value is used directly
- The retry delay is capped at `MAX_RETRY_AFTER_SECONDS`
- A missing or invalid `X-RateLimit-Reset` → falls back to exponential backoff

Telemetry test updated in `tests/testing/internal/test_telemetry.py` to verify that `RATE_LIMITED` maps to `"status_code"` in `endpoint_payload.requests_errors` and to `"status_code_4xx_response"` in per-request error metrics.

## Risks

Low. The change only affects the retry path for a previously unhandled status code; all other status codes follow the same logic as before.

## Additional Notes

- `RATE_LIMITED` intentionally uses the internal value `"rated_limited"`.replace(...) — `"rate_limited"` (distinct from `CODE_4XX`) to avoid Python enum aliasing, which would have made `RATE_LIMITED` an alias of `CODE_4XX` and caused all 4xx responses to be retried. The mapping to the canonical telemetry value is done explicitly at emission time in `record_error()`.
- The retry delay parsed from `X-RateLimit-Reset` is capped at `MAX_RETRY_AFTER_SECONDS` (120 s) to prevent unreasonable waits, e.g. an expired Unix timestamp being misinterpreted as a duration of billions of seconds.

Co-authored-by: federico.mon <federico.mon@datadoghq.com>

(cherry picked from commit 0d92314)
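The retry-delay logic described above can be sketched as follows. This is a hedged illustration, not the actual ddtrace implementation: the function name, the backoff parameters, and the heuristic for distinguishing a Unix timestamp from a duration are assumptions; only `MAX_RETRY_AFTER_SECONDS = 120` is taken from the description.

```python
import time

# Assumed constants; MAX_RETRY_AFTER_SECONDS (120 s) comes from the PR
# description, the backoff base is a placeholder.
MAX_RETRY_AFTER_SECONDS = 120
BASE_BACKOFF_SECONDS = 1.0


def _exponential_backoff(attempt: int) -> float:
    # Plain exponential backoff, used when the header is absent or invalid.
    return min(BASE_BACKOFF_SECONDS * (2 ** attempt), MAX_RETRY_AFTER_SECONDS)


def retry_delay_from_rate_limit_reset(header_value, attempt, now=None):
    """Compute the delay before retrying a 429 response.

    X-RateLimit-Reset may hold either a Unix timestamp (the reset moment)
    or a duration in seconds; anything unparsable falls back to backoff.
    """
    if now is None:
        now = time.time()
    try:
        value = float(header_value)
    except (TypeError, ValueError):
        return _exponential_backoff(attempt)
    if value <= 0:
        return _exponential_backoff(attempt)
    # Heuristic: a value in the future relative to "now" looks like an
    # absolute timestamp; otherwise treat it as a duration in seconds.
    delay = value - now if value > now else value
    # Cap the wait so an expired timestamp misread as a duration (or a
    # far-future reset) cannot stall the uploader for a very long time.
    return min(delay, MAX_RETRY_AFTER_SECONDS)
```

With `now=1000.0`, a header of `"30"` yields a 30 s delay, `"1030"` (a timestamp 30 s ahead) also yields 30 s, and a garbage value falls back to `1 * 2**attempt` seconds, everything capped at 120 s.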
P403n1x87 approved these changes on Mar 31, 2026.

juan-fernandez approved these changes on Mar 31, 2026.
Backport 0d92314 from #17170 to 4.5.
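The enum-aliasing pitfall mentioned in the Additional Notes can be demonstrated in a few lines. The member names below mirror the PR description, but the value `"status_code_4xx_response"` for `CODE_4XX` is an assumption made for illustration; the real ddtrace enum may differ.

```python
from enum import Enum


class BadErrorType(str, Enum):
    # Hypothetical: reusing CODE_4XX's value makes RATE_LIMITED an *alias*,
    # i.e. the same member under a second name.
    CODE_4XX = "status_code_4xx_response"
    RATE_LIMITED = "status_code_4xx_response"


class ErrorType(str, Enum):
    # Distinct internal value -> RATE_LIMITED is a real, separate member.
    CODE_4XX = "status_code_4xx_response"
    RATE_LIMITED = "rate_limited"


# With the aliased enum, adding RATE_LIMITED to a retriable set would
# also match CODE_4XX, so every 4xx response would be retried:
assert BadErrorType.RATE_LIMITED is BadErrorType.CODE_4XX
assert BadErrorType.CODE_4XX in {BadErrorType.RATE_LIMITED}

# With a distinct value the members stay separate:
assert ErrorType.RATE_LIMITED is not ErrorType.CODE_4XX
assert ErrorType.CODE_4XX not in {ErrorType.RATE_LIMITED}
```

This is why the fix keeps `"rate_limited"` internally and only maps it to the canonical telemetry value at emission time.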