fix(ci_visibility): handle rate limiting errors [backport #17170 to 4.5] #17215

Merged
gnufede merged 1 commit into 4.5 from backport-17170-to-4.5
Apr 1, 2026

Conversation

@gnufede
Member

@gnufede gnufede commented Mar 31, 2026

Backport 0d92314 from #17170 to 4.5.

Description

HTTP 429 (Too Many Requests) responses from the Datadog backend were falling through to the >= 400 branch in the backend connector, classifying them as non-retriable CODE_4XX errors. This caused CI visibility data to be silently dropped whenever the backend applied rate limiting.

This fix:

  • Adds RATE_LIMITED to ErrorType with a distinct internal value ("rate_limited")
  • Handles 429 responses before the generic 4xx check and marks them as RATE_LIMITED
  • Adds RATE_LIMITED to RETRIABLE_ERRORS so they are retried up to the existing retry limit
  • Parses the X-RateLimit-Reset response header to use as the retry delay when present (supports both Unix timestamps and durations in seconds), falling back to exponential backoff otherwise
  • Maps RATE_LIMITED to status_code_4xx_response in telemetry metrics for cross-language consistency
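The ordering of the status checks described above can be sketched roughly as follows. This is an illustration only, not the code from `ddtrace/testing/internal/http.py`: the `classify_status` and `is_retriable` helpers, the `CODE_5XX` member, the enum values other than `"rate_limited"`, and the contents of `RETRIABLE_ERRORS` beyond `RATE_LIMITED` are all assumptions made for the example.

```python
from enum import Enum


class ErrorType(Enum):
    # Member names follow the PR description; every value except
    # "rate_limited" is an assumption made for this sketch.
    CODE_4XX = "status_code_4xx_response"
    CODE_5XX = "status_code_5xx_response"
    RATE_LIMITED = "rate_limited"


# Hypothetical retriable set; the real set may contain other members.
RETRIABLE_ERRORS = {ErrorType.CODE_5XX, ErrorType.RATE_LIMITED}


def classify_status(status):
    """Return the ErrorType for an HTTP status, or None if it is not an error."""
    if status == 429:
        # Checked first so 429 never reaches the generic >= 400 branch.
        return ErrorType.RATE_LIMITED
    if status >= 500:
        return ErrorType.CODE_5XX
    if status >= 400:
        return ErrorType.CODE_4XX
    return None


def is_retriable(status):
    """True when the status maps to an error type that should be retried."""
    return classify_status(status) in RETRIABLE_ERRORS
```

With this ordering, `is_retriable(429)` is true while other 4xx codes such as 404 remain non-retriable.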

Testing

Unit tests added in tests/testing/internal/test_http.py covering:

  • 429 retried then succeeds
  • 429 hits retry limit
  • X-RateLimit-Reset as Unix timestamp → correct computed delay
  • X-RateLimit-Reset as duration in seconds → value used directly
  • Retry delay capped at MAX_RETRY_AFTER_SECONDS
  • Missing or invalid X-RateLimit-Reset → falls back to exponential backoff
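The behaviours listed above could be produced by a helper along these lines. It is a minimal sketch under stated assumptions, not the merged implementation: the function name `retry_delay`, its parameters, the base backoff of one second, and the rule "a value greater than the current time is a Unix timestamp, anything else is a duration" are all hypothetical. Only the header name and the 120-second `MAX_RETRY_AFTER_SECONDS` cap come from the PR description.

```python
import time

MAX_RETRY_AFTER_SECONDS = 120.0  # cap named in the PR description


def retry_delay(header_value, attempt, now=None, base_backoff=1.0):
    """Compute a retry delay from X-RateLimit-Reset, else exponential backoff.

    Hypothetical helper: the name, signature, and heuristic are assumptions.
    """
    now = time.time() if now is None else now
    # Fallback: exponential backoff, also kept under the cap.
    fallback = min(base_backoff * (2 ** attempt), MAX_RETRY_AFTER_SECONDS)
    if header_value is None:
        return fallback
    try:
        value = float(header_value)
    except (TypeError, ValueError):
        return fallback
    if value != value or value < 0:  # NaN or negative: treat as invalid
        return fallback
    # Assumed heuristic: a value in the future is a Unix timestamp,
    # anything else is taken as a duration in seconds.
    delay = value - now if value > now else value
    return min(delay, MAX_RETRY_AFTER_SECONDS)
```

Passing `now` explicitly keeps the timestamp case deterministic in tests. Under this heuristic an already-expired timestamp is read as a very long duration, which is the case the cap exists to contain.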

Telemetry test updated in tests/testing/internal/test_telemetry.py to verify RATE_LIMITED maps to "status_code" in endpoint_payload.requests_errors and to "status_code_4xx_response" in per-request error metrics.

Risks

Low. The change only affects the retry path for a previously unhandled status code. All other status codes follow the same logic as before.

Additional Notes

  • RATE_LIMITED intentionally uses the internal value "rate_limited" (distinct from CODE_4XX) to avoid Python enum aliasing, which would have made CODE_4XX an alias and caused all 4xx responses to be retried. The mapping to the canonical telemetry value is done explicitly at emission time in record_error().
  • The retry delay parsed from X-RateLimit-Reset is capped at MAX_RETRY_AFTER_SECONDS (120s) to prevent unreasonable waits — e.g., an expired Unix timestamp being misinterpreted as a duration of billions of seconds.
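The enum aliasing concern in the first note can be reproduced with the standard library alone. The sketch below assumes `CODE_4XX` carries the value `"status_code_4xx_response"`; the class `AliasedErrorType`, the `TELEMETRY_VALUE` table, and the `telemetry_tag` helper are hypothetical names used only to illustrate the idea of mapping at emission time, and are not the contents of `record_error()`.

```python
from enum import Enum


class AliasedErrorType(Enum):
    # What the PR avoids: two names sharing one value.
    CODE_4XX = "status_code_4xx_response"
    RATE_LIMITED = "status_code_4xx_response"  # becomes an alias, not a new member


# Both names resolve to the same member, so a retriable set built from
# RATE_LIMITED would also match every CODE_4XX error.
assert AliasedErrorType.RATE_LIMITED is AliasedErrorType.CODE_4XX
assert AliasedErrorType.CODE_4XX in {AliasedErrorType.RATE_LIMITED}


class ErrorType(Enum):
    # Distinct internal value keeps the two members separate.
    CODE_4XX = "status_code_4xx_response"
    RATE_LIMITED = "rate_limited"


# Hypothetical emission-time mapping to the canonical telemetry value.
TELEMETRY_VALUE = {ErrorType.RATE_LIMITED: "status_code_4xx_response"}


def telemetry_tag(error):
    """Return the value reported in telemetry for an error type."""
    return TELEMETRY_VALUE.get(error, error.value)
```

Keeping the internal value distinct and translating only when the metric is emitted lets the retry logic tell the two cases apart while telemetry still reports the shared 4xx tag.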

Co-authored-by: federico.mon <federico.mon@datadoghq.com>
(cherry picked from commit 0d92314)
@gnufede gnufede requested review from a team as code owners March 31, 2026 14:04
@gnufede gnufede added the CI App label Mar 31, 2026
@gnufede gnufede requested review from P403n1x87 and tabgok March 31, 2026 14:04
@gnufede gnufede self-assigned this Mar 31, 2026
@gnufede gnufede enabled auto-merge (squash) March 31, 2026 14:04
@cit-pr-commenter-54b7da

Codeowners resolved as

ddtrace/testing/internal/http.py                                        @DataDog/ci-app-libraries
ddtrace/testing/internal/telemetry.py                                   @DataDog/ci-app-libraries
releasenotes/notes/ci-visibility-handle-rate-limiting-d7df3d047661bbd9.yaml  @DataDog/apm-python
tests/testing/internal/test_http.py                                     @DataDog/ci-app-libraries
tests/testing/internal/test_telemetry.py                                @DataDog/ci-app-libraries

@gnufede gnufede merged commit 6f6ce8c into 4.5 Apr 1, 2026
1025 of 1028 checks passed
@gnufede gnufede deleted the backport-17170-to-4.5 branch April 1, 2026 09:46
3 participants