fix(ci_visibility): handle rate limiting errors [backport #17170 to 4.5]#17215
Merged
## Description

HTTP 429 (Too Many Requests) responses from the Datadog backend were falling through to the `>= 400` branch in the backend connector, classifying them as non-retriable `CODE_4XX` errors. This caused CI Visibility data to be silently dropped whenever the backend applied rate limiting.

This fix:

- Adds `RATE_LIMITED` to `ErrorType` with a distinct internal value (`"rate_limited"`)
- Handles 429 responses before the generic 4xx check and marks them as `RATE_LIMITED`
- Adds `RATE_LIMITED` to `RETRIABLE_ERRORS` so these responses are retried up to the existing retry limit
- Parses the `X-RateLimit-Reset` response header to use as the retry delay when present (supporting both Unix timestamps and durations in seconds), falling back to exponential backoff otherwise
- Maps `RATE_LIMITED` to `status_code_4xx_response` in telemetry metrics for cross-language consistency

## Testing

Unit tests added in `tests/testing/internal/test_http.py` covering:

- A 429 that is retried and then succeeds
- A 429 that hits the retry limit
- `X-RateLimit-Reset` as a Unix timestamp → the correct delay is computed
- `X-RateLimit-Reset` as a duration in seconds → the value is used directly
- The retry delay is capped at `MAX_RETRY_AFTER_SECONDS`
- A missing or invalid `X-RateLimit-Reset` → falls back to exponential backoff

Telemetry test updated in `tests/testing/internal/test_telemetry.py` to verify that `RATE_LIMITED` maps to `"status_code"` in `endpoint_payload.requests_errors` and to `"status_code_4xx_response"` in per-request error metrics.

## Risks

Low. The change only affects the retry path for a previously unhandled status code; all other status codes follow the same logic as before.

## Additional Notes

- `RATE_LIMITED` intentionally uses the internal value `"rated_limited"`.replace(...) — `"rate_limited"` (distinct from `CODE_4XX`) to avoid Python enum aliasing, which would have made `RATE_LIMITED` an alias of `CODE_4XX` and caused all 4xx responses to be retried. The mapping to the canonical telemetry value is done explicitly at emission time in `record_error()`.
- The retry delay parsed from `X-RateLimit-Reset` is capped at `MAX_RETRY_AFTER_SECONDS` (120 s) to prevent unreasonable waits, e.g. an expired Unix timestamp being misinterpreted as a duration of billions of seconds.

Co-authored-by: federico.mon <federico.mon@datadoghq.com>

(cherry picked from commit 0d92314)
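The retry-delay logic described above can be sketched as follows. This is a hedged illustration, not the actual ddtrace implementation: the function name, the backoff parameters, and the heuristic for distinguishing a Unix timestamp from a duration are assumptions; only `MAX_RETRY_AFTER_SECONDS = 120` is taken from the description.

```python
import time

# Assumed constants; MAX_RETRY_AFTER_SECONDS (120 s) comes from the PR
# description, the backoff base is a placeholder.
MAX_RETRY_AFTER_SECONDS = 120
BASE_BACKOFF_SECONDS = 1.0


def _exponential_backoff(attempt: int) -> float:
    # Plain exponential backoff, used when the header is absent or invalid.
    return min(BASE_BACKOFF_SECONDS * (2 ** attempt), MAX_RETRY_AFTER_SECONDS)


def retry_delay_from_rate_limit_reset(header_value, attempt, now=None):
    """Compute the delay before retrying a 429 response.

    X-RateLimit-Reset may hold either a Unix timestamp (the reset moment)
    or a duration in seconds; anything unparsable falls back to backoff.
    """
    if now is None:
        now = time.time()
    try:
        value = float(header_value)
    except (TypeError, ValueError):
        return _exponential_backoff(attempt)
    if value <= 0:
        return _exponential_backoff(attempt)
    # Heuristic: a value in the future relative to "now" looks like an
    # absolute timestamp; otherwise treat it as a duration in seconds.
    delay = value - now if value > now else value
    # Cap the wait so an expired timestamp misread as a duration (or a
    # far-future reset) cannot stall the uploader for a very long time.
    return min(delay, MAX_RETRY_AFTER_SECONDS)
```

With `now=1000.0`, a header of `"30"` yields a 30 s delay, `"1030"` (a timestamp 30 s ahead) also yields 30 s, and a garbage value falls back to `1 * 2**attempt` seconds, everything capped at 120 s.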
P403n1x87 approved these changes on Mar 31, 2026.

juan-fernandez approved these changes on Mar 31, 2026.
Backport 0d92314 from #17170 to 4.5.
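The enum-aliasing pitfall mentioned in the Additional Notes can be demonstrated in a few lines. The member names below mirror the PR description, but the value `"status_code_4xx_response"` for `CODE_4XX` is an assumption made for illustration; the real ddtrace enum may differ.

```python
from enum import Enum


class BadErrorType(str, Enum):
    # Hypothetical: reusing CODE_4XX's value makes RATE_LIMITED an *alias*,
    # i.e. the same member under a second name.
    CODE_4XX = "status_code_4xx_response"
    RATE_LIMITED = "status_code_4xx_response"


class ErrorType(str, Enum):
    # Distinct internal value -> RATE_LIMITED is a real, separate member.
    CODE_4XX = "status_code_4xx_response"
    RATE_LIMITED = "rate_limited"


# With the aliased enum, adding RATE_LIMITED to a retriable set would
# also match CODE_4XX, so every 4xx response would be retried:
assert BadErrorType.RATE_LIMITED is BadErrorType.CODE_4XX
assert BadErrorType.CODE_4XX in {BadErrorType.RATE_LIMITED}

# With a distinct value the members stay separate:
assert ErrorType.RATE_LIMITED is not ErrorType.CODE_4XX
assert ErrorType.CODE_4XX not in {ErrorType.RATE_LIMITED}
```

This is why the fix keeps `"rate_limited"` internally and only maps it to the canonical telemetry value at emission time.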