[CI] Add Kubernetes memory limits to docker_image_build_otel job #49233
The docker_image_build_otel CI job has a ~12% flaky failure rate caused by OOM kills during `go mod download` of 40+ OTel modules inside a Docker-in-Docker build. Unlike its sibling jobs (integration_tests_otel at 16Gi, datadog_otel_components_ocb_build at 32Gi), this job had no Kubernetes memory request/limit, relying on runner defaults.
Set KUBERNETES_MEMORY_REQUEST and KUBERNETES_MEMORY_LIMIT to 32Gi (matching the OCB build job which performs the same workload class) to give the pod Guaranteed QoS and prevent OOM kills.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codex review
Codex Review: Didn't find any major issues. Nice work!
Gitlab CI Configuration Changes
Modified Jobs
docker_image_build_otel
docker_image_build_otel:
before_script:
- mkdir -p /tmp/otel-ci
- cp comp/otelcol/collector-contrib/impl/manifest.yaml /tmp/otel-ci/
- cp Dockerfiles/agent-ddot/Dockerfile.agent-otel /tmp/otel-ci/
- cp test/integration/docker/otel_agent_build_tests.py /tmp/otel-ci/
- export OTELCOL_VERSION=$(yq eval '.processors[] | select(.gomod | test("opentelemetry-collector-contrib"))
| .gomod' /tmp/otel-ci/manifest.yaml | head -1 | awk '{print $NF}')
- 'yq eval -i ''.receivers += [{"gomod": "github.com/open-telemetry/opentelemetry-collector-contrib/receiver/k8sobjectsreceiver
" + env(OTELCOL_VERSION)}]'' /tmp/otel-ci/manifest.yaml'
- 'yq eval -i ''.processors += [{"gomod": "github.com/open-telemetry/opentelemetry-collector-contrib/processor/metricstransformprocessor
" + env(OTELCOL_VERSION)}]'' /tmp/otel-ci/manifest.yaml'
image: registry.ddbuild.io/ci/datadog-agent-buildimages/docker_x64$CI_IMAGE_DOCKER_X64_SUFFIX:$CI_IMAGE_DOCKER_X64
needs:
- integration_tests_otel
rules:
- if: $CI_COMMIT_BRANCH =~ /^mq-working-branch-/
when: never
- when: on_success
script:
- DOCKER_LOGIN=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $DOCKER_REGISTRY_RO user)
|| exit $?
- $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $DOCKER_REGISTRY_RO token | docker login
--username "$DOCKER_LOGIN" --password-stdin "$DOCKER_REGISTRY_URL"
- EXIT="${PIPESTATUS[0]}"; if [ $EXIT -ne 0 ]; then echo "Unable to locate credentials
needs gitlab runner restart"; exit $EXIT; fi
- export BUILDIMAGES_COMMIT="${CI_IMAGE_LINUX#*-}"
- export DDA_VERSION="$(curl -s https://raw.githubusercontent.com/DataDog/datadog-agent-buildimages/${BUILDIMAGES_COMMIT}/dda.env
| awk -F= '/^DDA_VERSION=/ {print $2}')"
- 'docker build \
--build-arg AGENT_VERSION=$CI_COMMIT_REF_NAME \
--build-arg AGENT_GIT_REF=$CI_COMMIT_REF_NAME \
--build-arg AGENT_DOCKER_REPO=datadog/agent-dev \
--build-arg AGENT_DOCKER_TAG=nightly-full-main-jmx \
--build-arg BASE_IMAGE_REGISTRY=registry.ddbuild.io/images/mirror \
--build-arg CI \
--build-arg DDA_VERSION=$DDA_VERSION \
--build-arg GOPROXY \
--build-arg GONOSUMDB \
--tag agent-byoc:latest \
-f /tmp/otel-ci/Dockerfile.agent-otel \
/tmp/otel-ci
'
- 'OT_AGENT_IMAGE_NAME=agent-byoc \
OT_AGENT_TAG=latest python3 \
/tmp/otel-ci/otel_agent_build_tests.py
'
stage: integration_test
tags:
- docker-in-docker:amd64
+ variables:
+ KUBERNETES_MEMORY_LIMIT: 32Gi
+ KUBERNETES_MEMORY_REQUEST: 32Gi
Changes Summary
ℹ️ Diff available in the job log.
Files inventory check summary
File checks results against ancestor 9d72b946:
Results for datadog-agent_7.79.0~devel.git.657.8658ed3.pipeline.107264510-1_amd64.deb:
No change detected
Ishirui
left a comment
Thanks a lot for investigating !
Regression Detector
Regression Detector Results
Metrics dashboard
Baseline: d575f76
❌ Experiments with retried target crashes
This is a critical error. One or more replicates failed with a non-zero exit code. These replicates may have been retried. See Replicate Execution Details for more information.
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -3.39 | [-6.31, -0.48] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +2.15 | [+1.98, +2.31] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +1.40 | [+1.17, +1.64] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_logs | % cpu utilization | +0.91 | [-0.76, +2.59] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics | memory utilization | +0.31 | [+0.13, +0.50] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.27 | [+0.24, +0.31] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | +0.20 | [+0.13, +0.26] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.18 | [+0.13, +0.24] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.11 | [+0.03, +0.19] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.02 | [-0.09, +0.12] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.19, +0.22] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.01 | [-0.51, +0.53] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.11, +0.12] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.00 | [-0.21, +0.20] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.02 | [-0.07, +0.04] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.03 | [-0.47, +0.42] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.05 | [-0.43, +0.33] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.09 | [-0.15, -0.02] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.09 | [-0.25, +0.07] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.19 | [-0.32, -0.05] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.20 | [-0.37, -0.03] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.22 | [-0.32, -0.12] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.69 | [-0.91, -0.46] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -3.39 | [-6.31, -0.48] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 721 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 279.44MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 702 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 176.38MiB ≤ 181MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 496.24MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 209.66MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 339.40 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 424.98MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
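The three criteria above can be sketched as a small predicate. This is only an illustration of the stated rule, not the detector's actual implementation; the `Experiment` class, its field names, and `is_regression` are hypothetical names introduced here, and the sample values are copied from the tables in this report.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical container for one row of the change-detection table."""
    name: str
    delta_mean_pct: float    # estimated Δ mean %
    ci_low: float            # lower bound of the 90% confidence interval
    ci_high: float           # upper bound of the 90% confidence interval
    erratic: bool = False    # True if the experiment's configuration marks it "erratic"


def is_regression(exp: Experiment, tolerance_pct: float = 5.0) -> bool:
    """Apply the three criteria: effect size, CI excludes zero, not erratic."""
    big_enough = abs(exp.delta_mean_pct) >= tolerance_pct
    ci_excludes_zero = not (exp.ci_low <= 0.0 <= exp.ci_high)
    return big_enough and ci_excludes_zero and not exp.erratic


# Rows taken from the tables above.
cpu = Experiment("docker_containers_cpu", -3.39, -6.31, -0.48)
syslog = Experiment("tcp_syslog_to_blackhole", 2.15, 1.98, 2.31)

for exp in (cpu, syslog):
    print(exp.name, is_regression(exp))
```

Both rows have a confidence interval that excludes zero, but neither reaches the 5.00% effect size tolerance, which is consistent with the ➖ markers in the tables.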
Replicate Execution Details
We run multiple replicates for each experiment/variant. However, we allow replicates to be automatically retried if there are any failures, up to 8 times, at which point the replicate is marked dead and we are unable to run analysis for the entire experiment. We call each of these attempts at running replicates a replicate execution. This section lists all replicate executions that failed due to the target crashing or being oom killed.
Note: In the tables below we bucket failures by experiment, variant, and failure type. For each bucket we list the replicate indexes that failed, annotated with how many times that replicate failed with the given failure mode. In the example below, the baseline variant of the experiment named experiment_with_failures had two replicates that failed by OOM kills: replicate 0 failed 8 executions and replicate 1 failed 6 executions, all with the same failure mode.
| Experiment | Variant | Replicates | Failure | Logs | Debug Dashboard |
|---|---|---|---|---|---|
| experiment_with_failures | baseline | 0 (x8) 1 (x6) | Oom killed | | Debug Dashboard |
The debug dashboard links will take you to a debugging dashboard specifically designed to investigate replicate execution failures.
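To make the annotation format in the Replicates column concrete, here is a minimal parsing sketch. The function name `parse_replicates` is hypothetical, and treating a bare index with no `(xN)` suffix as a single failed execution is an assumption, not something the report states.

```python
import re
from typing import Dict


def parse_replicates(cell: str) -> Dict[int, int]:
    """Map replicate index -> number of failed executions.

    Accepts cells such as "0 (x8) 1 (x6)". A bare index like "2" is
    ASSUMED to mean the replicate failed once.
    """
    failures: Dict[int, int] = {}
    # Each match is an index optionally followed by "(xN)".
    for index, count in re.findall(r"(\d+)(?:\s*\(x(\d+)\))?", cell):
        failures[int(index)] = int(count) if count else 1
    return failures


print(parse_replicates("0 (x8) 1 (x6)"))
print(parse_replicates("2"))
```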
❌ Retried Normal Replicate Execution Failures (non-profiling)
| Experiment | Variant | Replicates | Failure | Debug Dashboard |
|---|---|---|---|---|
| docker_containers_cpu | baseline | 2 | Crashed (exit code: 1) | Debug Dashboard |
| docker_containers_memory | comparison | 6 | Crashed (exit code: 1) | Debug Dashboard |
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
What does this PR do?
Adds `KUBERNETES_MEMORY_REQUEST` and `KUBERNETES_MEMORY_LIMIT` (32Gi) to the `docker_image_build_otel` CI job, which was the only OTel integration test job without Kubernetes memory limits.
Motivation
The `docker_image_build_otel` job has a ~12% flaky failure rate on `main`, caused by OOM kills during `go mod download` of 40+ OTel modules inside a Docker-in-Docker build (see Slack thread).
Root cause analysis (posted in thread):
- `docker/docker` v28 to `moby/moby` v29 #48777 (one failure on Apr 9, a day before the PR merged)
- `go mod download`: classic OOM pattern
- `integration_tests_otel`: 16Gi
- `datadog_otel_components_ocb_build`: 32Gi + 16 CPU
- `docker_image_build_otel` had none, relying on runner defaults
Setting REQUEST = LIMIT = 32Gi gives the pod a Guaranteed QoS class in Kubernetes, matching the OCB build job which performs the same workload (OCB generation + Go compilation).
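As a rough illustration of the QoS reasoning, the sketch below classifies a pod the way Kubernetes documents it, simplified to a single container. The function names are hypothetical and CPU values are compared as plain strings. Note that Kubernetes also takes CPU into account: equal memory request and limit is necessary for Guaranteed, but the pod only gets that class if CPU request and limit are set and equal as well. Whether the runner supplies matching CPU values for this job is not stated in this PR.

```python
from typing import Optional

# Binary suffixes only; other Kubernetes quantity forms are out of scope here.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '32Gi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def qos_class(
    mem_request: Optional[str],
    mem_limit: Optional[str],
    cpu_request: Optional[str],
    cpu_limit: Optional[str],
) -> str:
    """Single-container simplification of the Kubernetes QoS rules."""
    if all(v is None for v in (mem_request, mem_limit, cpu_request, cpu_limit)):
        return "BestEffort"
    # When only a limit is given, Kubernetes defaults the request to the limit.
    mem_req = mem_request if mem_request is not None else mem_limit
    cpu_req = cpu_request if cpu_request is not None else cpu_limit
    mem_ok = mem_limit is not None and parse_memory(mem_req) == parse_memory(mem_limit)
    cpu_ok = cpu_limit is not None and cpu_req == cpu_limit
    return "Guaranteed" if mem_ok and cpu_ok else "Burstable"


# Memory values from this PR; the CPU value "16" is illustrative only.
print(qos_class("32Gi", "32Gi", "16", "16"))
print(qos_class("32Gi", "32Gi", None, None))
print(qos_class(None, None, None, None))
```

Even where the pod ends up Burstable rather than Guaranteed, a 32Gi memory request still reserves that much memory on the node at scheduling time, and the limit sets the ceiling above which the container is OOM-killed.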
Describe how you validated your changes
- `datadog_otel_components_ocb_build` which does the same work
Additional Notes
The `.ddot_byoc_oci_build_test` template and `ddot_byoc_binary_build_test_ubuntu2004` job in the same file also lack memory limits and perform similar Docker builds. They may benefit from the same treatment as a follow-up.
🤖 Generated with Claude Code