feat: configurable reconcile worker pool + Burstable etcd QoS by xrl · Pull Request #379 · etcd-io/etcd-operator

xrl · 2026-06-18T17:55:35Z

What

Two independent controller-level operability knobs (no API/CRD change), plus the stress-tier rationale doc.

`--max-concurrent-reconciles` (default 5)

controller-runtime defaults to a single reconcile worker. Each EtcdCluster is reconciled on its own workqueue key (deduped by namespaced name), so a cluster is never reconciled by two workers at once — concurrency only parallelizes distinct clusters, with no intra-cluster races. Reconciles here are heavy and long-running (StatefulSet patches, member-list/health RPCs against managed etcd, certificate work), so a small pool improves multi-cluster throughput; the cost of a larger pool is more simultaneous apiserver and managed-etcd load. A value <= 0 falls back to the safe default of 1. Threaded into SetupWithManager via builder.WithOptions.

`--etcd-cpu-request` (default `50m`) — Burstable QoS

Set a CPU request (never a limit) of 50m on the etcd container so the pod is Burstable instead of BestEffort. Without a request etcd lands in BestEffort: its cgroup gets the kernel-floor cpu.shares of 2 and it is first evicted under node pressure. A small request lifts it to Burstable, raises cpu.shares to ~51 (a scheduling floor that only bites under contention), and — being a request — never throttles etcd. Empty/"0"/malformed ⇒ no request (restores BestEffort); a malformed quantity is treated as unset so a bad flag can't wedge cluster creation.

Both flags are documented in their help text, a doc comment, and a new docs/operator-flags.md (linked from the README). Tests: envtest-backed assertion that SetupWithManager threads the pool size (widened / single-worker / non-positive fallback); table-driven assertion that the etcd container carries the 50m request when enabled (and no CPU limit), honors an override, and disables on empty/"0"/malformed.

test/e2e/STRESS.md documents why the stress e2e runs as three batches, grounded in salvaged spinup-burst measurements (honest caveat: the 10-core VM could not reproduce 2-vCPU contention, so the QoS benefit needs a core-constrained re-run to demonstrate).

Independence

Independent of the TLS series (#376–#378) and of the stress harness (#369) — reviewable and mergeable on its own.

Related work — etcd TLS & operability

Independent peer/client TLS reshape and surrounding operability work, in dependency / stacking order (→ marks this PR):

	Change	Issue	PR
	TLS independence — independent `spec.tls.{peer,client}` surfaces; breaking alpha API change (no conversion webhook, by design)	#371 #372 #373	—
	`TLSReady` condition + TLS lifecycle Events	—	#376
	multi-member TLS quorum e2e + `PeerCANotShared`	—	#377
	stop swallowing the client-certificate error (requeue)	#370	—
→	configurable reconcile worker pool + Burstable etcd QoS	—	—
	kind stress/scale e2e harness (1/3/7, churn, quorum watcher)	—	—

The TLS reshape (#376) supersedes the earlier conflated T2/T3/T4 plan (per-surface mounts, flags+scheme, and client *tls.Config now all live in #376). T5←#376 and T6←#377 are stacked: review/merge in order. T0, the reconcile/QoS knobs, and the stress harness are independent.

Add a --max-concurrent-reconciles flag (default 5) and thread it into SetupWithManager via builder.WithOptions(controller.Options{...}). controller-runtime defaults to a single reconcile worker. Each EtcdCluster is reconciled on its own workqueue key (deduped by namespaced name), so a cluster is never reconciled by two workers at once -- concurrency only parallelizes distinct clusters, with no intra-cluster races. Reconciles here are heavy and long-running (StatefulSet patches, member-list/health RPCs against managed etcd, certificate work), so a small pool improves multi-cluster throughput; the cost of a larger pool is more simultaneous apiserver and managed-etcd load. A value <= 0 falls back to the safe default of 1. Document the flag in its help text, a doc comment on the reconciler field, and a new docs/operator-flags.md (linked from the README). Add an envtest-backed test asserting SetupWithManager threads the pool size for widened, single-worker, and non-positive fallback values. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

Set a CPU *request* (never a limit) of 50m on the etcd container so the pod is Burstable instead of BestEffort. Without a request, etcd lands in BestEffort: its cgroup gets the kernel-floor cpu.shares of 2 and it is the first workload evicted under node pressure. A small request lifts it to Burstable, raises cpu.shares to ~51 (a scheduling floor that only bites under contention), and -- being a request -- never throttles etcd. The value is overridable via a new --etcd-cpu-request flag (default "50m") so the effect can be A/B-measured and tuned per fleet. An empty string or "0" applies no request, restoring the original BestEffort behavior; a malformed quantity is treated as unset so a bad flag cannot wedge cluster creation. This is a controller-level knob (identical for every cluster) rather than a CRD field, so it requires no API change. Document it in the flag help, the package var, and docs/operator-flags.md (linked from the README). Add a table-driven test asserting the etcd container carries the 50m request when enabled (and no CPU limit), an overridden value is honored, and empty/"0"/malformed disable it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

Document why the stress e2e runs as three batches, grounded in salvaged spinup-burst measurements (10-CPU/8GB Docker VM): - Tier 1 (size 1-3): a full size-7 bring-up is only ~8 CPU-s, front-loaded on the bootstrap member, so small clusters parallelize freely. - Tier 2 (size 7, churn): 4 concurrent size-7 spinups peak at 12.34 busy cores on a 10-core VM (~1.4-3 cores each) -> only ~1 fits a 2-vCPU runner; --max-concurrent-reconciles, not namespace isolation, is the overlap lever. - Tier 3 (TestStressCrashDuringScale): gofail failpoints are operator-global (one pod), so a crash test must run alone. Also documents the 50m etcd CPU request (cpu.shares 2 -> ~51, request not limit so zero throttling) with the honest caveat that the 10-core VM could not reproduce 2-vCPU contention, and a core-constrained re-run is needed to demonstrate the QoS benefit. No throttling observed in any run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>

k8s-ci-robot · 2026-06-18T17:55:39Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign ivanvc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-06-18T17:55:46Z

Hi @xrl. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

xrl and others added 3 commits June 18, 2026 11:00

k8s-ci-robot added the needs-ok-to-test label Jun 18, 2026

k8s-ci-robot added the size/L label Jun 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: configurable reconcile worker pool + Burstable etcd QoS#379

feat: configurable reconcile worker pool + Burstable etcd QoS#379
xrl wants to merge 3 commits into
etcd-io:mainfrom
xrl:pr/reconcile-pool-and-etcd-qos

xrl commented Jun 18, 2026 •

edited

Loading

Uh oh!

k8s-ci-robot commented Jun 18, 2026

Uh oh!

k8s-ci-robot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

xrl commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

--max-concurrent-reconciles (default 5)

--etcd-cpu-request (default 50m) — Burstable QoS

Independence

Related work — etcd TLS & operability

Uh oh!

k8s-ci-robot commented Jun 18, 2026

Uh oh!

k8s-ci-robot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xrl commented Jun 18, 2026 •

edited

Loading

`--max-concurrent-reconciles` (default 5)

`--etcd-cpu-request` (default `50m`) — Burstable QoS