Skip to content

feat: configurable reconcile worker pool + Burstable etcd QoS#379

Open
xrl wants to merge 3 commits into
etcd-io:mainfrom
xrl:pr/reconcile-pool-and-etcd-qos
Open

feat: configurable reconcile worker pool + Burstable etcd QoS#379
xrl wants to merge 3 commits into
etcd-io:mainfrom
xrl:pr/reconcile-pool-and-etcd-qos

Conversation

@xrl

@xrl xrl commented Jun 18, 2026

Copy link
Copy Markdown

What

Two independent controller-level operability knobs (no API/CRD change), plus the stress-tier rationale doc.

--max-concurrent-reconciles (default 5)

controller-runtime defaults to a single reconcile worker. Each EtcdCluster is reconciled on its own workqueue key (deduped by namespaced name), so a cluster is never reconciled by two workers at once — concurrency only parallelizes distinct clusters, with no intra-cluster races. Reconciles here are heavy and long-running (StatefulSet patches, member-list/health RPCs against managed etcd, certificate work), so a small pool improves multi-cluster throughput; the cost of a larger pool is more simultaneous apiserver and managed-etcd load. A value <= 0 falls back to the safe default of 1. Threaded into SetupWithManager via builder.WithOptions.

--etcd-cpu-request (default 50m) — Burstable QoS

Set a CPU request (never a limit) of 50m on the etcd container so the pod is Burstable instead of BestEffort. Without a request etcd lands in BestEffort: its cgroup gets the kernel-floor cpu.shares of 2 and it is first evicted under node pressure. A small request lifts it to Burstable, raises cpu.shares to ~51 (a scheduling floor that only bites under contention), and — being a request — never throttles etcd. Empty/"0"/malformed ⇒ no request (restores BestEffort); a malformed quantity is treated as unset so a bad flag can't wedge cluster creation.

Both flags are documented in their help text, a doc comment, and a new docs/operator-flags.md (linked from the README). Tests: envtest-backed assertion that SetupWithManager threads the pool size (widened / single-worker / non-positive fallback); table-driven assertion that the etcd container carries the 50m request when enabled (and no CPU limit), honors an override, and disables on empty/"0"/malformed.

test/e2e/STRESS.md documents why the stress e2e runs as three batches, grounded in salvaged spinup-burst measurements (honest caveat: the 10-core VM could not reproduce 2-vCPU contention, so the QoS benefit needs a core-constrained re-run to demonstrate).

Independence

Independent of the TLS series (#376#378) and of the stress harness (#369) — reviewable and mergeable on its own.


Related work — etcd TLS & operability

Independent peer/client TLS reshape and surrounding operability work, in dependency / stacking order ( marks this PR):

Change Issue PR Depends on
TLS independence — independent spec.tls.{peer,client} surfaces; breaking alpha API change (no conversion webhook, by design) #371 #372 #373
TLSReady condition + TLS lifecycle Events #376
multi-member TLS quorum e2e + PeerCANotShared #377
stop swallowing the client-certificate error (requeue) #370
configurable reconcile worker pool + Burstable etcd QoS
kind stress/scale e2e harness (1/3/7, churn, quorum watcher)

The TLS reshape (#376) supersedes the earlier conflated T2/T3/T4 plan (per-surface mounts, flags+scheme, and client *tls.Config now all live in #376). T5←#376 and T6←#377 are stacked: review/merge in order. T0, the reconcile/QoS knobs, and the stress harness are independent.

xrl and others added 3 commits June 18, 2026 11:00
Add a --max-concurrent-reconciles flag (default 5) and thread it into
SetupWithManager via builder.WithOptions(controller.Options{...}).

controller-runtime defaults to a single reconcile worker. Each EtcdCluster
is reconciled on its own workqueue key (deduped by namespaced name), so a
cluster is never reconciled by two workers at once -- concurrency only
parallelizes distinct clusters, with no intra-cluster races. Reconciles here
are heavy and long-running (StatefulSet patches, member-list/health RPCs
against managed etcd, certificate work), so a small pool improves
multi-cluster throughput; the cost of a larger pool is more simultaneous
apiserver and managed-etcd load. A value <= 0 falls back to the safe default
of 1.

Document the flag in its help text, a doc comment on the reconciler field,
and a new docs/operator-flags.md (linked from the README). Add an
envtest-backed test asserting SetupWithManager threads the pool size for
widened, single-worker, and non-positive fallback values.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Set a CPU *request* (never a limit) of 50m on the etcd container so the pod
is Burstable instead of BestEffort. Without a request, etcd lands in
BestEffort: its cgroup gets the kernel-floor cpu.shares of 2 and it is the
first workload evicted under node pressure. A small request lifts it to
Burstable, raises cpu.shares to ~51 (a scheduling floor that only bites under
contention), and -- being a request -- never throttles etcd.

The value is overridable via a new --etcd-cpu-request flag (default "50m")
so the effect can be A/B-measured and tuned per fleet. An empty string or
"0" applies no request, restoring the original BestEffort behavior; a
malformed quantity is treated as unset so a bad flag cannot wedge cluster
creation. This is a controller-level knob (identical for every cluster)
rather than a CRD field, so it requires no API change.

Document it in the flag help, the package var, and docs/operator-flags.md
(linked from the README). Add a table-driven test asserting the etcd
container carries the 50m request when enabled (and no CPU limit), an
overridden value is honored, and empty/"0"/malformed disable it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Document why the stress e2e runs as three batches, grounded in salvaged
spinup-burst measurements (10-CPU/8GB Docker VM):

- Tier 1 (size 1-3): a full size-7 bring-up is only ~8 CPU-s, front-loaded
  on the bootstrap member, so small clusters parallelize freely.
- Tier 2 (size 7, churn): 4 concurrent size-7 spinups peak at 12.34 busy
  cores on a 10-core VM (~1.4-3 cores each) -> only ~1 fits a 2-vCPU runner;
  --max-concurrent-reconciles, not namespace isolation, is the overlap lever.
- Tier 3 (TestStressCrashDuringScale): gofail failpoints are operator-global
  (one pod), so a crash test must run alone.

Also documents the 50m etcd CPU request (cpu.shares 2 -> ~51, request not
limit so zero throttling) with the honest caveat that the 10-core VM could
not reproduce 2-vCPU contention, and a core-constrained re-run is needed to
demonstrate the QoS benefit. No throttling observed in any run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign ivanvc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Copy link
Copy Markdown

Hi @xrl. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants