feat: configurable reconcile worker pool + Burstable etcd QoS#379
Conversation
Add a --max-concurrent-reconciles flag (default 5) and thread it into
SetupWithManager via builder.WithOptions(controller.Options{...}).
controller-runtime defaults to a single reconcile worker. Each EtcdCluster
is reconciled on its own workqueue key (deduped by namespaced name), so a
cluster is never reconciled by two workers at once -- concurrency only
parallelizes distinct clusters, with no intra-cluster races. Reconciles here
are heavy and long-running (StatefulSet patches, member-list/health RPCs
against managed etcd, certificate work), so a small pool improves
multi-cluster throughput; the cost of a larger pool is more simultaneous
apiserver and managed-etcd load. A value <= 0 falls back to the safe default
of 1.
Document the flag in its help text, a doc comment on the reconciler field,
and a new docs/operator-flags.md (linked from the README). Add an
envtest-backed test asserting SetupWithManager threads the pool size for
widened, single-worker, and non-positive fallback values.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Set a CPU *request* (never a limit) of 50m on the etcd container so the pod is Burstable instead of BestEffort. Without a request, etcd lands in BestEffort: its cgroup gets the kernel-floor cpu.shares of 2 and it is the first workload evicted under node pressure. A small request lifts it to Burstable, raises cpu.shares to ~51 (a scheduling floor that only bites under contention), and -- being a request -- never throttles etcd. The value is overridable via a new --etcd-cpu-request flag (default "50m") so the effect can be A/B-measured and tuned per fleet. An empty string or "0" applies no request, restoring the original BestEffort behavior; a malformed quantity is treated as unset so a bad flag cannot wedge cluster creation. This is a controller-level knob (identical for every cluster) rather than a CRD field, so it requires no API change. Document it in the flag help, the package var, and docs/operator-flags.md (linked from the README). Add a table-driven test asserting the etcd container carries the 50m request when enabled (and no CPU limit), an overridden value is honored, and empty/"0"/malformed disable it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
Document why the stress e2e runs as three batches, grounded in salvaged spinup-burst measurements (10-CPU/8GB Docker VM): - Tier 1 (size 1-3): a full size-7 bring-up is only ~8 CPU-s, front-loaded on the bootstrap member, so small clusters parallelize freely. - Tier 2 (size 7, churn): 4 concurrent size-7 spinups peak at 12.34 busy cores on a 10-core VM (~1.4-3 cores each) -> only ~1 fits a 2-vCPU runner; --max-concurrent-reconciles, not namespace isolation, is the overlap lever. - Tier 3 (TestStressCrashDuringScale): gofail failpoints are operator-global (one pod), so a crash test must run alone. Also documents the 50m etcd CPU request (cpu.shares 2 -> ~51, request not limit so zero throttling) with the honest caveat that the 10-core VM could not reproduce 2-vCPU contention, and a core-constrained re-run is needed to demonstrate the QoS benefit. No throttling observed in any run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: xrl The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @xrl. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What
Two independent controller-level operability knobs (no API/CRD change), plus the stress-tier rationale doc.
--max-concurrent-reconciles(default 5)controller-runtime defaults to a single reconcile worker. Each
EtcdClusteris reconciled on its own workqueue key (deduped by namespaced name), so a cluster is never reconciled by two workers at once — concurrency only parallelizes distinct clusters, with no intra-cluster races. Reconciles here are heavy and long-running (StatefulSet patches, member-list/health RPCs against managed etcd, certificate work), so a small pool improves multi-cluster throughput; the cost of a larger pool is more simultaneous apiserver and managed-etcd load. A value<= 0falls back to the safe default of 1. Threaded intoSetupWithManagerviabuilder.WithOptions.--etcd-cpu-request(default50m) — Burstable QoSSet a CPU request (never a limit) of
50mon the etcd container so the pod is Burstable instead of BestEffort. Without a request etcd lands in BestEffort: its cgroup gets the kernel-floorcpu.sharesof 2 and it is first evicted under node pressure. A small request lifts it to Burstable, raisescpu.sharesto ~51 (a scheduling floor that only bites under contention), and — being a request — never throttles etcd. Empty/"0"/malformed ⇒ no request (restores BestEffort); a malformed quantity is treated as unset so a bad flag can't wedge cluster creation.Both flags are documented in their help text, a doc comment, and a new
docs/operator-flags.md(linked from the README). Tests: envtest-backed assertion thatSetupWithManagerthreads the pool size (widened / single-worker / non-positive fallback); table-driven assertion that the etcd container carries the50mrequest when enabled (and no CPU limit), honors an override, and disables on empty/"0"/malformed.test/e2e/STRESS.mddocuments why the stress e2e runs as three batches, grounded in salvaged spinup-burst measurements (honest caveat: the 10-core VM could not reproduce 2-vCPU contention, so the QoS benefit needs a core-constrained re-run to demonstrate).Independence
Independent of the TLS series (#376–#378) and of the stress harness (#369) — reviewable and mergeable on its own.
Related work — etcd TLS & operability
Independent peer/client TLS reshape and surrounding operability work, in dependency / stacking order (
→marks this PR):spec.tls.{peer,client}surfaces; breaking alpha API change (no conversion webhook, by design)TLSReadycondition + TLS lifecycle EventsPeerCANotSharedThe TLS reshape (#376) supersedes the earlier conflated T2/T3/T4 plan (per-surface mounts, flags+scheme, and client
*tls.Confignow all live in #376). T5←#376 and T6←#377 are stacked: review/merge in order. T0, the reconcile/QoS knobs, and the stress harness are independent.