Skip to content

feat: implement and register EtcdCluster admission webhooks#384

Open
xrl wants to merge 6 commits into
etcd-io:mainfrom
xrl:pr/admission-webhooks
Open

feat: implement and register EtcdCluster admission webhooks#384
xrl wants to merge 6 commits into
etcd-io:mainfrom
xrl:pr/admission-webhooks

Conversation

@xrl

@xrl xrl commented Jun 18, 2026

Copy link
Copy Markdown

Summary

The manager already constructs a webhook server (webhook.NewServer(...) in cmd/main.go) and the kustomize scaffolding carries [WEBHOOK]/[CERTMANAGER] markers, but no admission webhook was ever implemented or registered for EtcdCluster. The server started and served nothing, so malformed specs were only (partially) caught later during reconciliation, with no synchronous, actionable feedback at kubectl apply time.

This PR implements and registers a validating + defaulting admission webhook for EtcdCluster.

Refs #380

What it does

SetupWebhookWithManager is called from cmd/main.go (a minimal, localized edit, skippable with ENABLE_WEBHOOKS=false for local make run without serving certs).

Validating (ValidateCreate / ValidateUpdate)

Every rejection returns a crisp, actionable message telling the user exactly what is wrong and how to fix it:

  • spec.size — must be >= 1 and odd. An even-sized etcd cluster tolerates no more failures than the next-smaller odd size while being more likely to lose quorum; the message names the two nearest valid sizes.
  • spec.version — must be valid semver. On update, the current→target transition must be a supported single-minor, non-downgrade path, reusing the same ordered version table (etcd/api/v3/version.AllVersions) the controller uses at reconcile time.
  • spec.tlsprovider must be one of {auto, cert-manager}; the provider-specific config block must be present/complete (cert-manager requires issuerName, and issuerKind if set must be Issuer/ClusterIssuer); the auto provider must not carry cert-manager config.
  • spec.storageSpec — immutable once set (changing/removing the PVC template after the StatefulSet exists cannot be reconciled in place).

Defaulting (Default)

  • spec.tls.provider defaults to auto when a TLS block is present without an explicit provider (matches the documented behaviour of TLSCertificate.Provider).

Sample rejection messages

spec.size: Invalid value: 4: size must be an odd number so the cluster can form a
  majority quorum; got 4. An even-sized etcd cluster tolerates no more failures than
  the next-smaller odd size while being more likely to lose quorum. Use 3 or 5 instead.

spec.version: Invalid value: "latest": version "latest" is not a valid semantic version
  (expected MAJOR.MINOR.PATCH, e.g. "3.6.1"): latest is not in dotted-tri format.

spec.version: Invalid value: "3.7.1": upgrading from version 3.5.1 to version 3.7.1 is
  not allowed (skips a minor version). etcd only supports sequential single-minor
  upgrades and forbids downgrades; upgrade one minor version at a time
  (current "3.5.1" -> target "3.7.1").

spec.tls.provider: Unsupported value: "vault": supported values: "auto", "cert-manager"

Logging

The webhook logs every admission decision (admit / default / reject) with structured context — object name/namespace, operation, and per-field reason/remediation — so an operator reading the webhook-server logs can reconstruct exactly why any request was accepted or rejected.

Manifests

make manifests generates config/webhook/manifests.yaml; this PR adds the standard config/webhook (service + kustomizeconfig) and config/certmanager (self-signed Issuer + serving Certificate) scaffolds, the manager_webhook_patch.yaml, and enables the [WEBHOOK]/[CERTMANAGER] sections in config/default/kustomization.yaml including cert-manager CA injection. kustomize build config/default resolves the service name/namespace and cert-manager.io/inject-ca-from annotations correctly.

Testing

  • Unit (api/v1alpha1/etcdcluster_webhook_test.go): asserts the exact rejection-message text for size/version/TLS/upgrade/immutability and the defaulting behaviour.
  • Envtest (api/v1alpha1/webhook_envtest_test.go): installs the generated webhook config, runs a manager serving the webhook over the envtest-provisioned cert, applies invalid EtcdClusters against a real API server and asserts rejection + message; confirms defaulting and a valid single-minor upgrade are admitted.
  • E2E (test/e2e/webhook_test.go, new file): live apply-invalid assertions via the e2e-framework feature harness.

go build ./..., go vet ./..., golangci-lint, the unit tests, and the envtest webhook tests all pass; the e2e file compiles (go test -c ./test/e2e/). The e2e file is self-contained (its own helper) to avoid conflicts with other in-flight e2e PRs.

🤖 Generated with Claude Code

The manager wires a webhook server but never registered any webhook, so
malformed EtcdCluster specs were only (partially) caught at reconcile time
with no synchronous, actionable feedback at apply time (see etcd-io#380).

Add a validating + defaulting webhook for EtcdCluster and register it via
SetupWebhookWithManager in cmd/main.go (skippable with ENABLE_WEBHOOKS=false).

Validations (each rejection carries a crisp, actionable message):
- size: must be >= 1 and odd (etcd quorum); message names the next valid sizes.
- version: must be valid semver; on update, must be a supported single-minor,
  non-downgrade transition (reuses the controller's ordered version table).
- tls: provider must be one of {auto, cert-manager}; the matching providerCfg
  block must be present/complete; auto must not carry cert-manager config.
- storageSpec: immutable once set.

Defaulting:
- tls.provider defaults to "auto" when a TLS block is present without a provider.

Generate config/webhook + config/certmanager (cert-manager serving cert) and
enable the WEBHOOK/CERTMANAGER kustomize sections, including CA injection.

Tested: unit tests assert the exact rejection message text; envtest registers
the webhook with a test manager and applies invalid CRs to assert rejection +
message; a new test/e2e/webhook_test.go authors live apply-invalid assertions.

Refs etcd-io#380

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign hakman for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Copy link
Copy Markdown

Hi @xrl. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

xrl and others added 5 commits June 18, 2026 20:01
…n to changed fields

Three correctness fixes to the EtcdCluster admission webhooks:

C1: storageSpec immutability used struct `!=` on a type embedding
resource.Quantity fields. `==`/`!=` compares the Quantity's cached string
and *inf.Dec pointer, not the numeric value, so a re-spelled but
numerically identical quantity (e.g. 1Gi vs 1024Mi) falsely tripped
"storageSpec is immutable" under failurePolicy: Fail. Replace with
apiequality.Semantic.DeepEqual (apimachinery, already a direct dep).
Verified empirically: 1Gi == 1024Mi is false under `==` but true under
Semantic.DeepEqual; 1Gi vs 2Gi remains unequal so immutability holds.

C2: the storage immutability test only set StorageClassName, leaving both
Quantities zero-valued, which is exactly the case where the buggy `==`
happens to agree. Populate VolumeSizeRequest in withStorage() and add two
cases: 1Gi vs 1024Mi must be allowed (C1 regression guard) and 1Gi vs 2Gi
must still be rejected.

C3: ValidateUpdate re-ran the full create-time spec validation, so with
failurePolicy: Fail and no objectSelector a pre-existing even-sized or
unknown-version cluster (the CRD historically only enforced size >= 1) was
rejected on every spec edit, with no way to remediate. Only re-check the
size and version-format invariants when those fields actually change;
always re-validate TLS coherence and the transition (immutability +
upgrade-path) checks. ValidateCreate is unchanged. Adds a test asserting
legacy even-sized clusters accept unrelated edits and can be remediated,
while a changed-but-still-even size is still rejected.

Also documents that validateUpgradePath checks the declared spec transition
only; the controller remains authoritative against the live image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…g, and spec validation

Completes the admission-webhook PR by making the generated webhook
configuration explicit and safe for production, verifying the
cert-manager caBundle injection chain renders end to end, and closing
remaining gaps in synchronous spec validation.

Webhook policy (generated via pinned controller-gen markers):
- failurePolicy=fail is now an explicit, documented decision. These
  webhooks are correctness-critical for a stateful quorum store, so they
  fail closed. The rule scope is exactly {operator.etcd.io/v1alpha1
  etcdclusters}, so a webhook outage never blocks core cluster objects;
  a documented break-glass objectSelector (label
  etcd.operator.etcd.io/skip-webhooks=true) lets an admin bypass
  admission for a single CR while the operator is down.
- timeoutSeconds=10 and matchPolicy=Equivalent added with rationale.

cert-manager caBundle injection:
- Added test/config render assertions that run `kustomize build
  config/default` and verify the inject-ca-from annotation points at the
  serving Certificate, the Certificate's dnsNames resolve to the webhook
  Service FQDN (no surviving placeholders), the issuerRef and mounted
  secret name line up, and both webhooks carry the hardened policy +
  break-glass selector.

Validation/defaulting completeness:
- validateStorageSpec mirrors the controller's reconcile-time storage
  invariants (volumeSizeRequest >= 1Mi and required, volumeSizeLimit >=
  request, supported accessModes, pvcName required for ReadWriteMany) so
  they are caught synchronously at apply time with actionable messages;
  also validated when storage is newly added on update.
- commonName is rejected when it exceeds the documented 64-char X.509 CN
  ceiling for both auto and cert-manager providers.
- New unit tests assert the exact rejection-message text; new envtest +
  compile-only e2e cases cover undersized storage.

Deferred (intentionally not built): a conversion webhook. It is premature
while only v1alpha1 exists; matchPolicy=Equivalent already guards against
a future equivalent group/version slipping past the validator.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…me up

The admission-webhook feature wires config/default to ../certmanager: a
cert-manager Certificate/Issuer issues the webhook-server-cert secret and
manager_webhook_patch.yaml mounts it into the manager pod. The e2e suite's
TestMain installed cert-manager only lazily inside an individual feature test
that runs AFTER setup, so 'make deploy DEPLOY_MODE=e2e' ran with no
cert-manager present: the Certificate/Issuer had no CRDs to apply against, the
secret never materialized, the manager pod hung in ContainerCreating, and the
DeploymentAvailable wait timed out before any webhook assertion could run.

Install cert-manager (InstallCertManager blocks until cert-manager-webhook is
Available) as a setup step BEFORE the deploy step. Verified live: without this
the manager never becomes Available; with it the controller-manager is Available
~14s after deploy and all webhook assertions pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
The webhook e2e fed the suite-wide etcdVersion ("v3.6.12", v-prefixed) into
every case. semver.NewVersion("v3.6.12") errors, so the new version webhook
rejects it: the happy-path 'admits valid cluster' assertion failed outright,
and the size/tls/storage reject cases passed for the WRONG reason (the
aggregated NewInvalid also carried a version error, so a substring match no
longer isolated the invariant under test).

Improvements:
- Use a hardcoded valid semver (3.6.1) for every non-version case, matching the
  envtest twin, so each assertion proves exactly its one invariant.
- requireWebhookRejection helper: assert apierrors.IsInvalid (a real 422 from
  the validating webhook, not a CRD/schema rejection), assert the exact
  spec.<field> path, and assert that unrelated invariants are NOT present
  (isolation) -- this is the check that kills the 'passes for the wrong reason'
  failure mode.
- Tighten message asserts to the envtest twin's exact text: 'Use 1 or 3
  instead' for even size, supported values: "auto", "cert-manager" for the
  bad provider (catches a garbled NotSupported list a loose Contains pair would
  miss).
- Add the break-glass objectSelector proof: a skip-webhooks-labeled even-sized
  cluster is admitted (bypass) while the same spec unlabeled is still rejected.
  This is the one behavior only a live API server can prove (render test is
  static, envtest has no objectSelector) and the most dangerous new knob.
- Prove the positive upgrade path (3.5.1 -> 3.6.1 admitted) after the skip-minor
  rejection, so a reject-all-updates regression cannot hide behind the negative
  case.
- Use size 1 for the two clusters that are actually created (still odd/valid) to
  cut reconcile load on the kind node.

No sleeps/polling added: admission is synchronous, so a Create/Update return IS
the decision. Live: all 7 sub-asserts green; suite wall-clock ~127-231s, the
8 assertions themselves run in ~0.3s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…-tagged images

The admission webhook validated spec.version with coreos/go-semver, which
rejects a leading "v". But the default registry (gcr.io/etcd-development/etcd)
publishes ONLY v-prefixed tags, and the controller built the image as
"<registry>:<version>". So with the webhook on:

  - "v3.6.1" (the form config/samples actually ships -> v3.5.21, matching
    origin/main) was REJECTED by the validator, and
  - a bare "3.6.1" that the validator did accept rendered
    "gcr.io/etcd-development/etcd:3.6.1", a tag the registry does not publish
    -> NotFound / ImagePullBackOff.

The default image was therefore unpullable whenever the webhook was enabled.

Fix, end to end:

  - Webhook NORMALIZES spec.version: a leading "v" is stripped before
    go-semver parses it, so BOTH "v3.6.1" and "3.6.1" validate, including the
    upgrade-path comparison (validateUpgradePath/checkUpgradePath) and the
    no-op short-circuit. This keeps the operator's own v-prefixed samples
    appliable rather than self-rejecting.
  - Controller NORMALIZES the rendered tag to the registry convention
    (internal/controller/utils.go: etcdImageTag) by prepending "v" when the
    user supplied a bare semver, so a bare spec.version still pulls. An
    already-v-prefixed value and an explicit non-semver custom tag are passed
    through untouched so non-default registries are not disturbed.
  - Controller's reconcile-time upgrade check now compares and parses on the
    normalized form, so the v-prefixed live image tag and a bare spec.version
    for the same release are not mistaken for a version change (and parse
    cleanly through validateEtcdUpgradePath).

Tests: webhook accepts "v3.6.1" and "3.6.1" and a v-prefixed single-minor
upgrade path (skip-minor still rejected); etcdImageTag and an end-to-end
StatefulSet render assert the tag is v-prefixed for the default registry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants