Skip to content

feat(status): add TLSReady condition and TLS lifecycle events#377

Open
xrl wants to merge 6 commits into
etcd-io:mainfrom
xrl:pr/T5-tls-conditions
Open

feat(status): add TLSReady condition and TLS lifecycle events#377
xrl wants to merge 6 commits into
etcd-io:mainfrom
xrl:pr/T5-tls-conditions

Conversation

@xrl

@xrl xrl commented Jun 18, 2026

Copy link
Copy Markdown

What

Surface the TLS lifecycle on the EtcdCluster CR. The independence reshape (#376) wired the EventRecorder and the buildClientTLSConfig/validateTLS surfaces but emitted zero Events and had no TLS status condition — so a misconfigured TLS spec failed silently. This closes that observability gap.

  • TLSReady status condition in updateConditions, reflecting both peer and client surfaces with a per-surface reason vocabulary: TLSNotConfigured (omitted when both surfaces nil), TLSReady, IssuerNotFound, PeerCANotShared, ClientServerCAMismatch, ClientCertificateError, SurfaceNotReady. False with the specific reason on each failure; True only when every configured surface is healthy.
  • CR Events at the controller boundary (Recorder-free helpers return a typed verdict the boundary maps to Events): Warning on each failure with the matching reason, a single Normal TLSReady on transition into ready.
  • Implements the runtime checks the reshape deferred (not CEL-expressible): PeerCANotShared detects a self-signed-leaf cert-manager peer issuer that cannot establish shared peer trust — turning a silent cap-at-1-member hazard into a visible signal; client-surface issuer existence and operator-client secret usability are checked from cluster objects. ClientServerCAMismatch is in the vocabulary but unreachable from a single client surface (server + operator-client share one issuer by construction) — left as a documented TODO.

Unit tests drive every state→reason mapping, the peer-CA predicate against representative issuer inputs, and Event emission via record.FakeRecorder. go build/go vet clean.

Stacking

Stacked on #376 (independent peer/client TLS). Review/merge after #376; this branch contains #376's commits and will shrink to its own two commits once #376 merges.


Related work — etcd TLS & operability

Independent peer/client TLS reshape and surrounding operability work, in dependency / stacking order ( marks this PR):

Change Issue PR Depends on
TLS independence — independent spec.tls.{peer,client} surfaces; breaking alpha API change (no conversion webhook, by design) #371 #372 #373
TLSReady condition + TLS lifecycle Events #376
multi-member TLS quorum e2e + PeerCANotShared #377
stop swallowing the client-certificate error (requeue) #370
configurable reconcile worker pool + Burstable etcd QoS
kind stress/scale e2e harness (1/3/7, churn, quorum watcher)

The TLS reshape (#376) supersedes the earlier conflated T2/T3/T4 plan (per-surface mounts, flags+scheme, and client *tls.Config now all live in #376). T5←#376 and T6←#377 are stacked: review/merge in order. T0, the reconcile/QoS knobs, and the stress harness are independent.

xrl and others added 6 commits June 17, 2026 22:20
etcd has three TLS surfaces -- peer<->peer, client->server, and the
operator's own client identity -- but the alpha CRD conflated all three
behind a single `spec.tls` toggle and a single shared issuer, so a user
could neither serve peer TLS without client TLS (or vice versa) nor relax
mutual client-cert auth per surface. The single predicate also forced
`--client-cert-auth`/`--peer-client-cert-auth` on whenever TLS was set.

Reshape `spec.tls` into two optional, independent surfaces -- `peer` and
`client` -- each carrying its own provider, issuer, and `clientCertAuth`
policy (default true). A nil surface serves/dials that surface in
cleartext; both nil is fully cleartext and byte-identical to the pre-TLS
path. The operator's own client identity follows the `client` surface.

This is a clean break on the alpha API (no conversion webhook, which would
re-introduce the conflation). The dead `caBundleSecret` knob is removed
(type + cert interface; it was read nowhere). `IssuerGroup` is added to the
cert-manager provider config (empty => cert-manager.io) so non-default
issuer groups can be targeted.

Threading (single source of truth, per surface):
- peerScheme/clientScheme drive peer vs client URLs and flags independently
- defaultArgs emits the server flag group iff client set, the peer flag
  group iff peer set, and each --*client-cert-auth gated on its surface
- cert mounts: server secret iff client set, peer secret iff peer set
  (append-not-assign so they coexist with the storage data-dir mount)
- cert provisioning: server+operator-client certs from the client surface,
  peer cert from the peer surface
- etcdutils MemberList/ClusterHealth/AddMember/PromoteLearner/RemoveMember
  take an optional *tls.Config (nil => cleartext); the operator builds it
  from the client surface only, with a missing ca.crt now an explicit error

Anti-misconfiguration guardrails:
- apply-time CEL XValidation on each surface: provider/providerCfg
  coherence, mTLS-requires-a-CA, issuerKind/provider enums
- a reconcile-time validateTLS backstop re-checks the spec-only rules and
  requeues on an invalid spec (for apiservers that don't enforce CEL)

Also: the cert-site log.Printf calls move to the reconcile-scoped logger,
and the "running without TLS" error-level log is demoted to Info (cleartext
is a supported mode). The sample is rewritten to the two-tier shared-CA
issuer pattern a multi-member peer-mTLS cluster actually needs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
GetCertificateConfig now echoes IssuerRef.Group into ExtraConfig under
the issuerGroup key. The TestCertManagerProvider "Get certificate config"
assertion built its expected config without that key, so reflect.DeepEqual
failed. Add the empty-string issuerGroup to the expected config (no group
set => cert-manager leaves Group == "").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
A peer-TLS cluster could not form past the seed member. With
--peer-client-cert-auth, etcd verifies a joining peer's source IP against
its cert SANs via isHostInDNS (client/pkg transport), which forward-resolves
the cluster's headless-service DNS name. A joining member is not Ready until
it has actually joined the cluster, so without publishNotReadyAddresses its
pod IP is absent from the service's A record, the peer cert SAN check fails
("tls: <ip> does not match any of DNSNames"), the peer handshake is rejected
with EOF, and the joining member crashloops -- deadlocking the cluster at one
member.

Set PublishNotReadyAddresses=true on the headless service so not-yet-Ready
members are resolvable and peer mTLS can complete. Cleartext peer traffic does
not hit this verification path, so non-TLS clusters are unaffected.

Discovered by the T6 multi-member TLS e2e (first live exercise of multi-member
peer mTLS); verified on kind: 3-member both-surface TLS cluster reaches 3
voting members with this change and crashloops at 1 without it.

Belongs to: pr/tls-independence

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…aviour

Fill the two genuine documentation gaps the reshape left:

- spec.tls is effectively create-time; toggling it on a running cluster
  drops quorum. Document the create-new-not-flip caveat on the
  EtcdClusterSpec.TLS field where a spec author hits it (regenerated CRD).
- buildClientTLSConfig sets no ServerName, so verification relies on the
  dialed pod FQDN matching the server cert SANs; custom AltNames must keep
  covering *.{name}.{ns}.svc.cluster.local.

Comments only, no behaviour change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Surface the TLS lifecycle on the EtcdCluster CR. The independence reshape
wired the EventRecorder and the buildClientTLSConfig/validateTLS surfaces
but emitted zero Events and had no TLS status condition; this closes that
observability gap (adversarial review section 5).

- Add a single TLSReady status condition in updateConditions, reflecting
  both peer and client surfaces with a per-surface reason vocabulary:
  TLSNotConfigured (omitted when both surfaces nil), TLSReady,
  IssuerNotFound, PeerCANotShared, ClientServerCAMismatch,
  ClientCertificateError, SurfaceNotReady. False with the specific reason
  on each failure; True only when every configured surface is healthy.

- Emit CR Events at the controller boundary (Recorder-free helpers return a
  typed verdict the boundary maps to Events): Warning on each failure with
  the matching reason, a single Normal TLSReady on transition into ready.

- Implement the runtime checks the reshape deferred (not CEL-expressible):
  PeerCANotShared detects a self-signed-leaf cert-manager peer issuer that
  cannot establish shared peer trust, turning the silent cap-at-1-member
  hazard into a visible signal; client-surface issuer existence and
  operator-client secret usability are checked from cluster objects.
  ClientServerCAMismatch is in the vocabulary but unreachable from a single
  client surface (server + operator-client share one issuer by
  construction); left as a documented TODO.

Unit tests drive every state->reason mapping, the peer-CA predicate against
representative issuer inputs, and Event emission via record FakeRecorder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…ason reuse

Document that PeerCANotShared is a conservative WARNING rather than a hard
member cap: the operator mounts ONE shared peer secret into every pod, so a
self-signed-leaf peer cert still forms a multi-member quorum. Note the reuse
of the ClientCertificateError reason on the validateTLS boundary, which runs
before evaluateTLSReadiness and therefore has no per-surface verdict reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Copy link
Copy Markdown

Hi @xrl. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants