feat(status): add TLSReady condition and TLS lifecycle events#377
Conversation
etcd has three TLS surfaces -- peer<->peer, client->server, and the operator's own client identity -- but the alpha CRD conflated all three behind a single `spec.tls` toggle and a single shared issuer, so a user could neither serve peer TLS without client TLS (or vice versa) nor relax mutual client-cert auth per surface. The single predicate also forced `--client-cert-auth`/`--peer-client-cert-auth` on whenever TLS was set. Reshape `spec.tls` into two optional, independent surfaces -- `peer` and `client` -- each carrying its own provider, issuer, and `clientCertAuth` policy (default true). A nil surface serves/dials that surface in cleartext; both nil is fully cleartext and byte-identical to the pre-TLS path. The operator's own client identity follows the `client` surface. This is a clean break on the alpha API (no conversion webhook, which would re-introduce the conflation). The dead `caBundleSecret` knob is removed (type + cert interface; it was read nowhere). `IssuerGroup` is added to the cert-manager provider config (empty => cert-manager.io) so non-default issuer groups can be targeted. Threading (single source of truth, per surface): - peerScheme/clientScheme drive peer vs client URLs and flags independently - defaultArgs emits the server flag group iff client set, the peer flag group iff peer set, and each --*client-cert-auth gated on its surface - cert mounts: server secret iff client set, peer secret iff peer set (append-not-assign so they coexist with the storage data-dir mount) - cert provisioning: server+operator-client certs from the client surface, peer cert from the peer surface - etcdutils MemberList/ClusterHealth/AddMember/PromoteLearner/RemoveMember take an optional *tls.Config (nil => cleartext); the operator builds it from the client surface only, with a missing ca.crt now an explicit error Anti-misconfiguration guardrails: - apply-time CEL XValidation on each surface: provider/providerCfg coherence, mTLS-requires-a-CA, issuerKind/provider enums - a reconcile-time validateTLS backstop re-checks the spec-only rules and requeues on an invalid spec (for apiservers that don't enforce CEL) Also: the cert-site log.Printf calls move to the reconcile-scoped logger, and the "running without TLS" error-level log is demoted to Info (cleartext is a supported mode). The sample is rewritten to the two-tier shared-CA issuer pattern a multi-member peer-mTLS cluster actually needs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
GetCertificateConfig now echoes IssuerRef.Group into ExtraConfig under the issuerGroup key. The TestCertManagerProvider "Get certificate config" assertion built its expected config without that key, so reflect.DeepEqual failed. Add the empty-string issuerGroup to the expected config (no group set => cert-manager leaves Group == ""). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
A peer-TLS cluster could not form past the seed member. With
--peer-client-cert-auth, etcd verifies a joining peer's source IP against
its cert SANs via isHostInDNS (client/pkg transport), which forward-resolves
the cluster's headless-service DNS name. A joining member is not Ready until
it has actually joined the cluster, so without publishNotReadyAddresses its
pod IP is absent from the service's A record, the peer cert SAN check fails
("tls: <ip> does not match any of DNSNames"), the peer handshake is rejected
with EOF, and the joining member crashloops -- deadlocking the cluster at one
member.
Set PublishNotReadyAddresses=true on the headless service so not-yet-Ready
members are resolvable and peer mTLS can complete. Cleartext peer traffic does
not hit this verification path, so non-TLS clusters are unaffected.
Discovered by the T6 multi-member TLS e2e (first live exercise of multi-member
peer mTLS); verified on kind: 3-member both-surface TLS cluster reaches 3
voting members with this change and crashloops at 1 without it.
Belongs to: pr/tls-independence
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…aviour
Fill the two genuine documentation gaps the reshape left:
- spec.tls is effectively create-time; toggling it on a running cluster
drops quorum. Document the create-new-not-flip caveat on the
EtcdClusterSpec.TLS field where a spec author hits it (regenerated CRD).
- buildClientTLSConfig sets no ServerName, so verification relies on the
dialed pod FQDN matching the server cert SANs; custom AltNames must keep
covering *.{name}.{ns}.svc.cluster.local.
Comments only, no behaviour change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Surface the TLS lifecycle on the EtcdCluster CR. The independence reshape wired the EventRecorder and the buildClientTLSConfig/validateTLS surfaces but emitted zero Events and had no TLS status condition; this closes that observability gap (adversarial review section 5). - Add a single TLSReady status condition in updateConditions, reflecting both peer and client surfaces with a per-surface reason vocabulary: TLSNotConfigured (omitted when both surfaces nil), TLSReady, IssuerNotFound, PeerCANotShared, ClientServerCAMismatch, ClientCertificateError, SurfaceNotReady. False with the specific reason on each failure; True only when every configured surface is healthy. - Emit CR Events at the controller boundary (Recorder-free helpers return a typed verdict the boundary maps to Events): Warning on each failure with the matching reason, a single Normal TLSReady on transition into ready. - Implement the runtime checks the reshape deferred (not CEL-expressible): PeerCANotShared detects a self-signed-leaf cert-manager peer issuer that cannot establish shared peer trust, turning the silent cap-at-1-member hazard into a visible signal; client-surface issuer existence and operator-client secret usability are checked from cluster objects. ClientServerCAMismatch is in the vocabulary but unreachable from a single client surface (server + operator-client share one issuer by construction); left as a documented TODO. Unit tests drive every state->reason mapping, the peer-CA predicate against representative issuer inputs, and Event emission via record FakeRecorder. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
…ason reuse Document that PeerCANotShared is a conservative WARNING rather than a hard member cap: the operator mounts ONE shared peer secret into every pod, so a self-signed-leaf peer cert still forms a multi-member quorum. Note the reuse of the ClientCertificateError reason on the validateTLS boundary, which runs before evaluateTLSReadiness and therefore has no per-surface verdict reason. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: xrl The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @xrl. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What
Surface the TLS lifecycle on the
EtcdClusterCR. The independence reshape (#376) wired theEventRecorderand thebuildClientTLSConfig/validateTLSsurfaces but emitted zero Events and had no TLS status condition — so a misconfigured TLS spec failed silently. This closes that observability gap.TLSReadystatus condition inupdateConditions, reflecting both peer and client surfaces with a per-surface reason vocabulary:TLSNotConfigured(omitted when both surfaces nil),TLSReady,IssuerNotFound,PeerCANotShared,ClientServerCAMismatch,ClientCertificateError,SurfaceNotReady. False with the specific reason on each failure; True only when every configured surface is healthy.Warningon each failure with the matching reason, a singleNormalTLSReadyon transition into ready.PeerCANotShareddetects a self-signed-leaf cert-manager peer issuer that cannot establish shared peer trust — turning a silent cap-at-1-member hazard into a visible signal; client-surface issuer existence and operator-client secret usability are checked from cluster objects.ClientServerCAMismatchis in the vocabulary but unreachable from a single client surface (server + operator-client share one issuer by construction) — left as a documented TODO.Unit tests drive every state→reason mapping, the peer-CA predicate against representative issuer inputs, and Event emission via
record.FakeRecorder.go build/go vetclean.Stacking
Stacked on #376 (independent peer/client TLS). Review/merge after #376; this branch contains #376's commits and will shrink to its own two commits once #376 merges.
Related work — etcd TLS & operability
Independent peer/client TLS reshape and surrounding operability work, in dependency / stacking order (
→marks this PR):spec.tls.{peer,client}surfaces; breaking alpha API change (no conversion webhook, by design)TLSReadycondition + TLS lifecycle EventsPeerCANotSharedThe TLS reshape (#376) supersedes the earlier conflated T2/T3/T4 plan (per-surface mounts, flags+scheme, and client
*tls.Confignow all live in #376). T5←#376 and T6←#377 are stacked: review/merge in order. T0, the reconcile/QoS knobs, and the stress harness are independent.