Description
We use the tigera-operator to manage our Calico installation. After upgrading our production environment from Calico version 3.30.3 to 3.31.0 (version 1.38.6 to 1.40.0 of the tigera-operator), we began getting TLS errors from the calico-apiserver. Here is one of the related errors from the kube-apiserver log:
```
loading OpenAPI spec for "v3.projectcalico.org" failed with: failed to download v3.projectcalico.org: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: error trying to reach service: tls: failed to verify certificate: x509: certificate is valid for calico-api, calico-api.calico-apiserver, calico-api.calico-apiserver.svc, calico-api.calico-apiserver.svc.cluster.local, not calico-api.calico-system.svc
```
As mentioned in the release notes, this upgrade moved the calico-apiserver from the calico-apiserver namespace to the calico-system namespace.
I looked at the cert in the calico-apiserver-certs secret in both the tigera-operator namespace and the calico-system namespace, and verified that the Subject Alternative Names in the certificate were indeed for the old calico-apiserver namespace, as indicated by the error message. (The secret in calico-system had just been created, and I believe it was copied from the one in tigera-operator.)
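For reference, this is roughly how the SANs can be checked (a sketch; it assumes the certificate is stored under the usual tls.crt key of the secret):

```sh
# Decode the certificate from the secret and print its Subject Alternative Names
# (assumes the cert is under the standard tls.crt key)
kubectl get secret calico-apiserver-certs -n tigera-operator \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -ext subjectAltName

# Same check for the copy in calico-system
kubectl get secret calico-apiserver-certs -n calico-system \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -ext subjectAltName
```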
To force the certificate to be reissued, I deleted the calico-apiserver-certs secret in the tigera-operator namespace; a short time later it was recreated with a new certificate whose Subject Alternative Names now correctly include the calico-system namespace.
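In other words, the workaround was simply deleting the stale secret and letting the operator regenerate it, along the lines of:

```sh
# Delete the stale secret; the tigera-operator recreates it shortly afterwards
# with SANs for calico-api.calico-system.svc
kubectl delete secret calico-apiserver-certs -n tigera-operator
```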
This issue did not occur in our dev environment. Our dev environment is destroyed and recreated from scratch on a regular basis, so when we tested the upgrade there, it was a fresh Calico 3.30.3 installation being upgraded to 3.31.0. In our production environment, the Calico installation is several years old and has been upgraded repeatedly over time. At least once before, we had a certificate-related problem that turned out to be caused by certificates having been created differently in the past, and I'm guessing that could be the case here as well (but I could certainly be mistaken). The CA that signed the old certificate was tigera-operator-signer@xxxxxxxxxx, but after forcing the reissue it is now just tigera-operator-signer (which I point out only to show that our certificates were the older ones generated differently in the past, as also discussed on that other issue).
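The issuer can be checked the same way (again assuming the standard tls.crt key):

```sh
# Print the issuer of the certificate currently in the secret
kubectl get secret calico-apiserver-certs -n tigera-operator \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -issuer
```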
Now that our production environment is upgraded, I unfortunately can no longer reproduce this state, but I wanted to post it in case someone else runs into the same thing.
Your Environment
- Calico version: 3.31.0 open source edition
- Calico dataplane (bpf, nftables, iptables, windows etc.): iptables
- Orchestrator version (e.g. kubernetes, openshift, etc.): EKS 1.34