ROX-29674: Sync caBundle changes to ValidatingWebhookConfiguration #15706

Open
wants to merge 10 commits into
base: master

vladbologa
Contributor

@vladbologa vladbologa commented Jun 12, 2025

Description

This PR implements an Operator-side mechanism that will enable Sensor to propagate a CA bundle to the ValidatingWebhookConfiguration of admission-control in a Secured Cluster, to support CA rotation.

Context

The admission-control service acts as a Kubernetes ValidatingWebhook. The Kubernetes API server needs to trust the TLS certificate presented by admission-control, and this trust is established via the caBundle field in the ValidatingWebhookConfiguration resource.

During CA rotation, Sensor obtains new TLS certificates from Central that were signed by a new CA. These certificates are stored as Secrets that the services can access. In addition, the caBundle field of the ValidatingWebhookConfiguration needs to be updated.

Why this PR is needed

  • Sensor is not able to modify the ValidatingWebhookConfiguration resource, because it is managed by the Operator
  • additionally, to prevent downtime, the ValidatingWebhookConfiguration should learn about a new CA before admission-control starts using leaf certificates signed by it

Proposed solution

Since Sensor cannot modify ValidatingWebhookConfiguration, it will instead store the CA bundle in a ConfigMap (this is not implemented here).

The Operator will then watch this ConfigMap, and when it appears or is updated, the Operator updates the caBundle of the ValidatingWebhookConfiguration accordingly.
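As a rough illustration of the update step, the sketch below copies a bundle into every webhook entry and reports whether anything changed. The `webhook` type is a simplified stand-in for the entries of a ValidatingWebhookConfiguration, and `syncCABundle` is a hypothetical name; neither is taken from the PR.

```go
package main

import (
	"bytes"
	"fmt"
)

// webhook is a simplified stand-in for one entry in the webhooks list of a
// ValidatingWebhookConfiguration (hypothetical type, not the Kubernetes API).
type webhook struct {
	Name     string
	CABundle []byte
}

// syncCABundle sets CABundle on every webhook whose current value differs
// from bundle. It returns true if at least one entry was modified, so a
// caller can skip a no-op update.
func syncCABundle(webhooks []webhook, bundle []byte) bool {
	changed := false
	for i := range webhooks {
		if !bytes.Equal(webhooks[i].CABundle, bundle) {
			webhooks[i].CABundle = append([]byte(nil), bundle...)
			changed = true
		}
	}
	return changed
}

func main() {
	hooks := []webhook{
		{Name: "hook-a", CABundle: []byte("old-bundle")},
		{Name: "hook-b"},
	}
	fmt.Println("first sync changed something:", syncCABundle(hooks, []byte("new-bundle")))
	fmt.Println("second sync changed something:", syncCABundle(hooks, []byte("new-bundle")))
}
```

Returning a change flag keeps repeated reconciliations of an unchanged ConfigMap from issuing unnecessary writes.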

Notes

  • on its own, this PR doesn't change current behavior (unless somebody manually creates the tls-ca-bundle ConfigMap)
  • the ConfigMap will need to have an app.stackrox.io/watched-by: operator label, so that the Operator adds it to its resource cache

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Deployed in an infra cluster with make -C operator deploy-via-olm.
Installed Central and added a Secured Cluster.
Created a test ConfigMap in the namespace and verified that changes to the tls-ca-bundle ConfigMap propagate to the ValidatingWebhookConfiguration resource.

openshift-ci bot commented Jun 12, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@vladbologa vladbologa changed the title ROX-29674: Propagate CA bundle changes to the ValidatingWebhookConfiguration ROX-29674: Sync caBundle changes to ValidatingWebhookConfiguration Jun 12, 2025
@vladbologa vladbologa force-pushed the vb/admission-control-ca-bundle-reconciler branch from fd7db5a to 4abe2bb Compare June 12, 2025 12:19
@rhacs-bot
Contributor

rhacs-bot commented Jun 12, 2025

Images are ready for the commit at 35f8e5e.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.9.x-259-g35f8e5e6fb.

codecov bot commented Jun 12, 2025

Codecov Report

Attention: Patch coverage is 97.56098% with 1 line in your changes missing coverage. Please review.

Project coverage is 48.69%. Comparing base (dd48fea) to head (35f8e5e).
Report is 38 commits behind head on master.

Files with missing lines                                Patch %   Lines
...l/securedcluster/values/translation/translation.go   96.66%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #15706    +/-   ##
========================================
  Coverage   48.69%   48.69%            
========================================
  Files        2602     2605     +3     
  Lines      191502   191740   +238     
========================================
+ Hits        93249    93365   +116     
- Misses      90923    91041   +118     
- Partials     7330     7334     +4     
Flag Coverage Δ
go-unit-tests 48.69% <97.56%> (+<0.01%) ⬆️

@vladbologa vladbologa force-pushed the vb/admission-control-ca-bundle-reconciler branch 3 times, most recently from ec4110a to 5e83358 Compare June 12, 2025 16:00
@vladbologa vladbologa force-pushed the vb/admission-control-ca-bundle-reconciler branch from 5e83358 to 747a358 Compare July 3, 2025 09:54
Contributor

Caution

There are some errors in your PipelineRun template.

PipelineRun Error
quay-proxy no kind "ImageDigestMirrorSet" is registered for version "config.openshift.io/v1" in scheme "k8s.io/client-go/kubernetes/scheme/register.go:83"

@vladbologa vladbologa force-pushed the vb/admission-control-ca-bundle-reconciler branch 4 times, most recently from a9c4842 to 8787e47 Compare July 15, 2025 15:54
@vladbologa vladbologa force-pushed the vb/admission-control-ca-bundle-reconciler branch 2 times, most recently from bb05024 to 31a2e1e Compare July 15, 2025 20:29
@vladbologa vladbologa force-pushed the vb/admission-control-ca-bundle-reconciler branch from f5dfb2e to a0cc8ce Compare July 16, 2025 14:18
@vladbologa vladbologa marked this pull request as ready for review July 16, 2025 14:18
@vladbologa vladbologa requested a review from a team as a code owner July 16, 2025 14:18
@vladbologa vladbologa requested review from GrimmiMeloni and removed request for a team July 16, 2025 14:18
Contributor

@porridge porridge left a comment

How difficult would it be to change the caching to look at an additional label as an alternative to the current one? The current approach of using the same label will probably work fine but I think a separate one would reduce confusion later....

// This is needed so that the Operator can update the ValidatingWebhookConfiguration's caBundle field.
caBundle, err := t.getCABundleFromConfigMap(ctx, sc)
if err != nil {
    t.logger.Error(err, "failed to get CA bundle from ConfigMap", "configMap", securedcluster.CABundleConfigMapName)
Contributor

I think the errors should be propagated rather than ignored. Otherwise the bundle might get misconfigured on transient client.Get errors for example 🤔

Contributor Author

You're right. My thinking was that since this ConfigMap is "optional" then if it can't be read we should just use the fallback CA from the TLS secrets. But my approach might actually cause admission-control to stop working temporarily if there are transient errors.

}

func (p *CreateOrUpdateWithNamePredicate[T]) Delete(_ event.TypedDeleteEvent[T]) bool {
    return false
Contributor

Maybe it would be better to trigger on deletions too? A deletion will be processed eventually (when operator restarts), so ignoring it here seems a bit artificial.

Contributor Author

It's true that deletion will be processed eventually, but I would tend towards not implementing support in the predicate. Because deletion is not part of the workflow, I think that the current approach communicates intent better.
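For readers unfamiliar with event predicates, here is a self-contained sketch of the idea under discussion: a generic filter that lets create and update events through for one named object and drops deletions. It uses local stand-in types instead of the controller-runtime event types, and the behavior of `Update` and `Generic` is an assumption, since only the `Delete` method appears in the excerpt above.

```go
package main

import "fmt"

// named is the minimal behaviour the filter needs from an object.
type named interface{ GetName() string }

// configMap is a stand-in for a Kubernetes ConfigMap (hypothetical type).
type configMap struct{ Name string }

func (c configMap) GetName() string { return c.Name }

// createOrUpdateWithName accepts create and update events for the object with
// the configured name and rejects everything else.
type createOrUpdateWithName[T named] struct{ Name string }

// Create accepts the event only for the watched name.
func (p createOrUpdateWithName[T]) Create(obj T) bool { return obj.GetName() == p.Name }

// Update accepts the event if the new object has the watched name (assumption).
func (p createOrUpdateWithName[T]) Update(_, newObj T) bool { return newObj.GetName() == p.Name }

// Delete always rejects: deletion is not part of the rotation workflow.
func (p createOrUpdateWithName[T]) Delete(_ T) bool { return false }

// Generic always rejects (assumption; not shown in the PR excerpt).
func (p createOrUpdateWithName[T]) Generic(_ T) bool { return false }

func main() {
	p := createOrUpdateWithName[configMap]{Name: "tls-ca-bundle"}
	watched := configMap{Name: "tls-ca-bundle"}
	fmt.Println("create triggers reconcile:", p.Create(watched))
	fmt.Println("delete triggers reconcile:", p.Delete(watched))
}
```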

@vladbologa vladbologa force-pushed the vb/admission-control-ca-bundle-reconciler branch from 95aab6d to 204e840 Compare July 21, 2025 19:39

caBundlePEM, ok := configMap.Data[caBundleKey]
if !ok {
    return "", errors.Errorf("key %q not found in ConfigMap %s", caBundleKey, key)
Contributor

Was wondering if it would make sense to treat this case the same as "config map not found". 🤷

Contributor Author

I'd say no, because "not found" is a valid state (e.g. Sensor didn't create the ConfigMap yet). The ConfigMap existing but without the key is not a valid state.

Contributor

Fine. I guess I was just looking at it from a defensive programming direction (be strict in what you publish, be permissive in what you consume).
Feel free to resolve.

Contributor Author

I think it's a good point, but I would still be wary in this particular case. Being permissive here will result in falling back to the original CA, and that could actually break policy enforcement in admission-controller during CA rotation.

}{
    {
        name: "ConfigMap does not exist should return empty string without error",
        setupClient: func(t *testing.T) ctrlClient.Client {
Contributor

What about reorganizing this a bit, with the goal of making the test more concise?
Couldn't we just add something like a []*v1.ConfigMap to the test-case struct and instantiate the fake client builder with these objects pre-created in Run below (~ L1305)?

Contributor Author

Done

@@ -1222,3 +1223,197 @@ func createSecret(name string) *v1.Secret {
        },
    }
}

func TestGetCABundleFromConfigMap(t *testing.T) {
    const testNamespace = "stackrox"
Contributor

How about using a non-standard namespace to make sure that there is no implicit relying on the stackrox namespace during the fetching of the CM?

Contributor Author

I changed the namespace in this test and it works, but in the other test that I added (TestTranslateWithCABundle) it doesn't. That's because all the existing helper functions in this file are hardcoding the stackrox namespace.

Should I make it configurable? It feels like a useful change, but I don't like to mix unrelated changes in the same PR.

@vladbologa vladbologa force-pushed the vb/admission-control-ca-bundle-reconciler branch from f4f2ef0 to 204e840 Compare July 22, 2025 14:51
@vladbologa
Contributor Author

> How difficult would it be to change the caching to look at an additional label as an alternative to the current one? The current approach of using the same label will probably work fine but I think a separate one would reduce confusion later....

It doesn't seem easy. The labels.SelectorFromSet method does AND, not OR. I was thinking of implementing a custom labels.Selector, but that also doesn't seem to work. Internally, the cache calls the String method of the selector, and there's no way to write a string that does what we want and would be accepted by the k8s API. (i.e. select either of two different keys)

Something that would work is to use the same key, e.g. "app.stackrox.io/managed-by" in ("operator", "sensor") and then the Operator would also cache resources managed by Sensor. WDYT?

(there's a caveat, Sensor currently uses app.kubernetes.io/managed-by: sensor instead of app.stackrox.io/managed-by: sensor)

openshift-ci bot commented Jul 22, 2025

@vladbologa: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                      Commit   Details  Required  Rerun command
ci/prow/ocp-4-12-qa-e2e-tests  35f8e5e  link     false     /test ocp-4-12-qa-e2e-tests
ci/prow/ocp-4-19-qa-e2e-tests  35f8e5e  link     false     /test ocp-4-19-qa-e2e-tests

@porridge
Contributor

>> How difficult would it be to change the caching to look at an additional label as an alternative to the current one? The current approach of using the same label will probably work fine but I think a separate one would reduce confusion later....
>
> It doesn't seem easy. The labels.SelectorFromSet method does AND, not OR. I was thinking of implementing a custom labels.Selector, but that also doesn't seem to work. Internally, the cache calls the String method of the selector, and there's no way to write a string that does what we want and would be accepted by the k8s API. (i.e. select either of two different keys)
>
> Something that would work is to use the same key, e.g. "app.stackrox.io/managed-by" in ("operator", "sensor") and then the Operator would also cache resources managed by Sensor. WDYT?
>
> (there's a caveat, Sensor currently uses app.kubernetes.io/managed-by: sensor instead of app.stackrox.io/managed-by: sensor)

LOL, nothing is ever easy 😄 Thank you for the investigation!
How about sticking to your original approach and filing a ticket to converge to the same label key eventually?

@vladbologa
Contributor Author

>>> How difficult would it be to change the caching to look at an additional label as an alternative to the current one? The current approach of using the same label will probably work fine but I think a separate one would reduce confusion later....
>>
>> It doesn't seem easy. The labels.SelectorFromSet method does AND, not OR. I was thinking of implementing a custom labels.Selector, but that also doesn't seem to work. Internally, the cache calls the String method of the selector, and there's no way to write a string that does what we want and would be accepted by the k8s API. (i.e. select either of two different keys)
>> Something that would work is to use the same key, e.g. "app.stackrox.io/managed-by" in ("operator", "sensor") and then the Operator would also cache resources managed by Sensor. WDYT?
>> (there's a caveat, Sensor currently uses app.kubernetes.io/managed-by: sensor instead of app.stackrox.io/managed-by: sensor)
>
> LOL, nothing is ever easy 😄 Thank you for the investigation! How about sticking to your original approach and filing a ticket to converge to the same label key eventually?

How about using app.stackrox.io/managed-by: sensor (instead of the app.stackrox.io/managed-by: operator of my original approach) and also filing a ticket to consolidate the labels of Sensor?

@porridge
Contributor

> How about using app.stackrox.io/managed-by: sensor (instead of the app.stackrox.io/managed-by: operator of my original approach) and also filing a ticket to consolidate the labels of Sensor?

Sorry, I'm confused. Which component would set which labels on which resources, and what would the label selector for operator caching look like then? 🤔

@vladbologa
Contributor Author

>> How about using app.stackrox.io/managed-by: sensor (instead of the app.stackrox.io/managed-by: operator of my original approach) and also filing a ticket to consolidate the labels of Sensor?
>
> Sorry, I'm confused. Which component would set which labels on which resources, and what would the label selector for operator caching look like then? 🤔

In my initial approach, I was proposing that Sensor would put the app.stackrox.io/managed-by: operator label on the tls-ca-bundle ConfigMap, so that the Operator can cache it. Using that label is misleading though, because tls-ca-bundle is actually managed by Sensor.

So the alternative is that Sensor sets app.stackrox.io/managed-by: sensor. Then in the Operator I could change the cache selector to something like:

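req, err := labels.NewRequirement(
    "app.stackrox.io/managed-by",
    selection.In,
    []string{"operator", "sensor"},
)
if err != nil {
    // NewRequirement validates the key and values; surface the error to the caller.
    return err
}
cacheLabelSelector := labels.NewSelector().Add(*req)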

I think this should work, because it's the same label key.
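To make the AND-versus-OR point concrete, here is a standalone model of the two selector styles. It does not call k8s.io/apimachinery; `matchesAll` and `inRequirement` are hypothetical helpers that mimic an equality-based selector built from a label set and a set-based `in` requirement on a single key, respectively.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// matchesAll mimics an equality-based selector built from a label set:
// every key/value pair in selector must be present on the object (logical AND).
func matchesAll(selector, objLabels map[string]string) bool {
	for k, want := range selector {
		if got, ok := objLabels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// inRequirement mimics a set-based requirement: Key in (Values...).
type inRequirement struct {
	Key    string
	Values []string
}

// Matches reports whether the object's value for Key is one of Values.
func (r inRequirement) Matches(objLabels map[string]string) bool {
	got, ok := objLabels[r.Key]
	if !ok {
		return false
	}
	for _, v := range r.Values {
		if v == got {
			return true
		}
	}
	return false
}

// String renders the requirement in set-based label selector syntax.
func (r inRequirement) String() string {
	vals := append([]string(nil), r.Values...)
	sort.Strings(vals)
	return fmt.Sprintf("%s in (%s)", r.Key, strings.Join(vals, ","))
}

func main() {
	const key = "app.stackrox.io/managed-by"
	req := inRequirement{Key: key, Values: []string{"operator", "sensor"}}
	fmt.Println(req)
	fmt.Println("operator-managed matches:", req.Matches(map[string]string{key: "operator"}))
	fmt.Println("sensor-managed matches:", req.Matches(map[string]string{key: "sensor"}))

	// Two different keys in one label set must BOTH match, so this cannot
	// express "either operator-managed or sensor-managed".
	both := map[string]string{key: "operator", "app.kubernetes.io/managed-by": "sensor"}
	fmt.Println("AND selector matches sensor-only object:",
		matchesAll(both, map[string]string{"app.kubernetes.io/managed-by": "sensor"}))
}
```

Because the `in` form stays on a single key, it can be rendered as one selector string, which is what makes it usable for the cache, unlike an OR across two different keys.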

4 participants