Conversation

@mdbooth
Contributor

mdbooth commented May 12, 2025

This change uses Server-Side Apply to apply provider manifests in place of the logic from library-go, which uses an Update with custom kind-specific client-side merge logic. Using SSA here should be equivalent to how these manifests would be applied by any other client.

The merge logic from library-go should no longer be required as long as the manifests are not specifying any fields which will be overwritten by an existing controller.

In particular, there should be no conflict with service-ca setting caBundle in various places, as long as the specified manifests do not include a caBundle. However, note that due to a validation bug in older versions of k8s, some CRDs do still specify an empty caBundle in their CRDs. These would have to be removed for this to work.
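For reference, the SSA apply itself is a single Patch call with the controller-runtime client. Below is a minimal sketch; applyManifest is an illustrative name rather than code from this PR, but the field owner string is the one used in the change.

package example

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyManifest applies a decoded provider manifest with Server-Side Apply.
func applyManifest(ctx context.Context, c client.Client, obj *unstructured.Unstructured) error {
	// SSA sends the full intended state and records this client as the field
	// owner. ForceOwnership takes over fields previously owned by another
	// manager (for example the old client-side merge logic) instead of
	// returning a conflict error.
	return c.Patch(ctx, obj, client.Apply,
		client.ForceOwnership,
		client.FieldOwner("cluster-capi-operator.openshift.io/installer"),
	)
}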

The expected flow of reconciles between cluster-capi-operator and service-ca-operator becomes:

  • cluster-capi-operator: apply initial manifests
  • service-ca-operator: adds caBundle to CRDs and validating webhooks
  • cluster-capi-operator: triggered by the update to manifests it manages, applies the manifests again
  • The final update will produce no change as long as no managed fields have been updated, so there will be no further reconciles.

However, objects with the provider label were being reconciled too often due to insufficient filtering. This PR includes a second commit which addresses that.

mdbooth added 2 commits May 12, 2025 14:58
We were unconditionally triggering reconciles for all modifications to
any managed object. This was producing a large number of unnecessary
reconciles when the status of managed objects was updated.
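One common way to avoid those status-only reconciles is to gate Update events on changes the controller actually cares about. The predicate below is a sketch of that technique using stock controller-runtime predicates; it is not necessarily the exact filter added in this commit.

package example

import (
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// specOrMetadataChanged ignores Update events where neither the generation,
// the labels, nor the annotations changed, which is the case for pure status
// updates: the API server only bumps metadata.generation on spec changes.
// Create, Delete and Generic events still pass through.
// Note: objects without a spec/status split (e.g. ConfigMaps) never bump
// generation, so this filter would need adjusting for those kinds.
var specOrMetadataChanged = predicate.Or(
	predicate.GenerationChangedPredicate{},
	predicate.LabelChangedPredicate{},
	predicate.AnnotationChangedPredicate{},
)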
@openshift-ci
Contributor

openshift-ci bot commented May 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign theobarberbany for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@damdo
Member

damdo commented May 12, 2025

I'll try and review this one in the coming days. Also, since he's a domain expert, it'd probably be good to have a review from @JoelSpeed.

Comment on lines +273 to +278
// We reconcile all Deployment changes because we intend to reflect the
// status of any created Deployment in the ClusterOperator status.
Watches(
&appsv1.Deployment{},
handler.EnqueueRequestsFromMapFunc(toClusterOperator),
builder.WithPredicates(ownedPlatformLabelPredicate(r.ManagedNamespace, r.Platform)),
Contributor Author

I remember when we discussed this, we decided to leave it in because of functionality we intend to implement in the future. Honestly, I'd leave it out and add it back when we add something which needs it. Until then it's just adding noise.

Member

Yes. See my comment above re:
TODO: Deployments State/Conditions should influence the overall ClusterOperator Status.

I think ideally we should be looking at doing this considering the upcoming GAing of the operator.

Maybe something we can tackle in tandem with the un-revert of: #273

In this PR we should at least see if we are introducing anything that would prevent us easily doing that.

if r.Error != nil {
errs = errors.Join(errs, fmt.Errorf("error applying CAPI provider component %q at position %d: %w", r.File, i, r.Error))
for i, providerObject := range providerObjects {
err := r.Patch(ctx, providerObject, client.Apply, client.ForceOwnership, client.FieldOwner("cluster-capi-operator.openshift.io/installer"))
Contributor Author

Bike-shedding opportunity: what to call this field owner? I think this is good, but I'm just calling it out.

Should probably move the field owner to a constant somewhere.

Member

In other places we used controllerName + suffix

if err := r.Status().Patch(ctx, m, util.ApplyConfigPatch(ac), client.ForceOwnership, client.FieldOwner(controllerName+"-AuthoritativeAPI")); err != nil {
return fmt.Errorf("failed to patch Machine API machine status with authoritativeAPI %q: %w", authority, err)
}
for these
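A sketch of what moving the owner to a constant could look like, following the controllerName + suffix pattern quoted above (the names and the value here are illustrative, not existing code):

// Illustrative only: a package-level field owner derived from the controller
// name, following the controllerName + suffix convention used elsewhere.
const (
	controllerName      = "cluster-capi-operator"
	installerFieldOwner = controllerName + "-installer"
)

// Usage (assuming the controller-runtime client package is imported as client):
// err := r.Patch(ctx, providerObject, client.Apply, client.ForceOwnership, client.FieldOwner(installerFieldOwner))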

@mdbooth
Contributor Author

mdbooth commented May 12, 2025

/test e2e-openstack-capi-techpreview

UpdateFunc: func(e event.UpdateEvent) bool { return isClusterOperator(e.ObjectNew) },
GenericFunc: func(e event.GenericEvent) bool { return isClusterOperator(e.Object) },
DeleteFunc: func(e event.DeleteEvent) bool { return isClusterOperator(e.Object) },
CreateFunc: func(e event.CreateEvent) bool { return isClusterOperator(e.Object) },
Member

Genuine question: is there anything in the ClusterOperator we might be interested in reconciling at the moment?
Like anything in the status/conditions?


return fmt.Errorf("error parsing CAPI provider deployment manifets %q: %w", d, err)
}

// TODO: Deployments State/Conditions should influence the overall ClusterOperator Status.
Member

This is still something we should be looking at doing. At the moment we are blind on this front: if the core CAPI / provider CAPI Deployments fail or are not Available and/or Ready, we still report Available rather than Degraded.
Ideally we should keep an eye on them (we already set up watches for these) and set Degraded/Not Degraded (?) accordingly (i.e. setDegradedCondition()).
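A sketch of the kind of check that could feed into setting Degraded (the helper name is assumed, not existing code; the actual wiring into setDegradedCondition() would live in the ClusterOperator status logic):

package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentAvailable reports whether a watched CAPI Deployment currently has
// its Available condition set to True. A reconciler could run this over the
// core and provider Deployments and mark the ClusterOperator Degraded when
// any of them is unavailable.
func deploymentAvailable(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentAvailable {
			return cond.Status == corev1.ConditionTrue
		}
	}
	// No Available condition yet, e.g. a freshly created Deployment.
	return false
}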


if r.Error != nil {
errs = errors.Join(errs, fmt.Errorf("error applying CAPI provider component %q at position %d: %w", r.File, i, r.Error))
for i, providerObject := range providerObjects {
err := r.Patch(ctx, providerObject, client.Apply, client.ForceOwnership, client.FieldOwner("cluster-capi-operator.openshift.io/installer"))
Member

Is there any obvious difference that we need to be aware of or account for between r.Patch with client.Apply (SSA) and resourceapply.ApplyDeployment?

Contributor

I had the same question

Contributor Author

I looked at it at the time, and convinced myself the answer is no. I could do that again...

However, intuitively this is fine. Consider that this is how all client tooling applies this. Deployments do not require special consideration. Perhaps they once did.


for i, providerObject := range providerObjects {
err := r.Patch(ctx, providerObject, client.Apply, client.ForceOwnership, client.FieldOwner("cluster-capi-operator.openshift.io/installer"))
if err != nil {
gvk := providerObject.GroupVersionKind()
Contributor

Why not use the GVK string function and then join that to the name?

err := r.Patch(ctx, providerObject, client.Apply, client.ForceOwnership, client.FieldOwner("cluster-capi-operator.openshift.io/installer"))
if err != nil {
gvk := providerObject.GroupVersionKind()
name := strings.Join([]string{gvk.Group, gvk.Version, gvk.Kind, providerObject.GetName()}, "/")
Contributor

Does the namespace matter?
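For what it's worth, a variant that addresses both questions, using the GVK's String() form and the namespaced name (illustrative only; assumes the controller-runtime client package is imported as client):

// Illustrative alternative: GVK string plus namespace/name, so namespaced
// objects are unambiguous in the error message.
gvk := providerObject.GroupVersionKind()
name := gvk.String() + " " + client.ObjectKeyFromObject(providerObject).String()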

Comment on lines +35 to +38
// We only want to be reconciled on creation of the cluster operator,
// because we wait for it before reconciling. The Create event also fires
// when the manager is started, so this will additionally ensure we are
// called at least once at startup.
Contributor

I thought it was a Generic event triggered by the List, not a Create?
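For context, controller-runtime's informer-backed sources replay existing objects as Create events when the manager's cache first syncs, not as Generic events, so a create-only predicate still fires once for a pre-existing ClusterOperator at startup. A self-contained sketch follows; isClusterOperator is stubbed here so the example compiles, and the "cluster-api" name is an assumption for illustration.

package example

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// isClusterOperator stands in for the helper used in the PR.
func isClusterOperator(obj client.Object) bool {
	return obj.GetName() == "cluster-api"
}

// createOnly drops Update, Delete and Generic events. The initial cache sync
// delivers existing objects as Create events, so a ClusterOperator that
// already exists still triggers one reconcile when the manager starts.
var createOnly = predicate.Funcs{
	CreateFunc:  func(e event.CreateEvent) bool { return isClusterOperator(e.Object) },
	UpdateFunc:  func(event.UpdateEvent) bool { return false },
	DeleteFunc:  func(event.DeleteEvent) bool { return false },
	GenericFunc: func(event.GenericEvent) bool { return false },
}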

@openshift-ci
Contributor

openshift-ci bot commented Jul 10, 2025

@mdbooth: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                  Commit   Details  Required  Rerun command
ci/prow/e2e-aws-ovn                        37d52c9  link     true      /test e2e-aws-ovn
ci/prow/okd-scos-e2e-aws-ovn               37d52c9  link     false     /test okd-scos-e2e-aws-ovn
ci/prow/e2e-azure-ovn-techpreview          37d52c9  link     false     /test e2e-azure-ovn-techpreview
ci/prow/e2e-aws-ovn-techpreview-upgrade    37d52c9  link     true      /test e2e-aws-ovn-techpreview-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Oct 9, 2025
@damdo
Member

damdo commented Oct 9, 2025

/remove-lifecycle stale

openshift-ci bot removed the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Oct 9, 2025