@youngbupark

  • This is the initial draft for supporting a rollout plugin in the ManifestWorkReplicaSet work controller.
  • Note: rollback will be added when we propose the MWRS automatic rollback enhancement.

@openshift-ci

openshift-ci bot commented Oct 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: youngbupark
Once this PR has been reviewed and has the lgtm label, please assign qiujian16 for approval. For more information see the Code Review Process.

The following service defines the contract between the work controller and the plugin. Each call must be idempotent, stateless, and time-bounded (≤30 s) to ensure consistent controller reconciliation. The plugin server must implement the following APIs. Helpers for implementing servers and clients will be provided in the [ocm/sdk-go](https://github.com/open-cluster-management-io/sdk-go) repository.

```proto
// RolloutPluginService is the service for the rollout plugin.
```
Author

Here is the initial commit of the gRPC server proto: open-cluster-management-io/sdk-go#154

Note: The implementation can change as we develop.

@youngbupark changed the title from "KEP: ManifestWorkReplicaSet Rollout Plugin" to "ManifestWorkReplicaSet Rollout Plugin" on Oct 28, 2025
@youngbupark marked this pull request as ready for review on October 29, 2025 02:00
@openshift-ci bot requested review from deads2k and qiujian16 on October 29, 2025 02:00
// RolloutPluginService is the service for the rollout plugin.
service RolloutPluginService {
  // Initialize initializes the plugin.
  rpc Initialize(InitializeRequest) returns (InitializeResponse);
Member

I think we will need some clarification on error handling. What happens when a specific call fails? How would an MWRS consumer know about the failure and debug it?

Author

Good call. I will add an error handling section.

  // If the validation has completed successfully, the plugin should return an OK result.
  // If the validation is still in progress, the plugin should return an INPROGRESS result.
  // If the validation has failed, the plugin should return a FAILED result.
  rpc ValidateRollout(RolloutPluginRequest) returns (ValidateResponse);
Member

When will this be called in the MWRS reconciler? I think a flow showing when these APIs are called in the MWRS controller would be helpful.

Author

This doc includes a mermaid sequence diagram that describes when each hook is called. Please take a look.

  // BeginRollout is called before the manifestwork resource is applied.
  // It is used to prepare the rollout.
  rpc BeginRollout(RolloutPluginRequest) returns (google.protobuf.Empty);
Member

Does this mean any spec change on the ManifestWork will trigger this? What if the placement changes but the ManifestWork spec in the MWRS does not?

Author

BeginRollout is called before the ManifestWork is created in each cluster (please see the mermaid sequence diagram). It is called whenever MWRS creates or updates a ManifestWork in a cluster namespace.

- rolloutStatus: The current [cluster rollout status](https://github.com/open-cluster-management-io/sdk-go/blob/main/pkg/apis/cluster/v1alpha1/rollout.go#L23-L39) (e.g., ToApply, Progressing, Succeeded, Failed, TimeOut, Skip).
- manifestRevisionName: The name of the manifest revision applied to the cluster.

### Configure custom plugin for work controller
Member

@haoqing0110 haoqing0110 Oct 30, 2025

Once a plugin is enabled, what's the behavior of a normal MWRS rollout that does not need to call any plugin?

Author

> Once a plugin is enabled, what's the behavior of a normal MWRS rollout that does not need to call any plugin?

The plugin will be applied to all MWRS. @haoqing0110, do you think it would be better to make it opt-in?

workDriver: kube
# plugin configuration
plugins:
- name: my-rollout
Author

@youngbupark youngbupark Nov 18, 2025

@qiujian16 @haoqing0110 I agree to use a standalone service rather than the sidecar model. Here is the new cluster-manager resource model for registering plugins. It supports multiple plugins, and users can select the plugin they need.

Member

looks good

  // ProgressRollout is called after the manifestwork is applied.
  // This method is called whenever the feedback is updated.
  // The plugin can execute the rollout logic based on the feedback status changes.
  rpc ProgressRollout(RolloutPluginRequest) returns (google.protobuf.Empty);
Author

@annelaucg I removed the rollback-specific operation as we discussed yesterday. It is much simpler now.

workDriver: kube
# plugin configuration
plugins:
- name: my-rollout
Member

looks good

# Optional. secretRef is the reference to the CA secret.
secretRef:
  name: my-rollout-ca
  namespace: default
Member

The namespace field might not be needed. The secret has to be in the open-cluster-management-hub namespace.
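
As a rough illustration of what the controller could do with that secret, the Go sketch below builds a TLS configuration from a secret's data map, with the namespace fixed to the hub namespace as suggested here. The `ca.crt` key name, the `hubNamespace` constant, and `tlsConfigFromSecret` are assumptions for the example and are not defined by the proposal.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

const (
	// hubNamespace is where the reviewer expects the secret to live.
	hubNamespace = "open-cluster-management-hub"
	// caKey is an assumed key name inside the secret data.
	caKey = "ca.crt"
)

// tlsConfigFromSecret builds a client TLS config that trusts the CA bundle
// stored in the secret data (as it would appear in a Secret's Data field).
func tlsConfigFromSecret(data map[string][]byte) (*tls.Config, error) {
	bundle, ok := data[caKey]
	if !ok {
		return nil, fmt.Errorf("secret in %s has no %q key", hubNamespace, caKey)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(bundle) {
		return nil, errors.New("no valid PEM certificates found in CA bundle")
	}
	return &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12}, nil
}

func main() {
	_, err := tlsConfigFromSecret(map[string][]byte{caKey: []byte("not a certificate")})
	fmt.Println("error:", err)
}
```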

Work->>Work: Create rollout handler
Work->>Work: Find rollout/removed/timeout candidate clusters (RolloutResult)
alt timeout clusters exist and .spec.placementRefs[*].rolloutStrategy.abortOnFailure is true
Note over Work, PluginServer: Start automatic abort
Contributor

We also need to set abort when the status is Degraded or ValidateFailed.

So this would also need to be set during the ProgressRollout and ValidateRollout funcs, is that right?

Author

No, this will be set by the MWRS controller. Please see the proposal in #164.
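
The abort condition in the sequence diagram quoted above (timed-out clusters exist and `abortOnFailure` is true) could be evaluated on the controller side roughly as in this Go sketch. The `clusterStatus` type and `shouldAbort` are hypothetical; the sketch covers only the timeout case from the diagram and makes no assumption about how Degraded or ValidateFailed are handled in #164.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterStatus is an illustrative local type; the names follow the
// rollout statuses listed in the proposal.
type clusterStatus string

const (
	statusProgressing clusterStatus = "Progressing"
	statusSucceeded   clusterStatus = "Succeeded"
	statusFailed      clusterStatus = "Failed"
	statusTimeOut     clusterStatus = "TimeOut"
)

// shouldAbort reports whether an automatic abort would start and which
// clusters triggered it. Only timed-out clusters are considered here.
func shouldAbort(abortOnFailure bool, statuses map[string]clusterStatus) (bool, []string) {
	if !abortOnFailure {
		return false, nil
	}
	var timedOut []string
	for name, s := range statuses {
		if s == statusTimeOut {
			timedOut = append(timedOut, name)
		}
	}
	sort.Strings(timedOut) // map iteration order is random; sort for stable output
	return len(timedOut) > 0, timedOut
}

func main() {
	abort, clusters := shouldAbort(true, map[string]clusterStatus{
		"cluster1": statusSucceeded,
		"cluster2": statusTimeOut,
	})
	fmt.Println(abort, clusters)
}
```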

rolloutStrategy:
type: Progressive
# plugin is optional.
plugin: my-rollout
Contributor

Do we need any failsafe if the plugin that the user adds causes issues? Or is that dependent on the user?

Author

Can you please elaborate on what you mean by failsafe? In general, the user should be able to recognize the problem. Using a plugin is an opt-in feature. MWRS should show the error in its status if the plugin returns an error or is unavailable. It might be better to stop the reconcile loop than to try to self-resolve the problem.

namespace: {{ .ClusterManagerNamespace }}
data:
config.yaml: |
plugins:
Author

@qiujian16 can you please review the ConfigMap? I wonder if we need to define plugins in a separate plugin config file.

#### Error handling

The plugin APIs use the [standard gRPC status codes](https://grpc.github.io/grpc/core/md_doc_statuscodes.html): 0 = OK, 1 = CANCELLED, 2 = UNKNOWN, 3 = INVALID_ARGUMENT, 4 = DEADLINE_EXCEEDED, etc. The work controller will also use the [standard gRPC retry](https://grpc.io/docs/guides/retry/) mechanism for the `UNAVAILABLE` status code.
When the work controller fails to call a plugin API, the failure reason from the plugin server is shown in the message of the `PluginLoaded` status condition for debugging purposes.
Author

@qiujian16 Generally, I would like to show the error message in the PluginLoaded status condition, but for errors that occur while executing a rollout, the Progressing status condition might be a better place than PluginLoaded. cc @annelaucg @qiujian16
