# ClusterSet-Level Manifest Overrides for ManifestWorkReplicaSet

## Release Signoff Checklist

- [ ] Enhancement is `provisional`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/)

## Summary

This proposal introduces clusterSet-level manifest override capabilities to ManifestWorkReplicaSet (MWRS), enabling users to customize manifests for specific clusterSets while maintaining the progressive rollout and scalability benefits of MWRS.

## Motivation

### Current Limitations

ManifestWorkReplicaSet currently uses a simple template-based approach where the same `ManifestWorkTemplate` is applied to all selected clusters. This creates limitations when:

1. **Environment-specific configurations** are needed (e.g., different replica counts for prod vs staging clusterSets)
2. **ClusterSet-specific resource requirements** vary (e.g., different CPU/memory limits based on clusterSet capacity profiles)
3. **Per-clusterSet customizations** are required (e.g., different images, environment variables, or labels)
4. **Namespace variations** exist across clusterSets (e.g., different naming conventions)

### Use Cases

- **Multi-environment deployments**: Deploy the same application with environment-specific configurations (prod clusterSet: 5 replicas, staging clusterSet: 1 replica)
- **ClusterSet capacity optimization**: Adjust resource requests/limits based on clusterSet capacity profiles
- **Compliance requirements**: Apply different security contexts or labels based on clusterSet compliance levels
- **Regional customizations**: Use different images or configurations for different geographical clusterSets

> **Review comment (Member):** Another use case would be different configurations in different cloud providers.

## Proposal

### API Design

Extend the `ManifestWorkReplicaSetSpec` to include clusterSet-level overrides:

```go
package v1alpha1

import (
	"k8s.io/apimachinery/pkg/runtime"

	work "open-cluster-management.io/api/work/v1"
)

type ManifestWorkReplicaSetSpec struct {
	// Existing fields
	ManifestWorkTemplate  work.ManifestWorkSpec     `json:"manifestWorkTemplate"`
	PlacementRefs         []LocalPlacementReference `json:"placementRefs"`
	CascadeDeletionPolicy CascadeDeletionPolicy     `json:"cascadeDeletionPolicy,omitempty"`

	// New field for clusterSet-level overrides
	ClusterSetOverrides []ClusterSetManifestOverride `json:"clusterSetOverrides,omitempty"`
}

type ClusterSetManifestOverride struct {
	// Target clusterSet name
	ClusterSetName string `json:"clusterSetName"`

	// Manifest-specific overrides
	ManifestOverrides []ManifestOverride `json:"manifestOverrides,omitempty"`
}

type ManifestOverride struct {
	// Identifies the target manifest
	ResourceIdentifier ResourceIdentifier `json:"resourceIdentifier"`

	// Override specification
	Override OverrideSpec `json:"override"`
}

type ResourceIdentifier struct {
	APIVersion string `json:"apiVersion,omitempty"`
	Kind       string `json:"kind,omitempty"`
	Name       string `json:"name,omitempty"`
	Namespace  string `json:"namespace,omitempty"`
}

type OverrideSpec struct {
	// Metadata overrides (name, namespace, labels, annotations)
	MetadataOverride *MetadataOverride `json:"metadataOverride,omitempty"`

	// JSON patch operations for spec modifications
	SpecOverride []JSONPatch `json:"specOverride,omitempty"`

	// Complete manifest replacement
	ReplaceManifest *runtime.RawExtension `json:"replaceManifest,omitempty"`
}
```

> **Review thread on `ClusterSetOverrides`**
>
> **Member:** I would name it `Override`, so that users can choose to override by clusterSet, by placement, or per cluster. Try to make it extensible enough.
>
> **Author:** Sounds good. In that case, we'll need logic that iterates from the coarse-grained selector to the fine-grained selector while creating ManifestWorks, since the same resource can be selected by two selectors: Placement Override -> ClusterSet Override -> Cluster Override.
>
> **Member:** Maybe only separate by type, e.g.:
>
>     overrides:
>       overrideStrategy: clustersets
>       clustersets:
>       - name: clusterset1
>         patch: .....
>       - name: clusterset2
>
> and we can also choose placements:
>
>     overrides:
>       overrideStrategy: placements
>       placements:
>       - name: aws-placement
>         patch: .....
>       - name: gke-placement

> **Review thread on `ResourceIdentifier`**
>
> **Member:** APIGroup/Version/Resource is preferred. We can reuse the existing `ResourceIdentifier`.
>
> **Author:** Sure, will use the existing resource identifier.

> **Review thread on `MetadataOverride`**
>
> **Member:** What is this for? Why can't it be part of the general override?
>
> **Author:** I initially thought of allowing template-based overrides in `metadataOverride`, such as `labels.cluster: {{clusterName}}`.
>
> **Member:** I do not think we need to treat spec and metadata differently.

> **Review thread on `SpecOverride`**
>
> **Member:** I would make it more extensible; someone might want to use JSON merge patch as well.
>
> **Author:** Cool, I will introduce Replace and StrategicMerge as two possible patching strategies.
>
> **Member:** StrategicMerge will be tricky... we can start with JSONPatch/JSONMerge/Replace.

### Override Types

1. **Metadata Overrides**: Modify resource metadata (name, namespace, labels, annotations)
2. **Spec Overrides**: Use JSON patches to modify any field in the resource spec
3. **Complete Replacement**: Replace the entire manifest for specific clusterSets
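A minimal sketch of the first override type acting on a manifest held as an unstructured map. It assumes, following the author's answer in review on the `metadataOverride.labels` example, that a label or annotation map in the override replaces the existing map rather than merging into it. The `metadataOverride` type and `applyMetadataOverride` function are hypothetical stand-ins, not part of any existing API.

```go
package main

import "fmt"

// metadataOverride is a hypothetical stand-in for the proposal's MetadataOverride type.
type metadataOverride struct {
	Labels      map[string]string
	Annotations map[string]string
}

// toAnyMap converts a string map to the map[string]any shape used by unstructured objects.
func toAnyMap(in map[string]string) map[string]any {
	out := make(map[string]any, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// applyMetadataOverride replaces metadata.labels and metadata.annotations wholesale
// when the override sets them (replace, not merge); nil maps leave the field untouched.
func applyMetadataOverride(obj map[string]any, o metadataOverride) {
	meta, ok := obj["metadata"].(map[string]any)
	if !ok {
		meta = map[string]any{}
		obj["metadata"] = meta
	}
	if o.Labels != nil {
		meta["labels"] = toAnyMap(o.Labels)
	}
	if o.Annotations != nil {
		meta["annotations"] = toAnyMap(o.Annotations)
	}
}

func main() {
	deployment := map[string]any{
		"apiVersion": "apps/v1",
		"kind":       "Deployment",
		"metadata": map[string]any{
			"name":   "web-app",
			"labels": map[string]any{"app": "web-app"},
		},
	}
	applyMetadataOverride(deployment, metadataOverride{
		Labels: map[string]string{"environment": "production", "tier": "high-performance"},
	})
	// Under replace semantics the original app=web-app label is gone.
	fmt.Println(deployment["metadata"].(map[string]any)["labels"])
}
```

A merge strategy would have kept `app: web-app`; choosing between the two is exactly the question raised in review on this field.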

### Implementation Strategy

#### Phase 1: Core Override Engine
- Implement override application logic in the controller
- Add resource identification and matching capabilities
- Support metadata and JSON patch overrides
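The Phase 1 matching step could compare a `ResourceIdentifier` field by field against each template manifest. This sketch keeps the proposal's current `apiVersion`/`kind`/`name`/`namespace` fields (reviewers have asked to switch to the existing group/version/resource identifier, which would change what is compared) and assumes an empty field matches anything. Under that assumption an all-empty identifier would target every manifest, so validation would need to reject it. `resourceIdentifier`, `stringField`, and `matches` are hypothetical names.

```go
package main

import "fmt"

// resourceIdentifier mirrors the fields of the proposal's ResourceIdentifier.
type resourceIdentifier struct {
	APIVersion, Kind, Name, Namespace string
}

// fieldMatches treats an empty identifier field as "match anything" (an assumption).
func fieldMatches(want, got string) bool {
	return want == "" || want == got
}

// stringField reads a string at a nested key path of an unstructured manifest,
// returning "" if any step is missing or has the wrong type.
func stringField(m map[string]any, keys ...string) string {
	var cur any = m
	for _, k := range keys {
		obj, ok := cur.(map[string]any)
		if !ok {
			return ""
		}
		cur = obj[k]
	}
	s, _ := cur.(string)
	return s
}

// matches reports whether manifest is targeted by id.
func matches(id resourceIdentifier, manifest map[string]any) bool {
	return fieldMatches(id.APIVersion, stringField(manifest, "apiVersion")) &&
		fieldMatches(id.Kind, stringField(manifest, "kind")) &&
		fieldMatches(id.Name, stringField(manifest, "metadata", "name")) &&
		fieldMatches(id.Namespace, stringField(manifest, "metadata", "namespace"))
}

func main() {
	deployment := map[string]any{
		"apiVersion": "apps/v1",
		"kind":       "Deployment",
		"metadata":   map[string]any{"name": "web-app", "namespace": "default"},
	}
	id := resourceIdentifier{APIVersion: "apps/v1", Kind: "Deployment", Name: "web-app"}
	fmt.Println(matches(id, deployment))
}
```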

#### Phase 2: Enhanced Features
- Add validation webhooks for override syntax
- Implement complete manifest replacement
- Add comprehensive test coverage

#### Phase 3: Advanced Capabilities
- Support for conditional overrides based on cluster labels
- Template-based overrides with cluster context variables
- Override inheritance and composition
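Template-based overrides with cluster context variables could be prototyped with Go's standard `text/template` package. In review, the author floated a placeholder like `{{clusterName}}`; in `text/template` syntax that would be written `{{ .ClusterName }}`. The `clusterContext` type and the variables it exposes are hypothetical, since the proposal does not yet define them.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// clusterContext is a hypothetical set of variables exposed to override templates.
type clusterContext struct {
	ClusterName    string
	ClusterSetName string
}

// renderOverride expands placeholders in one override value. Referencing a field
// that clusterContext lacks makes Execute return an error.
func renderOverride(tmpl string, ctx clusterContext) (string, error) {
	t, err := template.New("override").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := t.Execute(&b, ctx); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	out, err := renderOverride("{{ .ClusterSetName }}-{{ .ClusterName }}",
		clusterContext{ClusterName: "cluster1", ClusterSetName: "prod-clusterset"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Failing on an unknown variable such as `{{ .Region }}` lets the controller report the bad override instead of producing a half-rendered manifest.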

### Example Usage

```yaml
apiVersion: work.open-cluster-management.io/v1alpha1
kind: ManifestWorkReplicaSet
metadata:
  name: web-app-deployment
  namespace: default
spec:
  manifestWorkTemplate:
    workload:
      manifests:
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: web-app
          namespace: default
        spec:
          replicas: 2
          selector:
            matchLabels:
              app: web-app
          template:
            metadata:
              labels:
                app: web-app
            spec:
              containers:
              - name: web
                image: nginx:1.20
                resources:
                  requests:
                    cpu: 100m
                    memory: 128Mi

  clusterSetOverrides:
  - clusterSetName: "prod-clusterset"
    manifestOverrides:
    - resourceIdentifier:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
      override:
        metadataOverride:
          labels:
            environment: production
            tier: high-performance
        specOverride:
        - op: replace
          path: /spec/replicas
          value: 5
        - op: replace
          path: /spec/template/spec/containers/0/resources/requests/cpu
          value: 500m
        - op: replace
          path: /spec/template/spec/containers/0/resources/requests/memory
          value: 512Mi

  - clusterSetName: "staging-clusterset"
    manifestOverrides:
    - resourceIdentifier:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
      override:
        metadataOverride:
          labels:
            environment: staging
            tier: standard
        specOverride:
        - op: replace
          path: /spec/replicas
          value: 1
        - op: replace
          path: /spec/template/spec/containers/0/resources/requests/cpu
          value: 50m

  placementRefs:
  - name: web-app-placement
    rolloutStrategy:
      type: Progressive
      progressive:
        minSuccessTime: 2m
        progressDeadline: 10m
```

> **Review thread on `metadataOverride.labels`**
>
> **Member:** Is this part a merge or a replace operation?
>
> **Author:** Replace.

> **Review thread on `specOverride` values**
>
> **Member:** What type should the value be: a number or a string?

### JSON Patch Examples

The `specOverride` field uses RFC 6902 JSON Patch operations. Here are common patch patterns:

#### Replace Operations
```yaml
specOverride:
- op: replace
  path: /spec/replicas
  value: 3
- op: replace
  path: /spec/template/spec/containers/0/image
  value: nginx:1.21
```

#### Add Operations
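```yaml
specOverride:
- op: add
  path: /spec/template/spec/containers/0/env
  value:
  - name: ENVIRONMENT
    value: production
- op: add
  path: /spec/template/spec/containers/0/volumeMounts
  value:
  - name: config-volume
    mountPath: /etc/config
```

Because `add` on an existing member replaces its value (RFC 6902, section 4.1), the first operation overwrites any `env` list already present; to append a single variable, use the path `/spec/template/spec/containers/0/env/-` instead. The `volumeMounts` entry must also reference a volume of the same name under `/spec/template/spec/volumes`, or the Deployment will be rejected by API validation.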

#### Remove Operations
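```yaml
specOverride:
- op: remove
  path: /spec/template/spec/containers/0/resources/limits
- op: remove
  path: /spec/template/spec/containers/0/env/1
```

RFC 6902 requires the target location of a `remove` to exist, so removing `limits` from a container that defines none (such as the `web` container in the example above) makes the whole patch fail.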

#### Complex Nested Operations
```yaml
specOverride:
- op: replace
  path: /spec/template/spec/containers/0/resources
  value:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
- op: add
  path: /spec/template/spec/securityContext
  value:
    runAsNonRoot: true
    runAsUser: 1000
```
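The proposal plans to use `github.com/evanphx/json-patch/v5` for these operations. To make the path semantics above concrete without that dependency, the sketch below hand-rolls a small subset of RFC 6902 (`add`, `replace`, `remove`, with RFC 6901 pointer unescaping and the `-` array index) over decoded JSON. It is illustrative only: it omits `move`, `copy`, and `test`, accepts non-canonical indices such as `01`, and is not meant to replace the library. `op`, `apply`, and `applyPatch` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// op is one RFC 6902 operation; this sketch supports only add, replace and remove.
type op struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value any    `json:"value"`
}

// apply performs o at toks below node and returns the updated node; callers store it
// back because inserting into or removing from a slice can reallocate it.
func apply(node any, toks []string, o op) (any, error) {
	key, last := toks[0], len(toks) == 1
	switch n := node.(type) {
	case map[string]any:
		child, exists := n[key]
		switch {
		case !exists && !(last && o.Op == "add"):
			return nil, fmt.Errorf("%q not found", key)
		case !last:
			updated, err := apply(child, toks[1:], o)
			if err != nil {
				return nil, err
			}
			n[key] = updated
		case o.Op == "remove":
			delete(n, key)
		default: // add or replace; add on an existing member replaces its value
			n[key] = o.Value
		}
		return n, nil
	case []any:
		if last && o.Op == "add" && key == "-" { // "-" appends to the array
			return append(n, o.Value), nil
		}
		i, err := strconv.Atoi(key)
		if err != nil || i < 0 || i > len(n) || (i == len(n) && !(last && o.Op == "add")) {
			return nil, fmt.Errorf("invalid array index %q", key)
		}
		switch {
		case !last:
			updated, err := apply(n[i], toks[1:], o)
			if err != nil {
				return nil, err
			}
			n[i] = updated
			return n, nil
		case o.Op == "add": // insert before index i
			return append(n[:i], append([]any{o.Value}, n[i:]...)...), nil
		case o.Op == "remove":
			return append(n[:i], n[i+1:]...), nil
		default: // replace
			n[i] = o.Value
			return n, nil
		}
	}
	return nil, fmt.Errorf("cannot descend into %q", key)
}

// applyPatch applies ops in order to a JSON document and returns the re-encoded result.
func applyPatch(doc, patch []byte) ([]byte, error) {
	var root any
	if err := json.Unmarshal(doc, &root); err != nil {
		return nil, err
	}
	var ops []op
	if err := json.Unmarshal(patch, &ops); err != nil {
		return nil, err
	}
	for _, o := range ops {
		if (o.Op != "add" && o.Op != "replace" && o.Op != "remove") || !strings.HasPrefix(o.Path, "/") {
			return nil, fmt.Errorf("unsupported op %q or invalid path %q", o.Op, o.Path)
		}
		toks := strings.Split(o.Path[1:], "/")
		for i, t := range toks { // RFC 6901: unescape "~1" before "~0"
			toks[i] = strings.ReplaceAll(strings.ReplaceAll(t, "~1", "/"), "~0", "~")
		}
		var err error
		if root, err = apply(root, toks, o); err != nil {
			return nil, err
		}
	}
	return json.Marshal(root)
}

func main() {
	out, err := applyPatch([]byte(`{"spec":{"replicas":2}}`), []byte(`[{"op":"replace","path":"/spec/replicas","value":5}]`))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```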

## Design Principles

### 1. Backward Compatibility
- Existing ManifestWorkReplicaSets continue to work unchanged
- New override field is optional
- No breaking changes to current API

### 2. Progressive Rollout Preservation
- All existing rollout strategies work with clusterSet overrides
- Override application happens during ManifestWork creation based on cluster's clusterSet membership
- Rollout behavior remains consistent across clusterSets

### 3. Flexibility and Power
- Support multiple override types (metadata, JSON patches, replacement)
- Precise resource targeting using ResourceIdentifier
- ClusterSet-based grouping provides logical organization
- Extensible design for future enhancements

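### 4. Performance Considerations
- Overrides are applied in the hub controller when each ManifestWork is generated, not on the managed cluster when the manifests are applied
- Manifests are matched to overrides using structured identifiers rather than free-form selectors
- Grouping by clusterSet keeps the number of override entries small compared with per-cluster overrides
- The additional processing is expected to have minimal impact on controller performance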

## Implementation Details

### Controller Changes

1. **Enhanced CreateManifestWork Function**
   ```go
   func CreateManifestWorkWithOverrides(
   	mwrSet *workapiv1alpha1.ManifestWorkReplicaSet,
   	clusterName string,
   	clusterSetName string,
   	placementRefName string,
   ) (*workv1.ManifestWork, error)
   ```

2. **ClusterSet Resolution Logic**
   - Determine the cluster's clusterSet membership
   - Match the clusterSet name against the override specifications
   - Handle clusters that belong to multiple clusterSets

3. **Override Application Logic**
   - Resource identification and matching
   - Metadata override application
   - JSON patch processing
   - Manifest replacement handling

4. **Validation and Error Handling**
   - Validate override syntax
   - Validate clusterSet references
   - Handle missing target resources gracefully
   - Provide clear error messages for invalid overrides
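Putting the controller steps together, generating a ManifestWork for one cluster would resolve the cluster's clusterSets, select the matching `clusterSetOverrides` entries, and apply them to a copy of the template so other clusters still see the original. The sketch below uses simplified, hypothetical types that model only a replicas override, to keep the focus on ordering. Unlike the `CreateManifestWorkWithOverrides` signature above, it takes a slice of clusterSet names, because a cluster can be selected by more than one clusterSet (for example, an exclusive clusterSet and a label-selector clusterSet such as OCM's `global` set). It assumes overrides are applied in ascending clusterSet-name order, so results are deterministic and the lexically last clusterSet wins on conflict; the proposal does not yet pick a rule.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterSetOverride is a hypothetical simplification of ClusterSetManifestOverride.
type clusterSetOverride struct {
	ClusterSetName string
	ManifestName   string
	Replicas       int
}

// manifest is a hypothetical stand-in for one manifest in the ManifestWork template.
type manifest struct {
	Name     string
	Replicas int
}

// overridesFor keeps the overrides for clusterSets the cluster belongs to,
// sorted by clusterSet name so multi-clusterSet clusters get a stable result.
func overridesFor(clusterSets []string, all []clusterSetOverride) []clusterSetOverride {
	member := make(map[string]bool, len(clusterSets))
	for _, cs := range clusterSets {
		member[cs] = true
	}
	var out []clusterSetOverride
	for _, o := range all {
		if member[o.ClusterSetName] {
			out = append(out, o)
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].ClusterSetName < out[j].ClusterSetName })
	return out
}

// renderForCluster copies the template and applies the selected overrides in order.
func renderForCluster(template []manifest, clusterSets []string, all []clusterSetOverride) []manifest {
	// manifest holds only value fields, so copying the slice isolates the template.
	out := append([]manifest(nil), template...)
	for _, o := range overridesFor(clusterSets, all) {
		for i := range out {
			if out[i].Name == o.ManifestName {
				out[i].Replicas = o.Replicas
			}
		}
	}
	return out
}

func main() {
	template := []manifest{{Name: "web-app", Replicas: 2}}
	overrides := []clusterSetOverride{
		{ClusterSetName: "prod-clusterset", ManifestName: "web-app", Replicas: 5},
		{ClusterSetName: "staging-clusterset", ManifestName: "web-app", Replicas: 1},
	}
	for _, sets := range [][]string{{"prod-clusterset"}, {"staging-clusterset"}, {"dev-clusterset"}} {
		fmt.Println(sets, renderForCluster(template, sets, overrides)[0].Replicas)
	}
}
```

Sorting by name is only one possible tie-break; the coarse-to-fine order discussed in review (placement, then clusterSet, then cluster) would slot in at the same point.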

### Dependencies

- **JSON Patch Library**: `github.com/evanphx/json-patch/v5` for spec overrides
- **Unstructured Objects**: Enhanced use of `k8s.io/apimachinery/pkg/apis/meta/v1/unstructured`

## Testing Strategy

### Unit Tests
- Override application logic
- Resource identification and matching
- Error handling scenarios

### Integration Tests
- End-to-end override scenarios
- Rollout behavior with overrides
- Multiple cluster configurations

### E2E Tests
- Real cluster deployments with overrides
- Progressive rollout validation
- Performance impact assessment

## Migration and Rollout

### Phase 1: Development (4-6 weeks)
- Implement core API changes
- Develop override application logic
- Add basic validation

### Phase 2: Testing (2-3 weeks)
- Comprehensive test coverage
- Performance validation
- Documentation updates

### Phase 3: Release (1-2 weeks)
- Feature flag for gradual rollout
- Community feedback integration
- Production readiness validation

## Alternatives Considered

### 1. Cluster-Level Overrides
**Rejected**: While providing fine-grained control, cluster-level overrides would result in verbose configurations and reduced maintainability for large-scale deployments. ClusterSet-level overrides provide better logical grouping.

### 2. Placement-Level Overrides
**Rejected**: Would require changes to Placement API and doesn't provide the right level of granularity for environment-based configurations.

### 3. Separate Override Resources
**Rejected**: Increases complexity and makes it harder to manage overrides alongside MWRS.

### 4. Template-Based Approach
**Deferred**: Could be added in Phase 3 as an advanced feature, but JSON patches provide more flexibility initially.


## Conclusion

ClusterSet-level manifest overrides will significantly enhance ManifestWorkReplicaSet's flexibility while maintaining its core strengths of scalable, progressive workload deployment. This feature addresses real-world use cases where one-size-fits-all deployments are insufficient, enabling more sophisticated multi-cluster application management with logical grouping.

The proposed design is backward-compatible, performant, and extensible, providing a more maintainable approach than cluster-level overrides while still offering the necessary granularity for environment-based configurations. This makes it a valuable addition to OCM's workload management capabilities.
```yaml
title: manifestworkreplicaset-clusterset-override-provision
authors:
  - "@ncr38"
reviewers:
  - TBD
approvers:
  - TBD
creation-date: 2025-09-29
last-updated: 2025-09-29
status: provisional
```