15 changes: 12 additions & 3 deletions README.md
@@ -16,7 +16,7 @@ NVIDIA NIC Configuration Operator uses the [Maintenance Operator](https://github.com/Mellanox/maintenance-operator)
### Prerequisites

* Kubernetes cluster
* [NVIDIA Network Operator](https://github.com/Mellanox/network-operator) deployed
* [NVIDIA Network Operator](https://github.com/Mellanox/network-operator) deployed. It is recommended to deploy the [DOCA-OFED driver](https://github.com/Mellanox/network-operator?tab=readme-ov-file#driver-containers)
* [Maintenance Operator](https://github.com/Mellanox/maintenance-operator) deployed

NVIDIA NIC Configuration Operator can be deployed as part of the [NIC Cluster Policy CRD](https://github.com/Mellanox/network-operator?tab=readme-ov-file#nicclusterpolicy-spec).
@@ -65,6 +65,7 @@ spec:
qos:
trust: dscp
pfc: "0,0,0,1,0,0,0,0"
tos: 0
gpuDirectOptimized:
enabled: true
env: Baremetal
@@ -92,9 +93,9 @@ spec:
* `ROCE_CC_PRIO_MASK_P1=255`, `ROCE_CC_PRIO_MASK_P2=255`
* `CNP_DSCP_P1=4`, `CNP_DSCP_P2=4`
* `CNP_802P_PRIO_P1=6`, `CNP_802P_PRIO_P2=6`
* Configure pfc (Priority Flow Control) for priority 3 and set trust to dscp on each PF
* Configure pfc (Priority Flow Control) for priority 3, set trust to dscp on each PF, and set ToS (Type of Service) to 0
* Non-persistent (needs to be reapplied after each boot)
* Users can override values via `trust` and `pfc` parameters
* Users can override values via `trust`, `pfc` and `tos` parameters
* Can only be enabled with `linkType=Ethernet`
* `gpuDirectOptimized`: performs GPU Direct optimizations. At the moment, only optimizations for the Baremetal environment are supported. If enabled, performs the following:
* Set nvconfig `ATS_ENABLED=0`
@@ -227,3 +228,11 @@ status:
#### Implementation details:

The NicDevice CRD is created and reconciled by the configuration daemon. The reconciliation logic scheme can be found [here](docs/nic-configuration-reconcile-diagram.png).

## Order of operations

To include the NIC Configuration Operator as part of network configuration workflows, a strict order of operations might need to be enforced. For example, the [SR-IOV Network Configuration Daemon](https://github.com/k8snetworkplumbingwg/sriov-network-operator) pod should start only AFTER the NIC Configuration Daemon has finished.
To indicate to dependent pods that NIC configuration is in progress, the operator manages the `network.nvidia.com/operator.nic-configuration.wait` label on the node. It has the value `false` when the requested NIC configuration has successfully been applied, and the value `true` while NIC configuration is in progress.
To use this mechanism, the next pods in the pipeline can add `network.nvidia.com/operator.nic-configuration.wait=false` to their node label selectors, as sketched below. That way, they will automatically be evicted from the node while the NICs are being configured.
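A minimal sketch of a dependent workload consuming the label via its node selector. The DaemonSet name, `app` label, and image are hypothetical placeholders; only the wait-label key comes from the operator:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sriov-config-daemon        # hypothetical dependent workload
spec:
  selector:
    matchLabels:
      app: sriov-config-daemon
  template:
    metadata:
      labels:
        app: sriov-config-daemon
    spec:
      nodeSelector:
        # The pod is scheduled only while NIC configuration is NOT in progress;
        # when the operator flips the label to "true", the pod is removed from the node.
        network.nvidia.com/operator.nic-configuration.wait: "false"
      containers:
        - name: config-daemon
          image: example.com/sriov-config-daemon:latest   # placeholder image
```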

The NIC Configuration Daemon itself relies on the `network.nvidia.com/operator.mofed.wait=false` label being present on the node, as it requires the DOCA-OFED driver to be running for some of the configurations.
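One way to express that dependency is through the daemon's own node selector. A sketch of a Helm values override, assuming the `configDaemon.nodeSelector` value exposed by the chart's daemonset template (shown later in this diff):

```yaml
# values.yaml (sketch)
configDaemon:
  nodeSelector:
    # Schedule the NIC Configuration Daemon only on nodes where the
    # DOCA-OFED driver has finished loading.
    network.nvidia.com/operator.mofed.wait: "false"
```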
8 changes: 8 additions & 0 deletions cmd/nic-configuration-daemon/main.go
@@ -38,6 +38,7 @@ import (
"github.com/Mellanox/nic-configuration-operator/api/v1alpha1"
"github.com/Mellanox/nic-configuration-operator/internal/controller"
"github.com/Mellanox/nic-configuration-operator/pkg/configuration"
"github.com/Mellanox/nic-configuration-operator/pkg/consts"
"github.com/Mellanox/nic-configuration-operator/pkg/devicediscovery"
"github.com/Mellanox/nic-configuration-operator/pkg/dms"
"github.com/Mellanox/nic-configuration-operator/pkg/firmware"
@@ -169,6 +170,13 @@ func main() {

ctx := ctrl.SetupSignalHandler()

// Set the nic configuration wait label on the node to true until desired configuration is confirmed to be applied
err = maintenanceManager.SetNodeWaitLabel(ctx, consts.LabelValueTrue)
if err != nil {
log.Log.Error(err, "failed to set the nic configuration wait label on the node to true")
os.Exit(1)
}

err = mgr.GetCache().IndexField(ctx, &v1alpha1.NicDevice{}, "status.node", func(o client.Object) []string {
return []string{o.(*v1alpha1.NicDevice).Status.Node}
})
@@ -19,6 +19,7 @@ spec:
kubectl.kubernetes.io/default-container: nic-configuration-daemon
labels:
control-plane: nic-configuration-daemon
nvidia.com/nic-configuration-daemon: ""
{{- include "nic-configuration-operator.selectorLabels" . | nindent 8 }}
spec:
nodeSelector: {{- toYaml .Values.configDaemon.nodeSelector | nindent 8 }}
4 changes: 4 additions & 0 deletions pkg/consts/consts.go
@@ -137,4 +137,8 @@ const (

OverlayNone = "none"
OverlayL3 = "l3"

NodeNicConfigurationWaitLabel = "network.nvidia.com/operator.nic-configuration.wait"
LabelValueTrue = "true"
LabelValueFalse = "false"
)
37 changes: 37 additions & 0 deletions pkg/maintenance/maintenancemanager.go
@@ -17,10 +17,13 @@ package maintenance

import (
"context"
"fmt"

maintenanceoperator "github.com/Mellanox/maintenance-operator/api/v1alpha1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/meta"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
"k8s.io/client-go/util/workqueue"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/event"
@@ -67,6 +70,7 @@ type MaintenanceManager interface {
ScheduleMaintenance(ctx context.Context) error
MaintenanceAllowed(ctx context.Context) (bool, error)
ReleaseMaintenance(ctx context.Context) error
SetNodeWaitLabel(ctx context.Context, value string) error
Reboot() error
}

@@ -132,6 +136,12 @@ func (m maintenanceManager) ScheduleMaintenance(ctx context.Context) error {
return err
}

err = m.SetNodeWaitLabel(ctx, consts.LabelValueTrue)
if err != nil {
log.Log.Error(err, "failed to set the nic configuration wait label on the node to true")
return err
}

return nil
}

@@ -179,6 +189,12 @@ func (m maintenanceManager) ReleaseMaintenance(ctx context.Context) error {
}
}

err = m.SetNodeWaitLabel(ctx, consts.LabelValueFalse)
if err != nil {
log.Log.Error(err, "failed to set the nic configuration wait label on the node to false")
return err
}

return nil
}

@@ -188,6 +204,27 @@ func (m maintenanceManager) Reboot() error {
return m.hostUtils.ScheduleReboot()
}

// SetNodeWaitLabel ensures the node has the network.nvidia.com/operator.nic-configuration.wait label with provided value.
// It performs a strategic merge patch and is idempotent when the label already has the desired value.
func (m maintenanceManager) SetNodeWaitLabel(ctx context.Context, value string) error {
log.Log.Info("maintenanceManager.SetNodeLabel()", "node", m.nodeName, "key", consts.NodeNicConfigurationWaitLabel, "value", value)

var patch []byte
if value == "" {
// Remove label when value is empty
patch = []byte(fmt.Sprintf(`{"metadata":{"labels":{%q: null}}}`, consts.NodeNicConfigurationWaitLabel))
} else {
// Set/update label
patch = []byte(fmt.Sprintf(`{"metadata":{"labels":{%q: %q}}}`, consts.NodeNicConfigurationWaitLabel, value))
}

if err := m.client.Patch(ctx, &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: m.nodeName}}, client.RawPatch(types.StrategicMergePatchType, patch)); err != nil {
log.Log.Error(err, "failed to patch node label", "node", m.nodeName, "key", consts.NodeNicConfigurationWaitLabel, "value", value)
return err
}
return nil
}

func New(client client.Client, hostUtils host.HostUtils, nodeName string, namespace string) MaintenanceManager {
return maintenanceManager{client: client, hostUtils: hostUtils, nodeName: nodeName, namespace: namespace}
}
194 changes: 194 additions & 0 deletions pkg/maintenance/maintenancemanager_test.go
@@ -0,0 +1,194 @@
// Copyright 2025 NVIDIA CORPORATION & AFFILIATES
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// SPDX-License-Identifier: Apache-2.0

package maintenance

import (
"context"
"fmt"

. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"

maintenanceoperator "github.com/Mellanox/maintenance-operator/api/v1alpha1"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/types"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/client/fake"

"github.com/Mellanox/nic-configuration-operator/pkg/consts"
hostmocks "github.com/Mellanox/nic-configuration-operator/pkg/host/mocks"
)

var _ = Describe("maintenanceManager", func() {
var (
ctx context.Context
scheme *runtime.Scheme
namespace string
nodeName string
)

BeforeEach(func() {
ctx = context.Background()
scheme = runtime.NewScheme()
Expect(corev1.AddToScheme(scheme)).To(Succeed())
Expect(maintenanceoperator.AddToScheme(scheme)).To(Succeed())
namespace = "test-ns"
nodeName = "test-node"
})

It("SetNodeLabel adds, updates and deletes a label via strategic merge patch", func() {
node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: nodeName}}
cl := fake.NewClientBuilder().WithScheme(scheme).WithObjects(node).Build()
m := maintenanceManager{client: cl, nodeName: nodeName}

// add
Expect(m.SetNodeWaitLabel(ctx, "value1")).To(Succeed())
updated := &corev1.Node{}
Expect(m.client.Get(ctx, types.NamespacedName{Name: nodeName}, updated)).To(Succeed())
Expect(updated.Labels).To(HaveKeyWithValue(consts.NodeNicConfigurationWaitLabel, "value1"))

// same value (no-op server-side)
Expect(m.SetNodeWaitLabel(ctx, "value1")).To(Succeed())

// update
Expect(m.SetNodeWaitLabel(ctx, "value2")).To(Succeed())
Expect(m.client.Get(ctx, types.NamespacedName{Name: nodeName}, updated)).To(Succeed())
Expect(updated.Labels).To(HaveKeyWithValue(consts.NodeNicConfigurationWaitLabel, "value2"))

// delete
Expect(m.SetNodeWaitLabel(ctx, "")).To(Succeed())
Expect(m.client.Get(ctx, types.NamespacedName{Name: nodeName}, updated)).To(Succeed())
Expect(updated.Labels).ToNot(HaveKey(consts.NodeNicConfigurationWaitLabel))
})

It("schedules maintenance and sets the wait label; second call is idempotent", func() {
node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: nodeName}}
cl := fake.NewClientBuilder().WithScheme(scheme).WithObjects(node).Build()
m := maintenanceManager{client: cl, nodeName: nodeName, namespace: namespace}

// first schedule creates one object and sets wait label true
Expect(m.ScheduleMaintenance(ctx)).To(Succeed())

nmList := &maintenanceoperator.NodeMaintenanceList{}
Expect(cl.List(ctx, nmList, clientInNamespace(namespace))).To(Succeed())
Expect(nmList.Items).To(HaveLen(1))
Expect(nmList.Items[0].Spec.NodeName).To(Equal(nodeName))
Expect(nmList.Items[0].Spec.RequestorID).To(Equal(consts.MaintenanceRequestor))

updated := &corev1.Node{}
Expect(cl.Get(ctx, types.NamespacedName{Name: nodeName}, updated)).To(Succeed())
Expect(updated.Labels).To(HaveKeyWithValue(consts.NodeNicConfigurationWaitLabel, consts.LabelValueTrue))

// second schedule is a no-op and label remains true
Expect(m.ScheduleMaintenance(ctx)).To(Succeed())
nmList = &maintenanceoperator.NodeMaintenanceList{}
Expect(cl.List(ctx, nmList, clientInNamespace(namespace))).To(Succeed())
Expect(nmList.Items).To(HaveLen(1))
Expect(cl.Get(ctx, types.NamespacedName{Name: nodeName}, updated)).To(Succeed())
Expect(updated.Labels).To(HaveKeyWithValue(consts.NodeNicConfigurationWaitLabel, consts.LabelValueTrue))
})

It("reports maintenance allowed only when Ready condition is true", func() {
node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: nodeName}}
cl := fake.NewClientBuilder().WithScheme(scheme).WithObjects(node).Build()
m := maintenanceManager{client: cl, nodeName: nodeName, namespace: namespace}

// no object
allowed, err := m.MaintenanceAllowed(ctx)
Expect(err).To(BeNil())
Expect(allowed).To(BeFalse())

// object without Ready condition
nm := &maintenanceoperator.NodeMaintenance{
ObjectMeta: metav1.ObjectMeta{Name: consts.MaintenanceRequestName + "-" + nodeName, Namespace: namespace},
Spec: maintenanceoperator.NodeMaintenanceSpec{RequestorID: consts.MaintenanceRequestor, NodeName: nodeName},
}
cl = fake.NewClientBuilder().WithScheme(scheme).WithObjects(node, nm).Build()
m.client = cl

allowed, err = m.MaintenanceAllowed(ctx)
Expect(err).To(BeNil())
Expect(allowed).To(BeFalse())

// object with Ready=false
nm.Status.Conditions = []metav1.Condition{{Type: maintenanceoperator.ConditionTypeReady, Status: metav1.ConditionFalse}}
cl = fake.NewClientBuilder().WithScheme(scheme).WithObjects(node, nm).Build()
m.client = cl
allowed, err = m.MaintenanceAllowed(ctx)
Expect(err).To(BeNil())
Expect(allowed).To(BeFalse())

// object with Ready=true
nm.Status.Conditions = []metav1.Condition{{Type: maintenanceoperator.ConditionTypeReady, Status: metav1.ConditionTrue}}
cl = fake.NewClientBuilder().WithScheme(scheme).WithObjects(node, nm).Build()
m.client = cl
allowed, err = m.MaintenanceAllowed(ctx)
Expect(err).To(BeNil())
Expect(allowed).To(BeTrue())
})

It("releases maintenance and clears the wait label when present", func() {
node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: nodeName}}
nm := &maintenanceoperator.NodeMaintenance{
ObjectMeta: metav1.ObjectMeta{Name: consts.MaintenanceRequestName + "-" + nodeName, Namespace: namespace},
Spec: maintenanceoperator.NodeMaintenanceSpec{RequestorID: consts.MaintenanceRequestor, NodeName: nodeName},
}
cl := fake.NewClientBuilder().WithScheme(scheme).WithObjects(node, nm).Build()
m := maintenanceManager{client: cl, nodeName: nodeName, namespace: namespace}

// ensure label is set true first (simulate schedule)
Expect(m.SetNodeWaitLabel(ctx, consts.LabelValueTrue)).To(Succeed())

// release maintenance should delete object and set label false
Expect(m.ReleaseMaintenance(ctx)).To(Succeed())
nmList := &maintenanceoperator.NodeMaintenanceList{}
Expect(cl.List(ctx, nmList, clientInNamespace(namespace))).To(Succeed())
Expect(nmList.Items).To(HaveLen(0))

updated := &corev1.Node{}
Expect(cl.Get(ctx, types.NamespacedName{Name: nodeName}, updated)).To(Succeed())
Expect(updated.Labels).To(HaveKeyWithValue(consts.NodeNicConfigurationWaitLabel, consts.LabelValueFalse))
})

It("calls host utils to reboot and propagates errors", func() {
mockHU := &hostmocks.HostUtils{}
mockHU.On("ScheduleReboot").Return(nil).Once()
m := maintenanceManager{hostUtils: mockHU}
Expect(m.Reboot()).To(Succeed())
mockHU.AssertExpectations(GinkgoT())

mockHU2 := &hostmocks.HostUtils{}
rebootErr := fmt.Errorf("reboot failed")
mockHU2.On("ScheduleReboot").Return(rebootErr).Once()
m = maintenanceManager{hostUtils: mockHU2}
Expect(m.Reboot()).To(MatchError(rebootErr))
mockHU2.AssertExpectations(GinkgoT())
})
})

// helpers
func clientInNamespace(ns string) clientListOptionInNamespace {
return clientListOptionInNamespace{Namespace: ns}
}

type clientListOptionInNamespace struct{ Namespace string }

func (o clientListOptionInNamespace) ApplyToList(opts *client.ListOptions) {
opts.Namespace = o.Namespace
}
18 changes: 18 additions & 0 deletions pkg/maintenance/mocks/MaintenanceManager.go

Some generated files are not rendered by default.