Contributor

@heyvister1 heyvister1 commented Oct 9, 2025

The SR-IOV operator's internal drain controller can be disabled through the USE_EXTERNAL_DRAINER environment variable. For example, draining can then be performed by the external NVIDIA maintenance operator.

github-actions bot commented Oct 9, 2025

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.

Best regards.

@github-actions github-actions bot added the docs label Oct 9, 2025
@heyvister1 heyvister1 force-pushed the disable-drain-controller branch 2 times, most recently from 30699af to cc4a257 Compare October 9, 2025 09:52
coveralls commented Oct 9, 2025

Pull Request Test Coverage Report for Build 19669073220

Details

  • 14 of 21 (66.67%) changed or added relevant lines in 4 files are covered.
  • 15 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.04%) to 61.986%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/daemon/daemon.go | 10 | 13 | 76.92% |
| controllers/drain_controller.go | 1 | 5 | 20.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| controllers/helper.go | 2 | 70.61% |
| pkg/utils/cluster.go | 2 | 85.57% |
| pkg/daemon/daemon.go | 11 | 44.16% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 19668853053 | -0.04% |
| Covered Lines | 8763 |
| Relevant Lines | 14137 |

💛 - Coveralls

Collaborator

@SchSeba SchSeba left a comment

In general I am fine with having this here, but can we please have a more generic name for the variable?


// UseMaintenanceOperatorDrainer indicates if internal drain controller is disabled
// and draining will be done by external NVIDIA maintenance operator
func UseMaintenanceOperatorDrainer() bool {
Collaborator

Can we please have this in the vars and consts folder and not here?

Contributor Author

done

@heyvister1 heyvister1 force-pushed the disable-drain-controller branch from cc4a257 to 62632d8 Compare October 16, 2025 12:19
@heyvister1 heyvister1 force-pushed the disable-drain-controller branch from 62632d8 to 237e08b Compare October 16, 2025 12:25
@heyvister1 heyvister1 force-pushed the disable-drain-controller branch 2 times, most recently from 16474e5 to cd01dcf Compare October 16, 2025 14:52
@heyvister1 heyvister1 changed the title Support node draining by external NVIDIA maintenance OP Setting option for node draining by external controllers Oct 16, 2025
@github-actions github-actions bot added the tests label Oct 17, 2025
@heyvister1 heyvister1 force-pushed the disable-drain-controller branch from a7fbd43 to c3d7edf Compare October 17, 2025 08:23
@heyvister1 heyvister1 force-pushed the disable-drain-controller branch 4 times, most recently from daaf0f6 to 7fed536 Compare October 17, 2025 09:48
Contributor Author

/test-all


// add external drainer annotation if enabled
if vars.UseExternalDrainer {
if err := utils.AnnotateNode(ctx,
Collaborator

@adrianchiris adrianchiris Oct 19, 2025

Can we annotate nodeState instead?

We already store the related drain state in the nodeState object.

Contributor Author

Done

}
}

func setupDrainController(mgr ctrl.Manager, restConfig *rest.Config,
Collaborator

Perhaps we should take a slightly different approach.

We could always start the drain controller and then skip drain requests if the use-external-drainer annotation is set.

That way, any "in-flight" drains will complete even if we switch on the external-drainer functionality.

WDYT?
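
One way to picture this suggestion is a small predicate that the always-running controller consults before acting. This is only a sketch: the annotation key, the state strings and the function `shouldSkipDrain` are hypothetical stand-ins, not identifiers taken from the operator.

```go
package main

import "fmt"

const (
	// Hypothetical annotation key; the real key is not shown in this thread.
	externalDrainerAnnotation = "example.com/use-external-drainer"
	// Assumed state values for the current drain state of a node.
	stateIdle     = "Idle"
	stateDraining = "Draining"
)

// shouldSkipDrain reports whether the internal controller should leave a
// request to the external drainer. A drain that is already in flight is never
// skipped, so it can complete after the flag is switched on.
func shouldSkipDrain(annotations map[string]string, currentState string) bool {
	if annotations[externalDrainerAnnotation] != "true" {
		return false // external drainer not enabled: handle as usual
	}
	return currentState != stateDraining
}

func main() {
	enabled := map[string]string{externalDrainerAnnotation: "true"}
	fmt.Println("new request skipped:", shouldSkipDrain(enabled, stateIdle))
	fmt.Println("in-flight drain skipped:", shouldSkipDrain(enabled, stateDraining))
}
```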

Member

what about using the same annotations as the regular drain flow?

  • the daemon would set the sriovnetworknodestate.DesiredState == "Drain_Required"
  • the in-operator drainer would do nothing
  • the external drainer would drain the node using its own logic, then set the node.CurrentState=DrainComplete

would it be a cleaner implementation?
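
The hand-off described in the list above can be sketched as a tiny state model. The state strings follow the ones quoted in the comment; the types and function names are hypothetical and only illustrate who writes which field.

```go
package main

import "fmt"

// State strings as quoted in the comment above; exact spellings are assumed.
const (
	idle          = "Idle"
	drainRequired = "Drain_Required"
	drainComplete = "DrainComplete"
)

// nodeState is a stand-in for the desired/current drain state of one node.
type nodeState struct {
	Desired string
	Current string
}

// daemonRequestDrain models the config daemon asking for a drain.
func daemonRequestDrain(s *nodeState) { s.Desired = drainRequired }

// internalDrainer does nothing when an external drainer is in charge.
// It reports whether it handled the request.
func internalDrainer(s *nodeState, useExternal bool) bool {
	if useExternal || s.Desired != drainRequired {
		return false
	}
	s.Current = drainComplete
	return true
}

// externalDrainer models an external controller that drains with its own
// logic and then marks the drain as complete.
func externalDrainer(s *nodeState) bool {
	if s.Desired != drainRequired || s.Current == drainComplete {
		return false
	}
	s.Current = drainComplete
	return true
}

// daemonCanContinue is how the daemon would learn that the drain is done.
func daemonCanContinue(s *nodeState) bool {
	return s.Desired == drainRequired && s.Current == drainComplete
}

func main() {
	s := &nodeState{Desired: idle, Current: idle}
	daemonRequestDrain(s)
	fmt.Println("internal handled:", internalDrainer(s, true))
	fmt.Println("can continue before external drain:", daemonCanContinue(s))
	externalDrainer(s)
	fmt.Println("can continue after external drain:", daemonCanContinue(s))
}
```

In this model the daemon needs no extra annotation to know when to continue, since it only watches the same current state it watches today.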

@heyvister1 heyvister1 force-pushed the disable-drain-controller branch from 7fed536 to ade293d Compare October 19, 2025 10:13
@github-actions github-actions bot added the ci label Oct 20, 2025
@heyvister1 heyvister1 force-pushed the disable-drain-controller branch from b2877d5 to 40509de Compare October 20, 2025 08:53
@github-actions github-actions bot removed the ci label Oct 20, 2025
@heyvister1 heyvister1 force-pushed the disable-drain-controller branch 3 times, most recently from 62ddbb0 to 24a2bfe Compare October 20, 2025 10:02
Collaborator

@e0ne e0ne left a comment

LGTM

@heyvister1 heyvister1 force-pushed the disable-drain-controller branch 6 times, most recently from b8c3b8c to 721a627 Compare October 24, 2025 06:04
Collaborator

/test-all

@heyvister1 heyvister1 force-pushed the disable-drain-controller branch 3 times, most recently from 914739d to 30a21f9 Compare October 28, 2025 12:40

*NOTE:* In the future we are going to drop the node annotation and only use the SriovNetworkNodeState

*NOTE:* The internal drain controller can be disabled by setting the `USE_EXTERNAL_DRAINER` environment variable. Drain operations are then performed externally, for example by the [NVIDIA maintenance OP](https://github.com/Mellanox/maintenance-operator). In that case `SriovNetworkPoolConfig` has no effect during the drain procedure, since the maintenance operator is in charge of parallel node operations.
Collaborator

I don't think we ever talked about changing existing design docs when a new addition changes that feature's behaviour.
I guess it's OK, but it might become unmanageable in the future.

@SchSeba @zeeke thoughts? Should we have this addition here, or just in the main README?

Collaborator

I am fine with adding that change here as well, not only in the README, and with updating the updatedDate in the doc.

func setupDrainController(mgr ctrl.Manager, restConfig *rest.Config,
platformsHelper platforms.Interface, scheme *runtime.Scheme) error {
if vars.UseExternalDrainer {
setupLog.Info("'UseExternalDrainer' is set, draining will be done externally")
Collaborator

Please add a note on why we still set up the drain controller here even if UseExternalDrainer is set.

Contributor Author

Done

Collaborator

It's more complicated.

If we are in the middle of a configuration and we update the config-daemon YAML, it will start new pods that will add the label.
In that case we will be in the middle of a configuration with the two labels in parallel.

Collaborator

@adrianchiris adrianchiris left a comment

LGTM. Added a minor nit to clarify why we need to start the drain controller when the external drainer is used.

@heyvister1 heyvister1 force-pushed the disable-drain-controller branch from 30a21f9 to fd9607c Compare November 10, 2025 06:51
}

- func createNode(ctx context.Context, nodeName string) (*corev1.Node, *sriovnetworkv1.SriovNetworkNodeState) {
+ func createNode(ctx context.Context, nodeName string, useExternalDrainer bool) (*corev1.Node, *sriovnetworkv1.SriovNetworkNodeState) {
Member

Please avoid the boolean parameter, as it reduces readability.
As alternatives:
a. change the parameter to additionalAnnotation map[string]string
b. add a tweak parameter like tweak func(*corev1.Node, *sriovnetworkv1.SriovNetworkNodeState), so that a function can customize the creation of the node objects
c. remove the parameter and update the k8s objects after creation

I prefer b, but the others would work too.
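
Option b could look roughly like the sketch below. To keep it self-contained it uses local stand-in structs instead of `corev1.Node` and `SriovNetworkNodeState`, takes the tweaks as a variadic parameter (a small variation on the single `tweak` argument proposed above) and uses a hypothetical annotation key.

```go
package main

import "fmt"

// Hypothetical annotation key used only for this illustration.
const externalDrainerKey = "example.com/use-external-drainer"

// node and nodeState are local stand-ins for the Kubernetes objects.
type node struct {
	Name        string
	Annotations map[string]string
}

type nodeState struct {
	Name        string
	Annotations map[string]string
}

// tweak customizes the objects before they are returned to the test.
type tweak func(*node, *nodeState)

// createNode builds both objects and applies every tweak in order.
// Existing callers that pass no tweak keep working unchanged.
func createNode(name string, tweaks ...tweak) (*node, *nodeState) {
	n := &node{Name: name, Annotations: map[string]string{}}
	s := &nodeState{Name: name, Annotations: map[string]string{}}
	for _, t := range tweaks {
		t(n, s)
	}
	return n, s
}

// withExternalDrainer marks the node state as handled by an external drainer.
func withExternalDrainer(_ *node, s *nodeState) {
	s.Annotations[externalDrainerKey] = "true"
}

func main() {
	_, plain := createNode("node1")
	_, external := createNode("node2", withExternalDrainer)
	fmt.Println(len(plain.Annotations), external.Annotations[externalDrainerKey])
}
```

A call site then reads `createNode("node2", withExternalDrainer)`, which states its intent where a bare `true` would not.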

Name: nodeName,
Namespace: vars.Namespace,
- Labels: map[string]string{
+ Annotations: map[string]string{
Member

Any idea about how these tests were working before this line change?

Collaborator

If I remember right, the controller adds them if they don't exist.

Collaborator

@SchSeba SchSeba left a comment

I must say I don't follow this, sorry.

The daemon will still request a drain, and this check https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/952/files#diff-a53b7b593d3d778e62eaeeafa40088656f9212bfa2c2b7991df15fa78e60b0f0R256
will never pass.

also you can have a race as we do the annotation update twice in

	// add external drainer nodestate annotation if flag is enabled
	if vars.UseExternalDrainer {
		err := utils.AnnotateObject(ctx, desiredNodeState,
			consts.NodeStateExternalDrainerAnnotation, "true", dn.client)
		if err != nil {
			funcLog.Error(err, "failed to add nodestate external drainer annotation")
			return false, err
		}
	}

	// annotate both node and node state with drain or reboot
	annotation := consts.DrainRequired
	if reqReboot {
		annotation = consts.RebootRequired
	}
	return true, dn.annotate(ctx, desiredNodeState, annotation)
}

And I don't understand how the daemon will know that it can continue with the configuration once the drain is done.
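
One possible way to avoid the double update mentioned above is to collect both annotations and write them in a single call. The sketch below uses a stand-in object with an update counter; the keys and the `requestDrain` helper are hypothetical, and whether a single write fits the operator's `annotate` helper is an open question for the author.

```go
package main

import "fmt"

// Hypothetical annotation keys used only for this illustration.
const (
	stateKey           = "example.com/desired-state"
	externalDrainerKey = "example.com/use-external-drainer"
)

// object is a stand-in for an API object; updates counts API writes.
type object struct {
	Annotations map[string]string
	updates     int
}

// update stands in for one write to the API server.
func (o *object) update(changes map[string]string) {
	if o.Annotations == nil {
		o.Annotations = map[string]string{}
	}
	for k, v := range changes {
		o.Annotations[k] = v
	}
	o.updates++
}

// requestDrain gathers every annotation first and writes them once, so no
// observer can see the object with only one of the two annotations set.
func requestDrain(o *object, useExternalDrainer, reqReboot bool) {
	changes := map[string]string{stateKey: "Drain_Required"}
	if reqReboot {
		changes[stateKey] = "Reboot_Required"
	}
	if useExternalDrainer {
		changes[externalDrainerKey] = "true"
	}
	o.update(changes)
}

func main() {
	o := &object{}
	requestDrain(o, true, false)
	fmt.Println("writes:", o.updates, "annotations:", len(o.Annotations))
}
```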

// remove external drainer nodestate annotation if exists
annotations := desiredNodeState.GetAnnotations()
if _, ok := annotations[consts.NodeStateExternalDrainerAnnotation]; ok {
Collaborator

can you explain why we need this here?

…sable SRIOV OP drain controller, in favor of using maintenance OP to drive node drain aspects

Signed-off-by: Ido Heyvi <[email protected]>
…rnal-drainer=true' in case external drainer is enabled

The motivation is for external drainer verification, that SRIOV operator is set with external drainer
Signed-off-by: Ido Heyvi <[email protected]>
@heyvister1 heyvister1 force-pushed the disable-drain-controller branch from fd9607c to c050234 Compare November 25, 2025 12:11
7 participants