Skip to content

Conversation

@adilGhaffarDev
Copy link
Contributor

@adilGhaffarDev adilGhaffarDev commented Mar 24, 2025

This PR is changing how SetupVF sets up a VF in Pod netns.
Previous Approach:

  • Set link down
  • Set temp name on the link
  • Remove alt name
  • Change netns (move link to pod netns)
  • Rename link in netns to Pod interface name
  • … do other operations on the link in netns (Enable IPv4 ARP, Set MAC address, Enable Optimistic DAD etc)
  • Bring IF up in Pod netns

New approach:

  • Create new tempNS
  • Move link to tempNS
  • Rename the link in tempNS to pod interface name
  • Remove alt name from the link in tempNS
  • Move link in tempNS to pod netns
  • … do other operations on the link in netns (Enable IPv4 ARP, Set MAC address, Enable Optimistic DAD etc)
  • Bring IF up in Pod netns

Previously we were setting temp name of the link now we move the link to tempNS.
Why we are doing this?
Check this issue:
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1599
Rapid renaming of link causes race condition explained in the issue please check.
To solve the issue they used tempNS instead of renaming link: containernetworking/plugins#1073
We are trying to do the same in this PR.

Fixes #321

@adilGhaffarDev adilGhaffarDev force-pushed the fixes-321/adil branch 3 times, most recently from 0b97f8d to 0ffa338 Compare March 27, 2025 10:54
@adilGhaffarDev adilGhaffarDev force-pushed the fixes-321/adil branch 3 times, most recently from fdf11d7 to 14b8542 Compare April 15, 2025 08:04
@SchSeba
Copy link
Collaborator

SchSeba commented Apr 15, 2025

Hi @adilGhaffarDev can you please fix the unit-tests please

@adilGhaffarDev adilGhaffarDev force-pushed the fixes-321/adil branch 5 times, most recently from e89ae1f to a66752c Compare April 15, 2025 10:00
@adilGhaffarDev
Copy link
Contributor Author

Hi @adilGhaffarDev can you please fix the unit-tests please

I am working on them, I want to run the tests locally but it gives lot of permission related errors. example error:

  [FAILED] Unexpected error:
      <*errors.errorString | 0xc00030c730>: 
      failed to create the sriov lock directory("/tmp/ginkgo2916429793/pci/vf_lock"): mkdir /tmp/ginkgo2916429793/pci/vf_lock: permission denied
      {
          s: "failed to create the sriov lock directory(\"/tmp/ginkgo2916429793/pci/vf_lock\"): mkdir /tmp/ginkgo2916429793/pci/vf_lock: permission denied",
      }
  occurred
  In [It] at: /home/metal3ci/sriov-cni/pkg/config/config_test.go:38

Is it possible to run these tests locally? in VM?

@adilGhaffarDev
Copy link
Contributor Author

Hi @adilGhaffarDev can you please fix the unit-tests please

I am working on them, I want to run the tests locally but it gives lot of permission related errors. example error:

  [FAILED] Unexpected error:
      <*errors.errorString | 0xc00030c730>: 
      failed to create the sriov lock directory("/tmp/ginkgo2916429793/pci/vf_lock"): mkdir /tmp/ginkgo2916429793/pci/vf_lock: permission denied
      {
          s: "failed to create the sriov lock directory(\"/tmp/ginkgo2916429793/pci/vf_lock\"): mkdir /tmp/ginkgo2916429793/pci/vf_lock: permission denied",
      }
  occurred
  In [It] at: /home/metal3ci/sriov-cni/pkg/config/config_test.go:38

Is it possible to run these tests locally? in VM?

Nevermind it just needed sudo.

@adilGhaffarDev adilGhaffarDev force-pushed the fixes-321/adil branch 2 times, most recently from e213f83 to dd28ba8 Compare April 16, 2025 09:39
@adilGhaffarDev
Copy link
Contributor Author

Hi @adilGhaffarDev can you please fix the unit-tests please

All tests are now passing. Please take a look at the test case "should save all the netConf struct to the cache file without dns" in pkg/utils/utils_test.go. It was previously failing, although I hadn’t modified any related code. I’ve made some adjustments to address the failure, but I’m not entirely sure why it was failing in the first place.

@adilGhaffarDev
Copy link
Contributor Author

@adrianchiris @SchSeba please check

@adilGhaffarDev adilGhaffarDev changed the title use temp network namespace for rename Use temp network namespace for rename Apr 23, 2025
@adilGhaffarDev adilGhaffarDev force-pushed the fixes-321/adil branch 2 times, most recently from 82d6540 to d765377 Compare April 23, 2025 11:50
@SchSeba
Copy link
Collaborator

SchSeba commented Apr 23, 2025

Hi @adilGhaffarDev,
I can see the build is still failing please take a look :)

@adilGhaffarDev adilGhaffarDev force-pushed the fixes-321/adil branch 2 times, most recently from 7587548 to 5c28d18 Compare April 24, 2025 08:13
@coveralls
Copy link

coveralls commented Apr 24, 2025

Pull Request Test Coverage Report for Build 19429187032

Details

  • 43 of 86 (50.0%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.8%) to 62.135%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/sriov/sriov.go 43 86 50.0%
Files with Coverage Reduction New Missed Lines %
pkg/sriov/sriov.go 1 66.26%
Totals Coverage Status
Change from base Build 19405086786: -0.8%
Covered Lines: 978
Relevant Lines: 1574

💛 - Coveralls

@adilGhaffarDev
Copy link
Contributor Author

I can see the build is still failing please take a look :)

Now all issues are fixed 😄

Copy link
Contributor

@adrianchiris adrianchiris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review. i still need to take a look at ReleaseVF

_ = s.utils.EnableArpAndNdiscNotify(podifName)
"tempNSObj", tempNSObj,
"netns.Fd()", int(netns.Fd()))
if err := s.nLink.LinkSetNsFd(tempNSObj, int(netns.Fd())); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above we should use: err = ....

@zeeke
Copy link
Member

zeeke commented Sep 22, 2025

hi @adilGhaffarDev , I need a little clarification on the problem this PR is trying to solve, as I got no answer in #321 and the commit description does not describe it.

Even with that information, we have to simplify the SetupVF function: with a 5 defer statement insisting on the same err variable, it is pretty hard to determine every possible combination of error.

@adilGhaffarDev
Copy link
Contributor Author

hi @adilGhaffarDev , I need a little clarification on the problem this PR is trying to solve, as I got no answer in #321 and the commit description does not describe it.

Even with that information, we have to simplify the SetupVF function: with a 5 defer statement insisting on the same err variable, it is pretty hard to determine every possible combination of error.

I have added in the description of PR what this PR is doing and why it is needed.
I understand that defers are making it bit confusing but those are safeguards so nothing goes wrong. I am open to suggestions in improving the defer situation. I will also check myself.

@zeeke
Copy link
Member

zeeke commented Oct 2, 2025

my considerations are:
a. this PR removed the step LinkSetDown. I'm not sure why it was in place, and I'm ok with it if no one recalls why it was in place
b. Is the race condition we are trying to solve here about udev rules (like the PR description states)? or is it about having multiple CNI calls insisting on the same temp_XXX name?
c. can we avoid nested netns.Do(...) calls? They make the defer call chain very hard to understand. Can we call them sequentially?

@adilGhaffarDev adilGhaffarDev force-pushed the fixes-321/adil branch 3 times, most recently from fcc708a to 7eaa7c0 Compare October 13, 2025 16:31
@adilGhaffarDev
Copy link
Contributor Author

c. can we avoid nested netns.Do(...) calls? They make the defer call chain very hard to understand. Can we call them sequentially?

I have removed nested netns.Do() calls. Please check, now it's much simpler. Defers are removed, and now the code is easier to understand.

@adilGhaffarDev
Copy link
Contributor Author

@SchSeba @zeeke @adrianchiris . take a look when you get time.
Simplified the code and got rid of confusing defers.

Copy link
Member

@zeeke zeeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for simplifying the code. Left some comments

@adilGhaffarDev adilGhaffarDev force-pushed the fixes-321/adil branch 3 times, most recently from e6bb840 to b3a97c7 Compare October 23, 2025 09:35
@adilGhaffarDev
Copy link
Contributor Author

@zeeke comments are addressed.

@adilGhaffarDev
Copy link
Contributor Author

@SchSeba @adrianchiris looking for reviews.

Copy link
Member

@zeeke zeeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only one comment from my side.
thanks for all your rework

@adilGhaffarDev adilGhaffarDev force-pushed the fixes-321/adil branch 2 times, most recently from 628c3bd to 2cc273e Compare October 27, 2025 09:08
Copy link
Member

@zeeke zeeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for having addressed all my comments

@adilGhaffarDev
Copy link
Contributor Author

@SchSeba @adrianchiris can we merge this?

Copy link
Collaborator

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are getting closer I like this solution more it's easier to read and follow I would say :p

@adilGhaffarDev adilGhaffarDev force-pushed the fixes-321/adil branch 4 times, most recently from ddc32b3 to 719aae2 Compare November 12, 2025 09:26
@adilGhaffarDev
Copy link
Contributor Author

@SchSeba comments are addressed.

}

// Try to move interface back to initns
if nsLinkObj, e := s.nLink.LinkByName(linkName); e == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no it should be able to find because on line 203 we doing:
linkSetNameError := s.nLink.LinkSetName(nsLinkObj, linkName)
so at 211 we should be able to find the interface.
I am changing it bit to make it more clear.

Signed-off-by: Muhammad Adil Ghaffar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

temp interface name causes race conditions

6 participants