Description
What happened:
When using GPU Direct Storage with ENABLE_NFSRDMA or ENABLE_NFSRDMA_NO_NVME, removing the NicClusterPolicy unloads the out-of-tree drivers and loads the OpenShift/RHCOS in-tree drivers back in, but it fails to load rpcrdma again.
On a default OpenShift install, rpcrdma is loaded. Installing a NicClusterPolicy unloads the in-tree drivers and their dependencies, including rpcrdma, and then loads the out-of-tree drivers and their dependencies, including the out-of-tree rpcrdma; this is what enables GPU Direct Storage over NFS. However, when the NicClusterPolicy is removed to return the system to its default state, the original in-tree drivers are reloaded but rpcrdma is not. It either has to be loaded manually with modprobe (see the sketch below) or the node has to be rebooted.
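For reference, a minimal sketch of the manual workaround, assuming cluster-admin access via `oc debug`; the node name is a placeholder:

```bash
# Manual workaround sketch (node name is a placeholder for an affected worker).
# Open a debug shell on the node and chroot into the host filesystem.
oc debug node/<affected-node> -- chroot /host /bin/sh -c '
  lsmod | grep rpcrdma || echo "rpcrdma not loaded"  # confirm it is missing
  modprobe rpcrdma                                   # reload the in-tree module
  lsmod | grep rpcrdma                               # verify it is back
'
```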
What you expected to happen:
The expectation is that any load/unload of drivers and their dependencies returns the dependency state to what it was before: every module that was loaded prior to the swap should be loaded again afterwards.
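As a rough illustration of what restoring dependency state could look like (this is not the operator's actual code), a bash sketch that snapshots the loaded modules before the driver swap and reloads whatever is missing afterwards:

```bash
# Sketch only: snapshot loaded modules before the driver swap.
lsmod | awk 'NR>1 {print $1}' | sort > /tmp/modules.before

# ... out-of-tree drivers loaded here, then unloaded again on policy removal ...

# After the in-tree drivers are restored, reload any module that was present
# before the swap but is no longer loaded.
comm -23 /tmp/modules.before <(lsmod | awk 'NR>1 {print $1}' | sort) |
while read -r mod; do
    modprobe "$mod"
done
```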
How to reproduce it (as minimally and precisely as possible):
Install OpenShift, deploy Network Operator 25.7 with a NicClusterPolicy that sets ENABLE_NFSRDMA, then remove the NicClusterPolicy and observe that `lsmod | grep rpcrdma` in the mofed container produces no output (see the commands sketched below).
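A sketch of the verification commands; the pod, node, and policy names are placeholders, and the NicClusterPolicy name may differ in your deployment:

```bash
# Before removal: the out-of-tree module is loaded (check from the mofed pod).
oc exec -n nvidia-network-operator <mofed-pod> -- lsmod | grep rpcrdma

# Remove the policy (name is assumed; list with `oc get nicclusterpolicy`).
oc delete nicclusterpolicy nic-cluster-policy

# After the in-tree drivers are restored, the module is missing. Check in the
# mofed container while it is still present, or directly on the node:
oc debug node/<affected-node> -- chroot /host lsmod | grep rpcrdma  # no output
```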
Anything else we need to know?:
Logs:
On request
Environment:
- Hardware: Dell R760xa
- NIC: Mellanox CX7 or BF3
- Network Operator: 25.7
- OpenShift: 4.19