NVIDIA/gpu-driver-container #412
Describe the bug
It seems that the NVIDIA Fabric Manager doesn't get started correctly, apparently because an invalid option is passed to its start script.
Here's a snippet from the logs of nvidia-gpu-operator/nvidia-driver-daemonset-5.15.0-151-generic-ubuntu22.04-dcl4t:
Starting NVIDIA fabric manager daemon for NVLink5+...
Unknown option: /usr/share/nvidia/nvswitch/fabricmanager.cfg
NVIDIA Fabric Manager Start Script
Warning: If you are not a fabric manager developer, please avoid running this script directly.
Please either start or stop using the nvidia-fabricmanager systemd service,
or refer to the fabric manager user guide for how to run the fabric manager program.
Usage: /usr/bin/nvidia-fabricmanager-start.sh [script options] --args [arguments to run fabric manager]
Script options:
-h, --help Show this help message
-d, --debug Enable debug output for this script
-m, --mode Mode to run this script (start|stop|interactive)
--fm-config-file Path to fabric manager configuration file
--fm-pid-file Path to fabric manager PID file
--nvlsm-config-file Path to NVLink Subnet Manager configuration file
--nvlsm-pid-file Path to NVLink Subnet Manager PID file
--args Trailing arguments to run fabric manager
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
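From the help text above, it looks like the configuration file path is being passed as a bare positional argument the script doesn't recognize, rather than via --fm-config-file. A rough sketch of the difference (the first command is my guess at what the daemonset runs, not taken from the actual container code; the option names in the second come from the help output above):

# Suspected failing invocation: the config path is not attached to any option
/usr/bin/nvidia-fabricmanager-start.sh /usr/share/nvidia/nvswitch/fabricmanager.cfg

# What the usage text suggests instead
/usr/bin/nvidia-fabricmanager-start.sh --mode start \
  --fm-config-file /usr/share/nvidia/nvswitch/fabricmanager.cfg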
To Reproduce
Deploy the GPU Operator on k3s on an NVLink system.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nvidia-gpu-operator
  namespace: default
spec:
  chart: gpu-operator
  repo: https://helm.ngc.nvidia.com/nvidia
  version: v25.3.2
  targetNamespace: nvidia-gpu-operator
  createNamespace: true
  valuesContent: |-
    cdi:
      enabled: true
      default: false
    driver:
      enabled: true
      usePrecompiled: true
      version: 570
    mig:
      strategy: single
    toolkit:
      enabled: false
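For reference, one way to have k3s apply this HelmChart manifest is to drop it into the server's manifests directory, which the bundled helm-controller watches (the file name here is arbitrary):

# On the k3s server node; files in this directory are applied automatically
cp nvidia-gpu-operator.yaml /var/lib/rancher/k3s/server/manifests/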
Expected behavior
- Fabric Manager starts up and configures the system.
- Or, failing that, the start script exits with an error (e.g. via bash -euo pipefail) instead of silently continuing.
Instead, I just get CUDA errors from my containers.
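To illustrate the second point, a minimal sketch of what a strict-mode entrypoint would do; this script is hypothetical, not the actual driver container entrypoint:

#!/usr/bin/env bash
# Hypothetical sketch: with errexit/pipefail, a failed fabric manager start
# aborts the container visibly instead of falling through to "waiting for signal".
set -euo pipefail

/usr/bin/nvidia-fabricmanager-start.sh --mode start \
  --fm-config-file /usr/share/nvidia/nvswitch/fabricmanager.cfg

echo "Fabric manager started, now waiting for signal"
sleep infinity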
I've verified this was the issue by running nvidia-fabricmanager outside k3s:
apt install nvidia-container-toolkit nvidia-container-runtime nvidia-fabricmanager
systemctl enable --now nvidia-fabricmanager
After that, my workloads started to run normally.
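A couple of quick host-side checks to confirm the workaround took effect (the nvidia-smi fabric query assumes a driver recent enough to report NVLink fabric state; on older drivers the grep simply matches nothing):

systemctl status nvidia-fabricmanager
# On NVSwitch systems, newer drivers expose a Fabric section with registration state
nvidia-smi -q | grep -i -A 3 fabric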
Environment (please provide the following information):
- GPU Operator Version: v25.3.2
- OS: Ubuntu 22.04 LTS
- Kernel Version: 5.15.0-151-generic
- Container Runtime Version:
  # k3s ctr version
  Client:
    Version: v2.0.5-k3s2
    Revision:
    Go version: go1.24.4
  Server:
    Version: v2.0.5-k3s2
    Revision:
    UUID: 8acbe27a-7e75-4ca6-98d2-e669fb7499b2
- Kubernetes Distro and Version: k3s
  # k3s --version
  k3s version v1.33.3+k3s1 (236cbf25)
  go version go1.24.4
Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- containerd logs: journalctl -u containerd > containerd.log