
FabricManager doesn't get started correctly (and no error handling) #1595

@cfstras

Description


Describe the bug

It seems that the NVIDIA fabric manager doesn't get started correctly, apparently because an invalid option is passed to its start script.

Here's a snippet from the logs of nvidia-gpu-operator/nvidia-driver-daemonset-5.15.0-151-generic-ubuntu22.04-dcl4t:

Starting NVIDIA fabric manager daemon for NVLink5+...
Unknown option: /usr/share/nvidia/nvswitch/fabricmanager.cfg
NVIDIA Fabric Manager Start Script
Warning: If you are not a fabric manager developer, please avoid running this script directly.
Please either start or stop using the nvidia-fabricmanager systemd service,
or refer to the fabric manager user guide for how to run the fabric manager program.

Usage: /usr/bin/nvidia-fabricmanager-start.sh [script options] --args [arguments to run fabric manager]
Script options:
  -h, --help              Show this help message
  -d, --debug             Enable debug output for this script
  -m, --mode              Mode to run this script (start|stop|interactive)
  --fm-config-file        Path to fabric manager configuration file
  --fm-pid-file           Path to fabric manager PID file
  --nvlsm-config-file     Path to NVLink Subnet Manager configuration file
  --nvlsm-pid-file        Path to NVLink Subnet Manager PID file
  --args                  Trailing arguments to run fabric manager
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
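
Reading the usage text, the "Unknown option" line suggests the driver container invokes the start script with the config path as a bare positional argument rather than behind --fm-config-file. A hedged sketch of the two invocations (the actual call site inside the driver image is an assumption on my part):

# What the log suggests is happening (config path passed as a bare, unknown option):
/usr/bin/nvidia-fabricmanager-start.sh /usr/share/nvidia/nvswitch/fabricmanager.cfg

# What the usage text above implies instead (assumption based only on that help output):
/usr/bin/nvidia-fabricmanager-start.sh --mode start \
    --fm-config-file /usr/share/nvidia/nvswitch/fabricmanager.cfg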

To Reproduce
Deploy the GPU Operator on k3s on an NVLink system.

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nvidia-gpu-operator
  namespace: default
spec:
  chart: gpu-operator
  repo: https://helm.ngc.nvidia.com/nvidia
  version: v25.3.2
  targetNamespace: nvidia-gpu-operator
  createNamespace: true
  valuesContent: |-
    cdi:
      enabled: true
      default: false
    driver:
      enabled: true
      usePrecompiled: true
      version: 570

    mig:
      strategy: single
    toolkit:
      enabled: false
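
For reference, this HelmChart manifest goes through k3s's built-in Helm controller; it can be applied with kubectl or dropped into the k3s auto-deploy directory (the filename below is arbitrary):

# Either apply the HelmChart resource directly...
kubectl apply -f nvidia-gpu-operator.yaml
# ...or let k3s pick it up from its auto-deploy manifests directory:
sudo cp nvidia-gpu-operator.yaml /var/lib/rancher/k3s/server/manifests/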

Expected behavior

  • Fabric Manager starts up and configures the system.
  • Or, at minimum, the start script fails loudly (bash with set -euo pipefail, or similar; see the sketch below) instead of continuing as if nothing happened.

Instead, I just get CUDA errors from my containers.
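
On the error-handling bullet, this is the kind of guard I mean; a minimal sketch, not the operator's actual entrypoint (the invocation is assumed from the usage text in the log above):

#!/usr/bin/env bash
# Abort on any command failure, unset variable, or pipeline error, so an
# "Unknown option" from the start script stops the container instead of
# scrolling past in the logs.
set -euo pipefail

/usr/bin/nvidia-fabricmanager-start.sh --mode start \
    --fm-config-file /usr/share/nvidia/nvswitch/fabricmanager.cfg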

I've verified this was the issue by running nvidia-fabricmanager outside k3s:

apt install nvidia-container-toolkit nvidia-container-runtime nvidia-fabricmanager
systemctl enable --now nvidia-fabricmanager

After that, my workloads started to run normally.
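
To confirm Fabric Manager actually came up on the host, something like the following works (the Fabric section of nvidia-smi -q is an assumption on my part; it appears on NVSwitch-based systems with recent drivers):

systemctl status nvidia-fabricmanager
# On NVSwitch systems the fabric state should eventually read "Completed":
nvidia-smi -q | grep -A 2 -i fabric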

Environment (please provide the following information):

  • GPU Operator Version: v25.3.2
  • OS: Ubuntu 22.04 LTS
  • Kernel Version: 5.15.0-151-generic
  • Container Runtime Version:
    # k3s ctr version
    Client:
      Version:  v2.0.5-k3s2
      Revision:
      Go version: go1.24.4
    
    Server:
      Version:  v2.0.5-k3s2
      Revision:
      UUID: 8acbe27a-7e75-4ca6-98d2-e669fb7499b2
    
  • Kubernetes Distro and Version: k3s
    # k3s --version
    k3s version v1.33.3+k3s1 (236cbf25)
    go version go1.24.4
    

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

Attachment: scratch_26.txt
