
FabricManager doesn't get started correctly (and no error handling) #1595

@cfstras

Description


Describe the bug

It seems that the NVIDIA fabric manager doesn't get started correctly, apparently because an invalid option is passed to its start script.

Here's a snippet from the logs of nvidia-gpu-operator/nvidia-driver-daemonset-5.15.0-151-generic-ubuntu22.04-dcl4t:

Starting NVIDIA fabric manager daemon for NVLink5+...
Unknown option: /usr/share/nvidia/nvswitch/fabricmanager.cfg
NVIDIA Fabric Manager Start Script
Warning: If you are not a fabric manager developer, please avoid running this script directly.
Please either start or stop using the nvidia-fabricmanager systemd service,
or refer to the fabric manager user guide for how to run the fabric manager program.

Usage: /usr/bin/nvidia-fabricmanager-start.sh [script options] --args [arguments to run fabric manager]
Script options:
  -h, --help              Show this help message
  -d, --debug             Enable debug output for this script
  -m, --mode              Mode to run this script (start|stop|interactive)
  --fm-config-file        Path to fabric manager configuration file
  --fm-pid-file           Path to fabric manager PID file
  --nvlsm-config-file     Path to NVLink Subnet Manager configuration file
  --nvlsm-pid-file        Path to NVLink Subnet Manager PID file
  --args                  Trailing arguments to run fabric manager
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
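
Reading the usage text, the "Unknown option" line suggests the driver container invokes the start script with the config path as a bare positional argument rather than behind --fm-config-file. A hedged sketch of the two invocations (the actual call site inside the driver image is an assumption on my part):

# What the log suggests is happening (config path passed as a bare, unknown option):
/usr/bin/nvidia-fabricmanager-start.sh /usr/share/nvidia/nvswitch/fabricmanager.cfg

# What the usage text above implies instead (assumption based only on that help output):
/usr/bin/nvidia-fabricmanager-start.sh --mode start \
    --fm-config-file /usr/share/nvidia/nvswitch/fabricmanager.cfg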

To Reproduce
Deploy the GPU Operator on k3s on an NVLink system.

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nvidia-gpu-operator
  namespace: default
spec:
  chart: gpu-operator
  repo: https://helm.ngc.nvidia.com/nvidia
  version: v25.3.2
  targetNamespace: nvidia-gpu-operator
  createNamespace: true
  valuesContent: |-
    cdi:
      enabled: true
      default: false
    driver:
      enabled: true
      usePrecompiled: true
      version: 570

    mig:
      strategy: single
    toolkit:
      enabled: false
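
For reference, this HelmChart manifest goes through k3s's built-in Helm controller; it can be applied with kubectl or dropped into the k3s auto-deploy directory (the filename below is arbitrary):

# Either apply the HelmChart resource directly...
kubectl apply -f nvidia-gpu-operator.yaml
# ...or let k3s pick it up from its auto-deploy manifests directory:
sudo cp nvidia-gpu-operator.yaml /var/lib/rancher/k3s/server/manifests/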

Expected behavior

  • Fabric Manager starts up and configures the system.
  • Or, at minimum, the start script fails loudly (bash with set -euo pipefail, or similar; see the sketch below) instead of continuing as if nothing happened.

Instead, I just get CUDA errors from my containers.
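
On the error-handling bullet, this is the kind of guard I mean; a minimal sketch, not the operator's actual entrypoint (the invocation is assumed from the usage text in the log above):

#!/usr/bin/env bash
# Abort on any command failure, unset variable, or pipeline error, so an
# "Unknown option" from the start script stops the container instead of
# scrolling past in the logs.
set -euo pipefail

/usr/bin/nvidia-fabricmanager-start.sh --mode start \
    --fm-config-file /usr/share/nvidia/nvswitch/fabricmanager.cfg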

I've verified this was the issue by running nvidia-fabricmanager outside k3s:

apt install nvidia-container-toolkit nvidia-container-runtime nvidia-fabricmanager
systemctl enable --now nvidia-fabricmanager

After that, my workloads started to run normally.
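
To confirm Fabric Manager actually came up on the host, something like the following works (the Fabric section of nvidia-smi -q is an assumption on my part; it appears on NVSwitch-based systems with recent drivers):

systemctl status nvidia-fabricmanager
# On NVSwitch systems the fabric state should eventually read "Completed":
nvidia-smi -q | grep -A 2 -i fabric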

Environment (please provide the following information):

  • GPU Operator Version: v25.3.2
  • OS: Ubuntu 22.04 LTS
  • Kernel Version: 5.15.0-151-generic
  • Container Runtime Version:
    # k3s ctr version
    Client:
      Version:  v2.0.5-k3s2
      Revision:
      Go version: go1.24.4
    
    Server:
      Version:  v2.0.5-k3s2
      Revision:
      UUID: 8acbe27a-7e75-4ca6-98d2-e669fb7499b2
    
  • Kubernetes Distro and Version: k3s
    # k3s --version
    k3s version v1.33.3+k3s1 (236cbf25)
    go version go1.24.4
    

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

Attachment: scratch_26.txt
