Description
Hi team, I am having an issue with an RTX 4090 on Fedora 41. The setup had been working fine since it was first installed, until an embedding model running in a GPU-enabled container for document inference ran into the errors below. Fan speed was quite high, but the overall temperature did not exceed 65 °C. That temperature was only seen on this system at the time of the issue; it is normally 24 °C.
The container runs a small embedding model that embeds documents into a vector database. The same type of load runs normally on a T4 or an A10G.
No monitor attached.
root@fedora41:~# nvidia-smi
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error
root@fedora41:~# nvidia-debugdump --dumpall
ERROR: GetCaptureBufferSize failed, GPU is lost, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0xf
ERROR: internal_dumpSystemComponent() failed, return code: 0xf
ERROR: GetCaptureBufferSize failed, GPU is lost, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0xf
ERROR: internal_dumpSystemComponent() failed, return code: 0xf
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
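When `nvidia-smi` and `nvidia-debugdump` report that the GPU is lost, the kernel log often carries `NVRM: Xid` messages from the driver that help narrow down the cause. A minimal sketch for pulling them out follows. The helper name `xid_lines` is hypothetical, and the commented usage assumes a systemd journal is available, as on Fedora.

```shell
# Hypothetical helper: keep only NVIDIA driver Xid lines read from stdin.
# Returns success even when nothing matches.
xid_lines() {
    grep -F 'NVRM: Xid' || true
}

# Example use (commented out; may need root to read kernel messages):
# journalctl -k -b --no-pager | xid_lines
```

If the machine has been rebooted since the failure, `journalctl -k -b -1` reads the previous boot instead of the current one.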
/etc/modprobe.d# cat nvidia.conf
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia-drm modeset=1 fbdev=1
nvidia-bug-report.log.gz (608.3 KB)
user@fedora41:~$ nvidia-ctk cdi generate --device-name-strategy=uuid --output cdi-spec.yaml
INFO[0000] Using /usr/lib64/libnvidia-ml.so.565.77
INFO[0000] Using /usr/lib64/libnvidia-sandboxutils.so.565.77
INFO[0000] Auto-detected mode as 'nvml'
INFO[0000] Using driver version 565.77
WARN[0000] Could not locate /dev/nvidia-modeset: pattern /dev/nvidia-modeset not found
INFO[0000] Selecting /dev/nvidia-uvm-tools as /dev/nvidia-uvm-tools
INFO[0000] Selecting /dev/nvidia-uvm as /dev/nvidia-uvm
INFO[0000] Selecting /dev/nvidiactl as /dev/nvidiactl
INFO[0000] Selecting /usr/lib64/libnvidia-egl-gbm.so.1.1.2 as /usr/lib64/libnvidia-egl-gbm.so.1.1.2
INFO[0000] Selecting /usr/lib64/libnvidia-egl-wayland.so.1.1.17 as /usr/lib64/libnvidia-egl-wayland.so.1.1.17
INFO[0000] Selecting /usr/lib64/libnvidia-allocator.so.565.77 as /usr/lib64/libnvidia-allocator.so.565.77
WARN[0000] Could not locate libnvidia-vulkan-producer.so.565.77: pattern libnvidia-vulkan-producer.so.565.77 not found
libnvidia-vulkan-producer.so.565.77: not found
INFO[0000] Selecting /usr/lib64/xorg/modules/drivers/nvidia_drv.so as /usr/lib64/xorg/modules/drivers/nvidia_drv.so
INFO[0000] Selecting /usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.565.77 as /usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.565.77
INFO[0000] Selecting /usr/share/glvnd/egl_vendor.d/10_nvidia.json as /usr/share/glvnd/egl_vendor.d/10_nvidia.json
INFO[0000] Selecting /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json as /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json
INFO[0000] Selecting /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json as /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
INFO[0000] Selecting /usr/share/nvidia/nvoptix.bin as /usr/share/nvidia/nvoptix.bin
WARN[0000] Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found
INFO[0000] Selecting /usr/share/X11/xorg.conf.d/nvidia-drm-outputclass.conf as /usr/share/X11/xorg.conf.d/nvidia-drm-outputclass.conf
INFO[0000] Selecting /etc/vulkan/icd.d/nvidia_icd.json as /etc/vulkan/icd.d/nvidia_icd.json
WARN[0000] Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found
pattern vulkan/icd.d/nvidia_layers.json not found
INFO[0000] Selecting /etc/vulkan/implicit_layer.d/nvidia_layers.json as /etc/vulkan/implicit_layer.d/nvidia_layers.json
INFO[0000] Selecting /usr/lib64/libEGL_nvidia.so.565.77 as /usr/lib64/libEGL_nvidia.so.565.77
INFO[0000] Selecting /usr/lib64/libGLESv1_CM_nvidia.so.565.77 as /usr/lib64/libGLESv1_CM_nvidia.so.565.77
INFO[0000] Selecting /usr/lib64/libGLESv2_nvidia.so.565.77 as /usr/lib64/libGLESv2_nvidia.so.565.77
INFO[0000] Selecting /usr/lib64/libGLX_nvidia.so.565.77 as /usr/lib64/libGLX_nvidia.so.565.77
INFO[0000] Selecting /usr/lib64/libcuda.so.565.77 as /usr/lib64/libcuda.so.565.77
INFO[0000] Selecting /usr/lib64/libcudadebugger.so.565.77 as /usr/lib64/libcudadebugger.so.565.77
INFO[0000] Selecting /usr/lib64/libnvcuvid.so.565.77 as /usr/lib64/libnvcuvid.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-allocator.so.565.77 as /usr/lib64/libnvidia-allocator.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-cfg.so.565.77 as /usr/lib64/libnvidia-cfg.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-eglcore.so.565.77 as /usr/lib64/libnvidia-eglcore.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-encode.so.565.77 as /usr/lib64/libnvidia-encode.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-fbc.so.565.77 as /usr/lib64/libnvidia-fbc.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-glcore.so.565.77 as /usr/lib64/libnvidia-glcore.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-glsi.so.565.77 as /usr/lib64/libnvidia-glsi.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-glvkspirv.so.565.77 as /usr/lib64/libnvidia-glvkspirv.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-gpucomp.so.565.77 as /usr/lib64/libnvidia-gpucomp.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-gtk2.so.565.77 as /usr/lib64/libnvidia-gtk2.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-gtk3.so.565.77 as /usr/lib64/libnvidia-gtk3.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-ml.so.565.77 as /usr/lib64/libnvidia-ml.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-ngx.so.565.77 as /usr/lib64/libnvidia-ngx.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-nvvm.so.565.77 as /usr/lib64/libnvidia-nvvm.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-opencl.so.565.77 as /usr/lib64/libnvidia-opencl.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-opticalflow.so.565.77 as /usr/lib64/libnvidia-opticalflow.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-pkcs11-openssl3.so.565.77 as /usr/lib64/libnvidia-pkcs11-openssl3.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-pkcs11.so.565.77 as /usr/lib64/libnvidia-pkcs11.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-ptxjitcompiler.so.565.77 as /usr/lib64/libnvidia-ptxjitcompiler.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-rtcore.so.565.77 as /usr/lib64/libnvidia-rtcore.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-sandboxutils.so.565.77 as /usr/lib64/libnvidia-sandboxutils.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-tls.so.565.77 as /usr/lib64/libnvidia-tls.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-vksc-core.so.565.77 as /usr/lib64/libnvidia-vksc-core.so.565.77
INFO[0000] Selecting /usr/lib64/libnvidia-wayland-client.so.565.77 as /usr/lib64/libnvidia-wayland-client.so.565.77
INFO[0000] Selecting /usr/lib64/libnvoptix.so.565.77 as /usr/lib64/libnvoptix.so.565.77
INFO[0000] Selecting /usr/lib64/vdpau/libvdpau_nvidia.so.565.77 as /usr/lib64/vdpau/libvdpau_nvidia.so.565.77
WARN[0000] Could not locate /nvidia-persistenced/socket: pattern /nvidia-persistenced/socket not found
WARN[0000] Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found
WARN[0000] Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found
INFO[0000] Selecting /lib/firmware/nvidia/565.77/gsp_ga10x.bin as /lib/firmware/nvidia/565.77/gsp_ga10x.bin
INFO[0000] Selecting /lib/firmware/nvidia/565.77/gsp_tu10x.bin as /lib/firmware/nvidia/565.77/gsp_tu10x.bin
INFO[0000] Selecting /usr/bin/nvidia-smi as /usr/bin/nvidia-smi
INFO[0000] Selecting /usr/bin/nvidia-debugdump as /usr/bin/nvidia-debugdump
INFO[0000] Selecting /usr/bin/nvidia-persistenced as /usr/bin/nvidia-persistenced
INFO[0000] Selecting /usr/bin/nvidia-cuda-mps-control as /usr/bin/nvidia-cuda-mps-control
INFO[0000] Selecting /usr/bin/nvidia-cuda-mps-server as /usr/bin/nvidia-cuda-mps-server
INFO[0000] Generated CDI spec with version 0.8.0
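Most of the output above is informational. To review only the warnings from a run like this, the log can be filtered by level. This is a minimal sketch, assuming the log lines keep the `WARN[` prefix shown above; the helper name `warn_only` is hypothetical.

```shell
# Hypothetical helper: keep only WARN-level lines read from stdin.
# Returns success even when nothing matches.
warn_only() {
    grep -E '^WARN\[' || true
}

# Example use (commented out). 2>&1 is included on the assumption that
# the tool writes its log to stderr:
# nvidia-ctk cdi generate --device-name-strategy=uuid --output cdi-spec.yaml 2>&1 | warn_only
```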
Thanks!