Description
Problem
When deploying HAMi on EKS with Bottlerocket nodes, devicePlugin.libPath must be set to a writable path such as /var/lib/hami/vgpu because Bottlerocket has a read-only /usr/local.
vgpu-init.sh (the postStart lifecycle hook) copies every file from /k8s-vgpu/lib/nvidia/ to libPath on the host. The ld.so.preload file shipped in the image hardcodes /usr/local/vgpu/libvgpu.so regardless of the configured libPath. After every device plugin pod restart, the host ld.so.preload is therefore overwritten with the wrong path (or left wrong, if an earlier copy of the hardcoded file is already present and its MD5 matches). As a result, libvgpu.so fails to preload in every workload container.
This surfaced while investigating a Bottlerocket deployment and is related to #969 and #971.
Reproduction
- Deploy HAMi on EKS with Bottlerocket (aws-k8s-1.33-nvidia variant).
- Set devicePlugin.libPath: /var/lib/hami/vgpu (required, because /usr/local is read-only on Bottlerocket).
- After the device plugin pod starts, inspect the copied file:

      kubectl exec -n <ns> <device-plugin-pod> -c device-plugin -- \
        cat /var/lib/hami/vgpu/ld.so.preload
      # Output: /usr/local/vgpu/libvgpu.so ← wrong path

- Any workload pod requesting nvidia.com/gpumem gets:

      ERROR: ld.so: object '/usr/local/vgpu/libvgpu.so' from /etc/ld.so.preload cannot be preloaded
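
The symptom in the last two steps can also be checked with a short script. This is only a minimal sketch and not part of HAMi: the function name check_preload is hypothetical, and it assumes the preload file is a whitespace-separated list of library paths with no glob characters in them.

```sh
#!/bin/sh
# check_preload FILE: report preload entries that do not exist.
# Hypothetical helper; returns 0 when every entry resolves, 1 otherwise.
check_preload() {
    file=$1
    status=0
    # Entries are whitespace-separated, so rely on word splitting.
    # (Paths containing glob characters are not handled in this sketch.)
    for lib in $(cat "$file"); do
        if [ ! -e "$lib" ]; then
            echo "missing: $lib"
            status=1
        fi
    done
    return "$status"
}
```

Calling check_preload /etc/ld.so.preload inside an affected workload container would be expected to report /usr/local/vgpu/libvgpu.so as missing, matching the ld.so error above.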
Root cause
/k8s-vgpu/lib/nvidia/ld.so.preload in the image contains the hardcoded path /usr/local/vgpu/libvgpu.so. vgpu-init.sh copies it as-is using MD5-based diffing. The chart renders libPath correctly into env vars and volume mounts, but nothing ever writes the configured path into ld.so.preload.
Proposed fix
Add ld.so.preload as a data key in the existing device-plugin ConfigMap, rendered from {{ .Values.devicePlugin.libPath }}/libvgpu.so. Mount it into the device-plugin container at /k8s-vgpu/lib/nvidia/ld.so.preload using subPath on the existing deviceconfig volume. vgpu-init.sh's MD5-based copy logic then picks up the correct path from the ConfigMap instead of the image's hardcoded version. No new Kubernetes resources are required.
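
To illustrate why replacing the source file should be sufficient, here is a minimal sketch of an MD5-based copy step. It is not the actual vgpu-init.sh: sync_file, the scratch directories, and the LIB_PATH variable are hypothetical stand-ins, and md5sum from GNU coreutils is assumed to be available.

```sh
#!/bin/sh
# sync_file SRC DST: copy SRC over DST only when their MD5 sums differ.
# Hypothetical stand-in for the copy step in vgpu-init.sh.
sync_file() {
    src=$1
    dst=$2
    if [ -f "$dst" ] &&
       [ "$(md5sum < "$src")" = "$(md5sum < "$dst")" ]; then
        return 0   # identical content: leave the destination alone
    fi
    cp "$src" "$dst"
}

# Simulate the image directory and the host libPath in a scratch area.
work=$(mktemp -d)
mkdir -p "$work/image" "$work/host"
LIB_PATH=/var/lib/hami/vgpu   # stands in for devicePlugin.libPath

# Today: the image ships the hardcoded path, which is copied as-is.
echo /usr/local/vgpu/libvgpu.so > "$work/image/ld.so.preload"
sync_file "$work/image/ld.so.preload" "$work/host/ld.so.preload"
cat "$work/host/ld.so.preload"   # prints /usr/local/vgpu/libvgpu.so

# With the fix: the ConfigMap mount replaces the source file, so the
# same copy step propagates the configured path instead.
echo "$LIB_PATH/libvgpu.so" > "$work/image/ld.so.preload"
sync_file "$work/image/ld.so.preload" "$work/host/ld.so.preload"
cat "$work/host/ld.so.preload"   # prints /var/lib/hami/vgpu/libvgpu.so

rm -rf "$work"
```

Because the copy logic only compares source and destination content, changing what sits at the source path is enough to correct the host file on the next pod start, and the script itself needs no modification under this assumption.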