ld.so.preload hardcodes /usr/local/vgpu regardless of devicePlugin.libPath, breaks on Bottlerocket EKS #1713

@ilia-medvedev

Problem

When deploying HAMi on EKS with Bottlerocket nodes, devicePlugin.libPath must be set to a writable path such as /var/lib/hami/vgpu because Bottlerocket has a read-only /usr/local.

vgpu-init.sh (the postStart lifecycle hook) copies every file from /k8s-vgpu/lib/nvidia/ to libPath on the host. The ld.so.preload file shipped in the image hardcodes /usr/local/vgpu/libvgpu.so, whatever libPath is set to. On every device plugin pod restart, the host ld.so.preload is therefore overwritten with the wrong path, or left wrong when the existing file's MD5 already matches a previous copy of the hardcoded file. As a result, libvgpu.so fails to preload in every workload container.

This surfaced while investigating a Bottlerocket deployment and is related to #969 and #971.

Reproduction

  1. Deploy HAMi on EKS with Bottlerocket (aws-k8s-1.33-nvidia variant)
  2. Set devicePlugin.libPath: /var/lib/hami/vgpu (required — /usr/local is read-only on Bottlerocket)
  3. After device plugin pod starts:
    kubectl exec -n <ns> <device-plugin-pod> -c device-plugin -- \
      cat /var/lib/hami/vgpu/ld.so.preload
    # Output: /usr/local/vgpu/libvgpu.so   ← wrong path
    
  4. Any workload pod requesting nvidia.com/gpumem gets:
    ERROR: ld.so: object '/usr/local/vgpu/libvgpu.so' from /etc/ld.so.preload cannot be preloaded
    

Root cause

/k8s-vgpu/lib/nvidia/ld.so.preload in the image contains /usr/local/vgpu/libvgpu.so hardcoded. vgpu-init.sh copies it as-is using MD5-based diffing. The chart renders libPath correctly into env vars and volume mounts but never writes the correct path into ld.so.preload.
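To illustrate why an MD5-based copy cannot correct the path, here is a minimal sketch of that kind of sync. It is not the actual vgpu-init.sh: the function name `sync_file` and the temporary directories are hypothetical, and the real script may differ in detail. The point it shows is that the comparison only decides whether to copy, so the destination always ends up with the image's content.

```shell
#!/bin/sh
# Hypothetical sketch of an MD5-based sync; NOT the actual vgpu-init.sh.
# sync_file SRC DST: copy SRC over DST unless their MD5 sums already match.
sync_file() {
    src="$1"
    dst="$2"
    if [ -f "$dst" ]; then
        src_md5=$(md5sum "$src" | cut -d' ' -f1)
        dst_md5=$(md5sum "$dst" | cut -d' ' -f1)
        if [ "$src_md5" = "$dst_md5" ]; then
            echo "unchanged: $dst"
            return 0
        fi
    fi
    cp "$src" "$dst"
    echo "copied: $dst"
}

# Demo with throwaway directories standing in for the image and for libPath.
work=$(mktemp -d)
mkdir -p "$work/image" "$work/libpath"
echo "/usr/local/vgpu/libvgpu.so" > "$work/image/ld.so.preload"

sync_file "$work/image/ld.so.preload" "$work/libpath/ld.so.preload"  # first start
sync_file "$work/image/ld.so.preload" "$work/libpath/ld.so.preload"  # pod restart

# Still the hardcoded path, although the file now lives under a different directory.
cat "$work/libpath/ld.so.preload"
```

Both branches leave the hardcoded path in place: the first run copies it, and later runs see matching checksums and skip the copy. Nothing in this flow rewrites the file's content to match libPath.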

Proposed fix

Add ld.so.preload as a data key in the existing device-plugin ConfigMap, rendered from {{ .Values.devicePlugin.libPath }}/libvgpu.so. Mount it into the device-plugin container at /k8s-vgpu/lib/nvidia/ld.so.preload using subPath on the existing deviceconfig volume. vgpu-init.sh's MD5-based copy logic then picks up the correct path from the ConfigMap instead of the image's hardcoded version. No new Kubernetes resources are required.
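One way to confirm the fix took effect would be to compare the content of the preload file against the configured libPath. The sketch below assumes the file holds a single entry of the form `<libPath>/libvgpu.so`; the function name `check_preload` is hypothetical and not part of HAMi.

```shell
#!/bin/sh
# Hypothetical helper, not part of HAMi.
# check_preload LIB_PATH [PRELOAD_FILE]
# Succeeds only if the preload file's content is exactly LIB_PATH/libvgpu.so.
# Assumes the file holds a single entry.
check_preload() {
    lib_path="$1"
    preload_file="${2:-$lib_path/ld.so.preload}"
    expected="$lib_path/libvgpu.so"
    # Return 2 if the file cannot be read at all.
    actual=$(cat "$preload_file") || return 2
    if [ "$actual" = "$expected" ]; then
        echo "ok: $preload_file -> $actual"
    else
        echo "mismatch: $preload_file has '$actual', expected '$expected'"
        return 1
    fi
}
```

Run wherever libPath is visible, for example `check_preload /var/lib/hami/vgpu`. With the image's current file this would report a mismatch against /usr/local/vgpu/libvgpu.so; with the ConfigMap-rendered file it should report ok.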
