feat: Add AMD Instinct GPU isolation (GPU memory limiting + compute unit (CU) partitioning) #1707

@kenji-mido

Description

What would you like to be added:
AMD Instinct GPU isolation support — per-pod memory limiting and compute unit (CU) partitioning for AMD ROCm GPUs.

What type of PR is this?

/kind feature

After discussion, I will prepare the code contribution.

What this PR does / why we need it:

I have a working prototype tested on AMD Instinct MI300X (192GB HBM3e) across ROCm 6.2, 7.0, 7.1, and 7.2. I'd like to discuss the approach before opening PRs.

HAMi-core (libvgpu): libamvgpu.so

  • LD_AUDIT library intercepting HIP API calls
  • hipMalloc/hipFree tracking with per-pod memory limits
  • hipMemGetInfo virtualization (reports HAMi limit, not physical)
  • Shared-memory region for pod-level cumulative usage tracking across multiple processes in a pod
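The pod-level accounting could be sketched roughly as follows; this is a minimal illustration assuming a POSIX shared-memory region and atomic counters, and names such as `pod_try_reserve` are placeholders, not actual HAMi-core symbols:

```c
/*
 * Sketch: pod-level cumulative GPU-memory accounting shared across all
 * processes in a pod via a POSIX shared-memory region.  Names such as
 * pod_try_reserve are illustrative, not actual HAMi-core symbols.
 */
#include <fcntl.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    _Atomic uint64_t used_bytes;   /* cumulative usage across processes */
    uint64_t limit_bytes;          /* per-pod limit from amd.com/gpumem */
} pod_mem_region;

static pod_mem_region *region;

/* Map (or create) the per-pod region; every container process maps the
 * same name, so the counter is shared pod-wide. */
static int pod_region_init(const char *shm_name, uint64_t limit_bytes)
{
    int fd = shm_open(shm_name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, sizeof(pod_mem_region)) < 0) {
        close(fd);
        return -1;
    }
    region = mmap(NULL, sizeof(pod_mem_region), PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
    close(fd);
    if (region == MAP_FAILED)
        return -1;
    region->limit_bytes = limit_bytes;
    return 0;
}

/* Called from the hipMalloc wrapper: reserve bytes, fail if the pod
 * would exceed its limit. */
static bool pod_try_reserve(uint64_t bytes)
{
    uint64_t used = atomic_fetch_add(&region->used_bytes, bytes);
    if (used + bytes > region->limit_bytes) {
        atomic_fetch_sub(&region->used_bytes, bytes);  /* roll back */
        return false;  /* wrapper reports an out-of-memory error */
    }
    return true;
}

/* Called from the hipFree wrapper. */
static void pod_release(uint64_t bytes)
{
    atomic_fetch_sub(&region->used_bytes, bytes);
}
```

In this sketch, each intercepted hipMalloc would call `pod_try_reserve` before forwarding to the real allocator, and the wrapper would return an out-of-memory error when the reservation fails.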

HAMi (device-plugin + scheduler)

  • AMD device plugin: amd.com/gpumem (MB) + amd.com/gpucores (CU count)
  • CU bitmap scheduler for exclusive CU partitioning across pods
  • ROC_GLOBAL_CU_MASK and LD_AUDIT injection into pods
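As an illustration of the exclusive-CU bookkeeping, here is a minimal sketch that marks a CU range in a per-device bitmap and renders it as a hex mask string; the exact string format `ROC_GLOBAL_CU_MASK` consumes, and all function names here, are assumptions for discussion, not the actual HAMi scheduler code:

```c
/*
 * Sketch: carve an exclusive CU range into a hex mask string of the
 * kind ROC_GLOBAL_CU_MASK could consume.  The word layout and names
 * are illustrative assumptions, not the actual HAMi implementation.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_CUS    304                     /* AMD Instinct MI300X CU count */
#define MASK_WORDS ((MAX_CUS + 31) / 32)

/* Mark CUs [first, first+count) as owned by one pod. */
static void cu_mask_set(uint32_t words[MASK_WORDS], int first, int count)
{
    for (int cu = first; cu < first + count && cu < MAX_CUS; cu++)
        words[cu / 32] |= 1u << (cu % 32);
}

/* Render the bitmap as a hex string, most-significant word first,
 * skipping leading all-zero words (e.g. CUs 0-31 -> "0xffffffff"). */
static void cu_mask_format(const uint32_t words[MASK_WORDS],
                           char *out, size_t outlen)
{
    int hi = MASK_WORDS - 1;
    while (hi > 0 && words[hi] == 0)
        hi--;
    size_t n = (size_t)snprintf(out, outlen, "0x%x", words[hi]);
    for (int w = hi - 1; w >= 0; w--)
        n += (size_t)snprintf(out + n, outlen - n, "%08x", words[w]);
}
```

Because the scheduler hands each pod a disjoint bit range, two pods on the same device never share a CU, which is what makes the partitioning exclusive.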

Why LD_AUDIT instead of LD_PRELOAD

ROCm 7.x HIP's internal symbol resolution breaks with LD_PRELOAD due to recursive interception of HIP-internal calls. LD_AUDIT's la_symbind64 intercepts only cross-library bindings, avoiding this issue. The existing NVIDIA LD_PRELOAD path is unchanged.
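A minimal sketch of what the rtld-audit hooks could look like (see rtld-audit(7)); the `wrapped_*` replacements are illustrative stand-ins, not the actual libamvgpu.so symbols:

```c
/*
 * Sketch of libamvgpu.so's rtld-audit hooks.  la_symbind64 fires only
 * when a symbol binding crosses object boundaries, so calls internal to
 * libamdhip64.so keep resolving to the real implementations and are
 * never re-entered -- the failure mode seen with LD_PRELOAD.
 * The wrapped_* functions are illustrative stand-ins.
 */
#define _GNU_SOURCE
#include <link.h>
#include <stdint.h>
#include <string.h>

/* Illustrative replacements: the real ones would account allocations
 * against the pod limit and report the HAMi limit from hipMemGetInfo. */
static int wrapped_hipMalloc(void **ptr, unsigned long size)
{
    (void)ptr; (void)size;
    return 0;
}
static int wrapped_hipMemGetInfo(unsigned long *free_b, unsigned long *total_b)
{
    (void)free_b; (void)total_b;
    return 0;
}

/* Version handshake required by the dynamic linker. */
unsigned int la_version(unsigned int version)
{
    (void)version;
    return LAV_CURRENT;
}

/* Ask the linker to audit bindings to and from every loaded object. */
unsigned int la_objopen(struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
{
    (void)map; (void)lmid; (void)cookie;
    return LA_FLG_BINDTO | LA_FLG_BINDFROM;
}

/* Redirect only the HIP entry points we virtualize; everything else
 * binds to its real definition untouched. */
uintptr_t la_symbind64(Elf64_Sym *sym, unsigned int ndx,
                       uintptr_t *refcook, uintptr_t *defcook,
                       unsigned int *flags, const char *symname)
{
    (void)ndx; (void)refcook; (void)defcook; (void)flags;
    if (strcmp(symname, "hipMalloc") == 0)
        return (uintptr_t)wrapped_hipMalloc;
    if (strcmp(symname, "hipMemGetInfo") == 0)
        return (uintptr_t)wrapped_hipMemGetInfo;
    return sym->st_value;   /* default: the real symbol address */
}
```

Built as a shared library, this would be activated per-pod via LD_AUDIT=/path/to/libamvgpu.so rather than LD_PRELOAD.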

Test results

Tested on AMD Developer Cloud with several inference servers: vLLM, Ollama, TGI, llama.cpp, and SGLang.

Special notes for your reviewer:

Discussion points

  1. LD_AUDIT vs LD_PRELOAD: Is a separate LD_AUDIT code path acceptable for AMD, or should we try to unify with the existing LD_PRELOAD approach? I first tried the LD_PRELOAD approach, but moved to the current implementation after repeated errors.
  2. PR structure: Should I submit HAMi-core and HAMi changes together, or as separate PRs?
  3. amd-smi / rocm-smi: These tools use sysfs/drm (not HIP) and cannot be virtualized via LD_AUDIT.

Does this PR introduce a user-facing change?:

New resource types: amd.com/gpumem (MB), amd.com/gpucores (CU count)
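For illustration, a pod requesting these resources might look like the following; the pod name, image, and request values are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rocm-workload           # placeholder name
spec:
  containers:
  - name: app
    image: rocm/vllm:latest     # placeholder image
    resources:
      limits:
        amd.com/gpumem: "16384"   # GPU memory limit in MB
        amd.com/gpucores: "76"    # exclusive CUs (MI300X has 304 total)
```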
