Description
What would you like to be added:
AMD Instinct GPU isolation support — per-pod memory limiting and compute unit (CU) partitioning for AMD ROCm GPUs.
What type of PR is this?
/kind feature
After discussion, I will prepare the code contribution.
What this PR does / why we need it:
I have a working prototype tested on AMD Instinct MI300X (192GB HBM3e) across ROCm 6.2, 7.0, 7.1, and 7.2. I'd like to discuss the approach before opening PRs.
HAMi-core (libvgpu): libamvgpu.so
- LD_AUDIT library intercepting HIP API calls
- hipMalloc/hipFree tracking with per-pod memory limits
- hipMemGetInfo virtualization (reports the HAMi limit, not physical memory)
- Shared memory region for pod-level cumulative usage tracking across multiple processes in a pod
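A minimal sketch of the pod-level accounting idea described above, showing only the shared counter logic: the shm_open/mmap plumbing and the forwarding to the real hipMalloc/hipFree are omitted, and all names (pod_mem_region, try_reserve, release) are illustrative, not the actual HAMi-core API.

```c
#include <stdatomic.h>
#include <stdint.h>

/* All processes in a pod would map the same shared-memory region
 * containing this struct, so the counter is cumulative per pod. */
typedef struct {
    atomic_uint_least64_t used;   /* bytes allocated by all pod processes */
    uint64_t limit;               /* pod limit derived from amd.com/gpumem */
} pod_mem_region;

/* Returns 1 and reserves `bytes` if the allocation fits under the pod
 * limit; returns 0 otherwise (the hipMalloc wrapper would then report
 * an out-of-memory error instead of calling the real allocator). */
static int try_reserve(pod_mem_region *r, uint64_t bytes) {
    uint64_t cur = atomic_load(&r->used);
    for (;;) {
        if (cur + bytes > r->limit)
            return 0;
        /* CAS loop so concurrent processes never overshoot the limit. */
        if (atomic_compare_exchange_weak(&r->used, &cur, cur + bytes))
            return 1;
    }
}

/* Called from the hipFree wrapper when a tracked allocation is released. */
static void release(pod_mem_region *r, uint64_t bytes) {
    atomic_fetch_sub(&r->used, bytes);
}
```

The atomic compare-and-swap is what makes the counter safe when several processes in the same pod allocate concurrently.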
HAMi (device-plugin + scheduler)
- AMD device plugin: amd.com/gpumem (MB) + amd.com/gpucores (CU count)
- CU bitmap scheduler for exclusive CU partitioning across pods
- ROC_GLOBAL_CU_MASK and LD_AUDIT injection into pods
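To illustrate the CU bitmap idea, here is a hypothetical sketch of how a device plugin could turn an amd.com/gpucores request into a mask value: a contiguous bitmap of `count` CUs starting at `first`, rendered as a hex string. The exclusive-partitioning scheduler would choose `first` so pods do not overlap. The function name and the assumption that ROC_GLOBAL_CU_MASK accepts a `0x...` hex string are mine, not confirmed by this proposal.

```c
#include <stdio.h>
#include <stdint.h>

/* Writes a hex CU mask (e.g. "0xff" for first=0, count=8) into buf.
 * Supports up to 1024 CUs, comfortably above the MI300X's CU count. */
static void cu_mask_hex(unsigned first, unsigned count,
                        char *buf, size_t buflen) {
    unsigned total = first + count;
    unsigned words = (total + 31) / 32;      /* 32 CU bits per word */
    uint32_t mask[32] = {0};
    for (unsigned cu = first; cu < total; cu++)
        mask[cu / 32] |= 1u << (cu % 32);
    size_t off = (size_t)snprintf(buf, buflen, "0x");
    int started = 0;
    for (int w = (int)words - 1; w >= 0; w--) {  /* MSB word first */
        off += (size_t)snprintf(buf + off, buflen - off,
                                started ? "%08x" : "%x", mask[w]);
        started = 1;
    }
}
```

With this shape, two pods on the same GPU could receive disjoint masks (say, CUs 0-7 and CUs 8-15), which is the exclusive partitioning the scheduler enforces.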
Why LD_AUDIT instead of LD_PRELOAD
In ROCm 7.x, HIP's internal symbol resolution breaks under LD_PRELOAD because HIP-internal calls are intercepted recursively. LD_AUDIT's la_symbind64 hooks only cross-library symbol bindings, which avoids the problem. The existing NVIDIA LD_PRELOAD path is unchanged.
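A minimal sketch of the LD_AUDIT mechanism, using glibc's standard rtld-audit interface (la_version and la_symbind64 are the real hook names; the hipMalloc_wrapper stub is illustrative). Built as a shared object and activated via LD_AUDIT=/path/to/libamvgpu.so, la_symbind64 fires only when one object binds a symbol defined in another, so HIP's calls to its own internal functions are never redirected, unlike with LD_PRELOAD.

```c
#define _GNU_SOURCE
#include <link.h>
#include <string.h>

/* Illustrative stand-in for the real accounting wrapper (not the HAMi code). */
static void *hipMalloc_wrapper(void) { return 0; }

/* Required version handshake for the rtld-audit interface. */
unsigned int la_version(unsigned int version) {
    (void)version;
    return LAV_CURRENT;
}

/* Called once per cross-library symbol binding; returning a different
 * address redirects the binding to our wrapper. */
uintptr_t la_symbind64(Elf64_Sym *sym, unsigned int ndx,
                       uintptr_t *refcook, uintptr_t *defcook,
                       unsigned int *flags, const char *symname) {
    (void)ndx; (void)refcook; (void)defcook; (void)flags;
    if (strcmp(symname, "hipMalloc") == 0)
        return (uintptr_t)hipMalloc_wrapper;
    return sym->st_value;   /* everything else binds normally */
}
```

Because the auditor sees each binding exactly once at resolution time, there is no recursive interception of the kind that broke LD_PRELOAD on ROCm 7.x.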
Test results
Tested on the AMD Developer Cloud with several inference servers: vLLM, Ollama, TGI, llama.cpp, and SGLang.
Special notes for your reviewer:
Discussion points
- LD_AUDIT vs LD_PRELOAD: Is a separate LD_AUDIT code path acceptable for AMD, or should we try to unify with the existing LD_PRELOAD approach? I first tried the LD_PRELOAD approach but moved to the current implementation after repeated errors.
- PR structure: Should I submit HAMi-core and HAMi changes together, or as separate PRs?
- amd-smi / rocm-smi: These tools use sysfs/drm (not HIP) and cannot be virtualized via LD_AUDIT.
Does this PR introduce a user-facing change?:
New resource types: amd.com/gpumem (MB), amd.com/gpucores (CU count)