feat: Add AMD Instinct GPU isolation (GPU memory limiting + compute unit (CU) partitioning) #1707

@kenji-mido

Description

What would you like to be added:
AMD Instinct GPU isolation support — per-pod memory limiting and compute unit (CU) partitioning for AMD ROCm GPUs.

What type of PR is this?

/kind feature

After discussion, I will prepare the code contribution.

What this PR does / why we need it:

I have a working prototype tested on AMD Instinct MI300X (192GB HBM3e) across ROCm 6.2, 7.0, 7.1, and 7.2. I'd like to discuss the approach before opening PRs.

HAMi-core (libvgpu): libamvgpu.so

  • LD_AUDIT library intercepting HIP API calls
  • hipMalloc/hipFree tracking with per-pod memory limits
  • hipMemGetInfo virtualization (reports HAMi limit, not physical)
  • Shared-memory region for pod-level cumulative usage tracking across multiple processes in a pod
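The pod-level accounting could be sketched roughly as follows; this is a minimal illustration assuming a POSIX shared-memory region and atomic counters, and names such as `pod_try_reserve` are placeholders, not actual HAMi-core symbols:

```c
/*
 * Sketch: pod-level cumulative GPU-memory accounting shared across all
 * processes in a pod via a POSIX shared-memory region.  Names such as
 * pod_try_reserve are illustrative, not actual HAMi-core symbols.
 */
#include <fcntl.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    _Atomic uint64_t used_bytes;   /* cumulative usage across processes */
    uint64_t limit_bytes;          /* per-pod limit from amd.com/gpumem */
} pod_mem_region;

static pod_mem_region *region;

/* Map (or create) the per-pod region; every container process maps the
 * same name, so the counter is shared pod-wide. */
static int pod_region_init(const char *shm_name, uint64_t limit_bytes)
{
    int fd = shm_open(shm_name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, sizeof(pod_mem_region)) < 0) {
        close(fd);
        return -1;
    }
    region = mmap(NULL, sizeof(pod_mem_region), PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
    close(fd);
    if (region == MAP_FAILED)
        return -1;
    region->limit_bytes = limit_bytes;
    return 0;
}

/* Called from the hipMalloc wrapper: reserve bytes, fail if the pod
 * would exceed its limit. */
static bool pod_try_reserve(uint64_t bytes)
{
    uint64_t used = atomic_fetch_add(&region->used_bytes, bytes);
    if (used + bytes > region->limit_bytes) {
        atomic_fetch_sub(&region->used_bytes, bytes);  /* roll back */
        return false;  /* wrapper reports an out-of-memory error */
    }
    return true;
}

/* Called from the hipFree wrapper. */
static void pod_release(uint64_t bytes)
{
    atomic_fetch_sub(&region->used_bytes, bytes);
}
```

In this sketch, each intercepted hipMalloc would call `pod_try_reserve` before forwarding to the real allocator, and the wrapper would return an out-of-memory error when the reservation fails.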

HAMi (device-plugin + scheduler)

  • AMD device plugin: amd.com/gpumem (MB) + amd.com/gpucores (CU count)
  • CU bitmap scheduler for exclusive CU partitioning across pods
  • ROC_GLOBAL_CU_MASK and LD_AUDIT injection into pods
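As an illustration of the exclusive-CU bookkeeping, here is a minimal sketch that marks a CU range in a per-device bitmap and renders it as a hex mask string; the exact string format `ROC_GLOBAL_CU_MASK` consumes, and all function names here, are assumptions for discussion, not the actual HAMi scheduler code:

```c
/*
 * Sketch: carve an exclusive CU range into a hex mask string of the
 * kind ROC_GLOBAL_CU_MASK could consume.  The word layout and names
 * are illustrative assumptions, not the actual HAMi implementation.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_CUS    304                     /* AMD Instinct MI300X CU count */
#define MASK_WORDS ((MAX_CUS + 31) / 32)

/* Mark CUs [first, first+count) as owned by one pod. */
static void cu_mask_set(uint32_t words[MASK_WORDS], int first, int count)
{
    for (int cu = first; cu < first + count && cu < MAX_CUS; cu++)
        words[cu / 32] |= 1u << (cu % 32);
}

/* Render the bitmap as a hex string, most-significant word first,
 * skipping leading all-zero words (e.g. CUs 0-31 -> "0xffffffff"). */
static void cu_mask_format(const uint32_t words[MASK_WORDS],
                           char *out, size_t outlen)
{
    int hi = MASK_WORDS - 1;
    while (hi > 0 && words[hi] == 0)
        hi--;
    size_t n = (size_t)snprintf(out, outlen, "0x%x", words[hi]);
    for (int w = hi - 1; w >= 0; w--)
        n += (size_t)snprintf(out + n, outlen - n, "%08x", words[w]);
}
```

Because the scheduler hands each pod a disjoint bit range, two pods on the same device never share a CU, which is what makes the partitioning exclusive.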

Why LD_AUDIT instead of LD_PRELOAD

ROCm 7.x HIP's internal symbol resolution breaks with LD_PRELOAD due to recursive interception of HIP-internal calls. LD_AUDIT's la_symbind64 intercepts only cross-library bindings, avoiding this issue. The existing NVIDIA LD_PRELOAD path is unchanged.
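A minimal sketch of what the rtld-audit hooks could look like (see rtld-audit(7)); the `wrapped_*` replacements are illustrative stand-ins, not the actual libamvgpu.so symbols:

```c
/*
 * Sketch of libamvgpu.so's rtld-audit hooks.  la_symbind64 fires only
 * when a symbol binding crosses object boundaries, so calls internal to
 * libamdhip64.so keep resolving to the real implementations and are
 * never re-entered -- the failure mode seen with LD_PRELOAD.
 * The wrapped_* functions are illustrative stand-ins.
 */
#define _GNU_SOURCE
#include <link.h>
#include <stdint.h>
#include <string.h>

/* Illustrative replacements: the real ones would account allocations
 * against the pod limit and report the HAMi limit from hipMemGetInfo. */
static int wrapped_hipMalloc(void **ptr, unsigned long size)
{
    (void)ptr; (void)size;
    return 0;
}
static int wrapped_hipMemGetInfo(unsigned long *free_b, unsigned long *total_b)
{
    (void)free_b; (void)total_b;
    return 0;
}

/* Version handshake required by the dynamic linker. */
unsigned int la_version(unsigned int version)
{
    (void)version;
    return LAV_CURRENT;
}

/* Ask the linker to audit bindings to and from every loaded object. */
unsigned int la_objopen(struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
{
    (void)map; (void)lmid; (void)cookie;
    return LA_FLG_BINDTO | LA_FLG_BINDFROM;
}

/* Redirect only the HIP entry points we virtualize; everything else
 * binds to its real definition untouched. */
uintptr_t la_symbind64(Elf64_Sym *sym, unsigned int ndx,
                       uintptr_t *refcook, uintptr_t *defcook,
                       unsigned int *flags, const char *symname)
{
    (void)ndx; (void)refcook; (void)defcook; (void)flags;
    if (strcmp(symname, "hipMalloc") == 0)
        return (uintptr_t)wrapped_hipMalloc;
    if (strcmp(symname, "hipMemGetInfo") == 0)
        return (uintptr_t)wrapped_hipMemGetInfo;
    return sym->st_value;   /* default: the real symbol address */
}
```

Built as a shared library, this would be activated per-pod via LD_AUDIT=/path/to/libamvgpu.so rather than LD_PRELOAD.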

Test results

Tested on AMD Developer Cloud with several inference servers: vLLM, Ollama, TGI, llama.cpp, and SGLang.

Special notes for your reviewer:

Discussion points

  1. LD_AUDIT vs LD_PRELOAD: Is a separate LD_AUDIT code path acceptable for AMD, or should we try to unify with the existing LD_PRELOAD approach? I first tried the LD_PRELOAD approach, but moved to the current implementation after repeated errors.
  2. PR structure: Should I submit HAMi-core and HAMi changes together, or as separate PRs?
  3. amd-smi / rocm-smi: These tools use sysfs/drm (not HIP) and cannot be virtualized via LD_AUDIT.

Does this PR introduce a user-facing change?:

New resource types: amd.com/gpumem (MB), amd.com/gpucores (CU count)
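For illustration, a pod requesting these resources might look like the following; the pod name, image, and request values are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rocm-workload           # placeholder name
spec:
  containers:
  - name: app
    image: rocm/vllm:latest     # placeholder image
    resources:
      limits:
        amd.com/gpumem: "16384"   # GPU memory limit in MB
        amd.com/gpucores: "76"    # exclusive CUs (MI300X has 304 total)
```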
