
Merge pr155 some important commits #163

Merged

hami-robot[bot] merged 6 commits into Project-HAMi:main from maverick123123:testpr155 on Mar 25, 2026
Conversation

@maverick123123 (Contributor):


Commit 1: Add seqlock for precise memory accounting (c2ae4aa)

Problem: The existing lock-free atomic reads used memory_order_relaxed for reading per-process memory counters. While individual atomic loads are safe, a reader could see a torn snapshot — e.g., reading total
updated by one allocation but context_size not yet updated — leading to inconsistent memory accounting and potentially incorrect OOM decisions.

Solution: Introduces a per-process sequence lock (seqlock) — an _Atomic uint64_t seqlock field added to shrreg_proc_slot_t in the header.

  • Writer side (add_gpu_device_memory_usage, rm_gpu_device_memory_usage): Before updating counters, atomically increments seqlock to an odd value (signaling "write in progress"), performs all counter updates with
    memory_order_release, then increments seqlock to even (signaling "write complete"). This applies to both the fast path (cached my_slot pointer for own PID) and the slow path (linear scan for other PIDs).
  • Reader side (get_gpu_memory_usage): For each process slot, reads seq1, reads the memory counter, inserts an acquire fence, reads seq2. If seq1 != seq2 (write happened during read), retries. If seq1 is odd
    (writer active), spins with exponential backoff: first 5 retries use CPU pause instructions (pause on x86, yield on ARM), then 1μs delays, then 10μs, then 100μs after 100 retries with debug logging.
  • Header change: Added _Atomic uint64_t seqlock to shrreg_proc_slot_t, reduced unused[3] to unused[2] to maintain struct size/alignment. Seqlock is initialized to 0 (even = no write) in init_proc_slot_withlock.

Commit 2: Critical fix — Stale semaphore timeouts causing 300s hangs (fb4f906)

Problem: Processes would hang for 300+ seconds waiting for lock_shrreg(). Root cause: get_timespec() was called once before the while loop. After the first sem_timedwait() timeout, the timestamp was already in
the past, so every subsequent call returned ETIMEDOUT immediately — the process burned through all 30 retries in milliseconds with no actual waiting, then force-posted the semaphore. sem_post() on an
already-unlocked semaphore (value=1) made it value=2, allowing two processes to enter the critical section simultaneously, corrupting delta detection in set_task_pid().

Solution:

  1. Moved get_timespec() inside the while loop in both lock_shrreg() and the new lock_postinit() — each iteration gets a fresh absolute timeout.
  2. New lock_postinit() / unlock_postinit() functions using a dedicated sem_postinit semaphore for serializing host PID detection during postInit(). Uses longer timeouts: SEM_WAIT_TIME_POSTINIT=30s per wait,
    SEM_WAIT_RETRY_TIMES_POSTINIT=10 retries (total 300s max), since set_task_pid() with adaptive NVML polling can take several seconds.
  3. Graceful timeout handling: lock_postinit() returns 1 on success, 0 on timeout. The caller in libvgpu.c:postInit() checks the return value and skips host PID detection on timeout instead of force-posting the
    semaphore. This prevents semaphore corruption entirely.
  4. Header/init changes: Added sem_t sem_postinit to shared_region_t, initialized via sem_init() in try_create_shrreg().

Commit 3: Make exit cleanup foolproof — atomic operations only, no semaphore needed (46399f2)

Problem: The old exit_handler() tried to acquire region->sem (with a 5s timeout) to clean up its process slot. When 8 processes exited simultaneously, semaphore contention was high — some processes timed out,
leaving stale owner_pid values and locked semaphores. The next run would see a dead owner and get stuck. Exit cleanup that can fail defeats its purpose.

Solution — complete redesign of exit_handler():

  1. No semaphore acquisition at all. Instead:
    - Atomically checks if this process holds owner_pid using atomic_load. If yes, uses CAS (atomic_compare_exchange_strong) to clear it to 0, then posts the semaphore to unlock it. CAS ensures only the actual
    owner clears it.
    - Atomically sets its process slot's PID to 0 (atomic_store_explicit with memory_order_release) and status to 0, marking the slot as dead. This is a simple, contention-free operation that cannot fail.
  2. Lazy slot cleanup: Dead slots (PID=0) are not physically removed during exit. Instead, clear_proc_slot_nolock() was enhanced to detect and remove PID=0 slots (from exit cleanup) in addition to checking
    proc_alive() for truly dead processes. This runs when the next process acquires the lock during initialization. Added a limit of 10 proc_alive() checks per call to avoid holding the lock too long.
  3. SIGKILL recovery in lock_shrreg(): SIGKILL is the only signal that bypasses the exit handler. On ETIMEDOUT, the code now uses atomic_load to read owner_pid, checks proc_alive(), and if the owner is dead, uses
    CAS to atomically clear owner_pid and post the semaphore. Only one process performs recovery thanks to CAS.
  4. Hard deadlock protection: After 30 retries (5 minutes), lock_shrreg() logs an error with diagnostic info (owner alive/dead) and exits with an actionable message: "Delete /tmp/cudevshr.cache and restart all
    processes."

Guarantees: Normal exit, signal exit (SIGTERM/SIGINT/etc.), and SIGKILL all result in clean state — either immediately or on the next process's lock acquisition.


Commit 4: Optimize seqlock and utilization watcher to prevent random 256MB allocation slowdowns (c67cbbe)

Problem: Random 20x performance variance observed (12.734ms vs 0.586ms for 256MB allocations) when all 8 processes allocate simultaneously. Two root causes:

  1. Seqlock retry storm: When 8 processes all write to their slots, readers see active writers (odd seqlock) and spin in a tight loop, causing CPU contention that slows down the writers further (a feedback loop).
  2. Utilization watcher contention: The get_used_gpu_utilization() function in multiprocess_utilization_watcher.c held lock_shrreg() while making slow NVML queries (nvmlDeviceGetComputeRunningProcesses,
    nvmlDeviceGetProcessUtilization), blocking all shared memory operations for the entire duration.

Solution:

  1. Seqlock exponential backoff (already described in Commit 1 — the backoff was added/refined in this commit): Removed the old "fallback to stale data after 100 retries" approach since memory checks require
    accurate data. Replaced with progressive delays: CPU pause instructions → 1μs → 10μs → 100μs. This reduces CPU contention while ensuring accurate reads.
  2. Utilization watcher lock scope reduction: Restructured get_used_gpu_utilization() to perform NVML queries outside the lock. For each device:
    - Call nvmlDeviceGetComputeRunningProcesses() and nvmlDeviceGetProcessUtilization() without holding the lock
    - Then acquire lock_shrreg() briefly only to update shared memory with the results
    - Unlock immediately after updates

This changes the lock scope from "entire loop over all devices including NVML calls" to "brief update of shared memory per device", reducing lock hold time from milliseconds to microseconds.


Commit 5: Merge commit (2175e14)

Merge commit consolidating all changes into the testpr155 branch.


Architecture Summary

The PR transforms HAMi-core's multi-process GPU memory management from a design with several race conditions and failure modes into a robust system with:

  • Seqlocks for consistent multi-field reads without blocking writers
  • Fresh timeouts on every semaphore wait iteration (fixing the root cause of 300s hangs)
  • Atomic-only exit cleanup that cannot fail or cause contention
  • CAS-based SIGKILL recovery so only one process handles dead-owner detection
  • Dedicated sem_postinit semaphore for host PID detection serialization
  • Minimized lock hold times by moving NVML queries outside critical sections

@hami-robot hami-robot bot added the size/L label Mar 24, 2026
@maverick123123 force-pushed the testpr155 branch 4 times, most recently from 81831a5 to d157690 on March 25, 2026 02:12
Nishit Shah and others added 4 commits March 25, 2026 10:23
- Add per-process seqlock counter for consistent snapshots
- Writers: increment seqlock (odd), update counters, increment (even)
- Readers: retry read if seqlock changes or is odd
- Use memory_order_release for writes, acquire for reads
- Guarantees: No partial reads, no stale aggregations
- Fallback to best-effort after 100 retries (prevents livelock)
- Adds CPU pause/yield instructions for efficient spinning

This ensures OOM protection works correctly even under heavy
concurrent memory allocation/deallocation workloads.

Signed-off-by: Nishit Shah <nish511@gmail.com>
Signed-off-by: Nishit Shah <nishshah@linkedin.com>
This fixes the critical bug where processes waited 300+ seconds for locks.

Root causes identified:
1. get_timespec() called ONCE before while loop, creating stale timestamp
2. After first timeout, all subsequent sem_timedwait() immediately timeout
3. Force-posting semaphore corrupted state, allowing multiple processes in

Fixes applied:

1. STALE TIMEOUT FIX (both lock_shrreg and lock_postinit):
   - Move get_timespec() INSIDE the while loop
   - Each iteration gets fresh 10s or 30s timeout
   - Prevents cascading immediate timeouts

2. LONGER TIMEOUT FOR POSTINIT:
   - SEM_WAIT_TIME_POSTINIT = 30s (vs 10s)
   - SEM_WAIT_RETRY_TIMES_POSTINIT = 10 (vs 30)
   - Total still 300s max, but longer per-wait since set_task_pid() can take time

3. GRACEFUL TIMEOUT (no force-post):
   - lock_postinit() returns 1 on success, 0 on timeout
   - Caller checks return value, only unlocks if lock was acquired
   - On timeout, skip host PID detection gracefully
   - Prevents semaphore corruption from force-posting

Why force-post is bad:
- sem_post() increments semaphore value
- If called when value is already 1 (unlocked), makes it 2
- Allows 2 processes to acquire simultaneously
- Breaks delta detection in set_task_pid()

Expected behavior after fix:
- Processes wait up to 30s per retry (plenty of time for set_task_pid)
- Timeouts create fresh timestamps each iteration
- If true deadlock (300s total), gracefully skip detection
- No semaphore corruption

Signed-off-by: Nishit Shah <nishshah@linkedin.com>
This completely redesigns exit cleanup to guarantee that previous runs ALWAYS
leave clean state, with no failure modes.

Problem with previous approach:
- Exit handler tried to acquire semaphore to clean up process slots
- With 8 processes exiting simultaneously, contention was high
- Some processes timed out and failed to clean up
- Left stale owner_pid and locked semaphores
- Next run would see "owner=7388" and get stuck

Root cause: Exit cleanup that CAN FAIL defeats the whole purpose!

New foolproof approach:

1. EXIT HANDLER (NO SEMAPHORE NEEDED):
   - Check if we're holding owner_pid atomically
   - If yes: CAS to clear it, post semaphore
   - Mark our process slot PID as 0 atomically
   - That's it! No semaphore acquisition, no contention, CANNOT FAIL

2. LAZY SLOT CLEANUP:
   - Dead slots (PID=0) are cleaned up by clear_proc_slot_nolock()
   - Called by init_proc_slot_withlock() when next process starts
   - Physical removal happens later, but slot is marked dead immediately

3. SIGKILL RECOVERY:
   - SIGKILL is the ONLY case where exit handler doesn't run
   - On lock timeout, check if owner is dead
   - If dead: CAS to clear, post semaphore (handles SIGKILL)
   - If alive: Real contention, keep waiting

Guarantees:
✅ Normal exit: owner_pid cleared, semaphore unlocked, slot marked dead
✅ Signal exit: Same (SIGTERM/SIGINT/SIGHUP/SIGABRT caught)
✅ SIGKILL: Detected and recovered within first timeout
✅ Next run: Always starts with clean state

Key insight: Critical cleanup (owner_pid, semaphore) must not require
acquiring a lock. Use atomic operations instead.

Signed-off-by: Nishit Shah <nishshah@linkedin.com>
…cation slowdowns

Root cause: Random 20x slowdowns (12.734ms vs 0.586ms) for 256MB allocations when
all 8 processes allocate simultaneously. Two issues:

1. Seqlock retry storm: When all 8 processes write to their slots, readers see
   writers active (seqlock odd) and spin in tight loop, causing CPU contention.

2. Utilization watcher contention: The utilization_watcher thread held lock_shrreg()
   during slow NVML queries (nvmlDeviceGetComputeRunningProcesses,
   nvmlDeviceGetProcessUtilization), blocking shared memory operations.

Fixes:

1. Seqlock exponential backoff:
   - Removed stale data fallback (memory checks require accurate data)
   - Progressive delays: CPU pause → 1μs → 10μs → 100μs
   - Prevents tight spinning while ensuring accurate reads

2. Utilization watcher optimization:
   - Moved NVML queries OUTSIDE lock_shrreg()
   - Lock now only held briefly to update shared memory
   - Reduces lock hold time from milliseconds to microseconds

Impact: Should eliminate random 256MB allocation slowdowns by reducing
seqlock contention and utilization watcher blocking.

Signed-off-by: Nishit Shah <nishshah@linkedin.com>
Signed-off-by: Nishit Shah <nishshah@linkedin.com>

Co-authored-by: Maverick123123 <yuming.wu@dynamia.ai>
Signed-off-by: Nishit Shah <nishshah@linkedin.com>

Co-authored-by: Maverick123123 <yuming.wu@dynamia.ai>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
@maverick123123 changed the title from "Merge pr155 some important commit" to "Merge pr155 some important commits" Mar 25, 2026
@archlitchi (Member) left a comment:

/lgtm

hami-robot[bot] (Contributor) commented Mar 25, 2026:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, maverick123123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot bot added the approved label Mar 25, 2026
@hami-robot hami-robot bot merged commit 3273df8 into Project-HAMi:main Mar 25, 2026
8 checks passed
