
Merge pr155 some important commits #163

Merged

hami-robot[bot] merged 6 commits into Project-HAMi:main from maverick123123:testpr155 on Mar 25, 2026
Conversation

@maverick123123 (Contributor):


Commit 1: Add seqlock for precise memory accounting (c2ae4aa)

Problem: The existing lock-free atomic reads used memory_order_relaxed for reading per-process memory counters. While individual atomic loads are safe, a reader could see a torn snapshot — e.g., reading total
updated by one allocation but context_size not yet updated — leading to inconsistent memory accounting and potentially incorrect OOM decisions.

Solution: Introduces a per-process sequence lock (seqlock) — an _Atomic uint64_t seqlock field added to shrreg_proc_slot_t in the header.

  • Writer side (add_gpu_device_memory_usage, rm_gpu_device_memory_usage): Before updating counters, atomically increments seqlock to an odd value (signaling "write in progress"), performs all counter updates with
    memory_order_release, then increments seqlock to even (signaling "write complete"). This applies to both the fast path (cached my_slot pointer for own PID) and the slow path (linear scan for other PIDs).
  • Reader side (get_gpu_memory_usage): For each process slot, reads seq1, reads the memory counter, inserts an acquire fence, reads seq2. If seq1 != seq2 (write happened during read), retries. If seq1 is odd
    (writer active), spins with exponential backoff: first 5 retries use CPU pause instructions (pause on x86, yield on ARM), then 1μs delays, then 10μs, then 100μs after 100 retries with debug logging.
  • Header change: Added _Atomic uint64_t seqlock to shrreg_proc_slot_t, reduced unused[3] to unused[2] to maintain struct size/alignment. Seqlock is initialized to 0 (even = no write) in init_proc_slot_withlock.

Commit 2: Critical fix — Stale semaphore timeouts causing 300s hangs (fb4f906)

Problem: Processes would hang for 300+ seconds waiting for lock_shrreg(). Root cause: get_timespec() was called once before the while loop. After the first sem_timedwait() timeout, the timestamp was already in
the past, so every subsequent call returned ETIMEDOUT immediately — the process burned through all 30 retries in milliseconds with no actual waiting, then force-posted the semaphore. sem_post() on an
already-unlocked semaphore (value=1) made it value=2, allowing two processes to enter the critical section simultaneously, corrupting delta detection in set_task_pid().

Solution:

  1. Moved get_timespec() inside the while loop in both lock_shrreg() and the new lock_postinit() — each iteration gets a fresh absolute timeout.
  2. New lock_postinit() / unlock_postinit() functions using a dedicated sem_postinit semaphore for serializing host PID detection during postInit(). Uses longer timeouts: SEM_WAIT_TIME_POSTINIT=30s per wait,
    SEM_WAIT_RETRY_TIMES_POSTINIT=10 retries (total 300s max), since set_task_pid() with adaptive NVML polling can take several seconds.
  3. Graceful timeout handling: lock_postinit() returns 1 on success, 0 on timeout. The caller in libvgpu.c:postInit() checks the return value and skips host PID detection on timeout instead of force-posting the
    semaphore. This prevents semaphore corruption entirely.
  4. Header/init changes: Added sem_t sem_postinit to shared_region_t, initialized via sem_init() in try_create_shrreg().

Commit 3: Make exit cleanup foolproof — atomic operations only, no semaphore needed (46399f2)

Problem: The old exit_handler() tried to acquire region->sem (with a 5s timeout) to clean up its process slot. When 8 processes exited simultaneously, semaphore contention was high — some processes timed out,
leaving stale owner_pid values and locked semaphores. The next run would see a dead owner and get stuck. Exit cleanup that can fail defeats its purpose.

Solution — complete redesign of exit_handler():

  1. No semaphore acquisition at all. Instead:
    - Atomically checks if this process holds owner_pid using atomic_load. If yes, uses CAS (atomic_compare_exchange_strong) to clear it to 0, then posts the semaphore to unlock it. CAS ensures only the actual
    owner clears it.
    - Atomically sets its process slot's PID to 0 (atomic_store_explicit with memory_order_release) and status to 0, marking the slot as dead. This is a simple, contention-free operation that cannot fail.
  2. Lazy slot cleanup: Dead slots (PID=0) are not physically removed during exit. Instead, clear_proc_slot_nolock() was enhanced to detect and remove PID=0 slots (from exit cleanup) in addition to checking
    proc_alive() for truly dead processes. This runs when the next process acquires the lock during initialization. Added a limit of 10 proc_alive() checks per call to avoid holding the lock too long.
  3. SIGKILL recovery in lock_shrreg(): SIGKILL is the only signal that bypasses the exit handler. On ETIMEDOUT, the code now uses atomic_load to read owner_pid, checks proc_alive(), and if the owner is dead, uses
    CAS to atomically clear owner_pid and post the semaphore. Only one process performs recovery thanks to CAS.
  4. Hard deadlock protection: After 30 retries (5 minutes), lock_shrreg() logs an error with diagnostic info (owner alive/dead) and exits with an actionable message: "Delete /tmp/cudevshr.cache and restart all
    processes."

Guarantees: Normal exit, signal exit (SIGTERM/SIGINT/etc.), and SIGKILL all result in clean state — either immediately or on the next process's lock acquisition.


Commit 4: Optimize seqlock and utilization watcher to prevent random 256MB allocation slowdowns (c67cbbe)

Problem: Random 20x performance variance observed (12.734ms vs 0.586ms for 256MB allocations) when all 8 processes allocate simultaneously. Two root causes:

  1. Seqlock retry storm: When 8 processes all write to their slots, readers see active writers (odd seqlock) and spin in a tight loop, causing CPU contention that slows down the writers further (a feedback loop).
  2. Utilization watcher contention: The get_used_gpu_utilization() function in multiprocess_utilization_watcher.c held lock_shrreg() while making slow NVML queries (nvmlDeviceGetComputeRunningProcesses,
    nvmlDeviceGetProcessUtilization), blocking all shared memory operations for the entire duration.

Solution:

  1. Seqlock exponential backoff (already described in Commit 1 — the backoff was added/refined in this commit): Removed the old "fallback to stale data after 100 retries" approach since memory checks require
    accurate data. Replaced with progressive delays: CPU pause instructions → 1μs → 10μs → 100μs. This reduces CPU contention while ensuring accurate reads.
  2. Utilization watcher lock scope reduction: Restructured get_used_gpu_utilization() to perform NVML queries outside the lock. For each device:
    - Call nvmlDeviceGetComputeRunningProcesses() and nvmlDeviceGetProcessUtilization() without holding the lock
    - Then acquire lock_shrreg() briefly only to update shared memory with the results
    - Unlock immediately after updates

This changes the lock scope from "entire loop over all devices including NVML calls" to "brief update of shared memory per device", reducing lock hold time from milliseconds to microseconds.


Commit 5: Merge commit (2175e14)

Merge commit consolidating all changes into the testpr155 branch.


Architecture Summary

The PR transforms HAMi-core's multi-process GPU memory management from a design with several race conditions and failure modes into a robust system with:

  • Seqlocks for consistent multi-field reads without blocking writers
  • Fresh timeouts on every semaphore wait iteration (fixing the root cause of 300s hangs)
  • Atomic-only exit cleanup that cannot fail or cause contention
  • CAS-based SIGKILL recovery so only one process handles dead-owner detection
  • Dedicated sem_postinit semaphore for host PID detection serialization
  • Minimized lock hold times by moving NVML queries outside critical sections

@hami-robot hami-robot bot added the size/L label Mar 24, 2026
@maverick123123 force-pushed the testpr155 branch 4 times, most recently from 81831a5 to d157690 on March 25, 2026 02:12
Nishit Shah and others added 4 commits March 25, 2026 10:23
- Add per-process seqlock counter for consistent snapshots
- Writers: increment seqlock (odd), update counters, increment (even)
- Readers: retry read if seqlock changes or is odd
- Use memory_order_release for writes, acquire for reads
- Guarantees: No partial reads, no stale aggregations
- Fallback to best-effort after 100 retries (prevents livelock)
- Adds CPU pause/yield instructions for efficient spinning

This ensures OOM protection works correctly even under heavy
concurrent memory allocation/deallocation workloads.

Signed-off-by: Nishit Shah <nish511@gmail.com>
Signed-off-by: Nishit Shah <nishshah@linkedin.com>
This fixes the critical bug where processes waited 300+ seconds for locks.

Root causes identified:
1. get_timespec() called ONCE before while loop, creating stale timestamp
2. After first timeout, all subsequent sem_timedwait() immediately timeout
3. Force-posting semaphore corrupted state, allowing multiple processes in

Fixes applied:

1. STALE TIMEOUT FIX (both lock_shrreg and lock_postinit):
   - Move get_timespec() INSIDE the while loop
   - Each iteration gets fresh 10s or 30s timeout
   - Prevents cascading immediate timeouts

2. LONGER TIMEOUT FOR POSTINIT:
   - SEM_WAIT_TIME_POSTINIT = 30s (vs 10s)
   - SEM_WAIT_RETRY_TIMES_POSTINIT = 10 (vs 30)
   - Total still 300s max, but longer per-wait since set_task_pid() can take time

3. GRACEFUL TIMEOUT (no force-post):
   - lock_postinit() returns 1 on success, 0 on timeout
   - Caller checks return value, only unlocks if lock was acquired
   - On timeout, skip host PID detection gracefully
   - Prevents semaphore corruption from force-posting

Why force-post is bad:
- sem_post() increments semaphore value
- If called when value is already 1 (unlocked), makes it 2
- Allows 2 processes to acquire simultaneously
- Breaks delta detection in set_task_pid()

Expected behavior after fix:
- Processes wait up to 30s per retry (plenty of time for set_task_pid)
- Timeouts create fresh timestamps each iteration
- If true deadlock (300s total), gracefully skip detection
- No semaphore corruption

Signed-off-by: Nishit Shah <nishshah@linkedin.com>
This completely redesigns exit cleanup to guarantee that previous runs ALWAYS
leave clean state, with no failure modes.

Problem with previous approach:
- Exit handler tried to acquire semaphore to clean up process slots
- With 8 processes exiting simultaneously, contention was high
- Some processes timed out and failed to clean up
- Left stale owner_pid and locked semaphores
- Next run would see "owner=7388" and get stuck

Root cause: Exit cleanup that CAN FAIL defeats the whole purpose!

New foolproof approach:

1. EXIT HANDLER (NO SEMAPHORE NEEDED):
   - Check if we're holding owner_pid atomically
   - If yes: CAS to clear it, post semaphore
   - Mark our process slot PID as 0 atomically
   - That's it! No semaphore acquisition, no contention, CANNOT FAIL

2. LAZY SLOT CLEANUP:
   - Dead slots (PID=0) are cleaned up by clear_proc_slot_nolock()
   - Called by init_proc_slot_withlock() when next process starts
   - Physical removal happens later, but slot is marked dead immediately

3. SIGKILL RECOVERY:
   - SIGKILL is the ONLY case where exit handler doesn't run
   - On lock timeout, check if owner is dead
   - If dead: CAS to clear, post semaphore (handles SIGKILL)
   - If alive: Real contention, keep waiting

Guarantees:
✅ Normal exit: owner_pid cleared, semaphore unlocked, slot marked dead
✅ Signal exit: Same (SIGTERM/SIGINT/SIGHUP/SIGABRT caught)
✅ SIGKILL: Detected and recovered within first timeout
✅ Next run: Always starts with clean state

Key insight: Critical cleanup (owner_pid, semaphore) must not require
acquiring a lock. Use atomic operations instead.

Signed-off-by: Nishit Shah <nishshah@linkedin.com>
…cation slowdowns

Root cause: Random 20x slowdowns (12.734ms vs 0.586ms) for 256MB allocations when
all 8 processes allocate simultaneously. Two issues:

1. Seqlock retry storm: When all 8 processes write to their slots, readers see
   writers active (seqlock odd) and spin in tight loop, causing CPU contention.

2. Utilization watcher contention: The utilization_watcher thread held lock_shrreg()
   during slow NVML queries (nvmlDeviceGetComputeRunningProcesses,
   nvmlDeviceGetProcessUtilization), blocking shared memory operations.

Fixes:

1. Seqlock exponential backoff:
   - Removed stale data fallback (memory checks require accurate data)
   - Progressive delays: CPU pause → 1μs → 10μs → 100μs
   - Prevents tight spinning while ensuring accurate reads

2. Utilization watcher optimization:
   - Moved NVML queries OUTSIDE lock_shrreg()
   - Lock now only held briefly to update shared memory
   - Reduces lock hold time from milliseconds to microseconds

Impact: Should eliminate random 256MB allocation slowdowns by reducing
seqlock contention and utilization watcher blocking.

Signed-off-by: Nishit Shah <nishshah@linkedin.com>
Signed-off-by: Nishit Shah <nishshah@linkedin.com>

Co-authored-by: Maverick123123 <yuming.wu@dynamia.ai>
Signed-off-by: Nishit Shah <nishshah@linkedin.com>

Co-authored-by: Maverick123123 <yuming.wu@dynamia.ai>
Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
@maverick123123 changed the title from "Merge pr155 some important commit" to "Merge pr155 some important commits" Mar 25, 2026
@archlitchi (Member) left a comment:

/lgtm

hami-robot[bot] (Contributor) commented Mar 25, 2026:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, maverick123123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot bot added the approved label Mar 25, 2026
@hami-robot hami-robot bot merged commit 3273df8 into Project-HAMi:main Mar 25, 2026
8 checks passed
