Merge pr155 some important commits #163
Merged
hami-robot[bot] merged 6 commits into Project-HAMi:main from testpr155 on Mar 25, 2026
- Add per-process seqlock counter for consistent snapshots
- Writers: increment seqlock (odd), update counters, increment again (even)
- Readers: retry the read if seqlock changes or is odd
- Use memory_order_release for writes, memory_order_acquire for reads
- Guarantees: no partial reads, no stale aggregations
- Fall back to best-effort after 100 retries (prevents livelock)
- Add CPU pause/yield instructions for efficient spinning

This ensures OOM protection works correctly even under heavy concurrent memory allocation/deallocation workloads.

Signed-off-by: Nishit Shah <nish511@gmail.com>
Signed-off-by: Nishit Shah <nishshah@linkedin.com>
This fixes the critical bug where processes waited 300+ seconds for locks.

Root causes identified:
1. get_timespec() was called ONCE before the while loop, creating a stale timestamp
2. After the first timeout, all subsequent sem_timedwait() calls time out immediately
3. Force-posting the semaphore corrupted state, allowing multiple processes in

Fixes applied:
1. STALE TIMEOUT FIX (both lock_shrreg and lock_postinit):
   - Move get_timespec() INSIDE the while loop
   - Each iteration gets a fresh 10s or 30s timeout
   - Prevents cascading immediate timeouts
2. LONGER TIMEOUT FOR POSTINIT:
   - SEM_WAIT_TIME_POSTINIT = 30s (vs 10s)
   - SEM_WAIT_RETRY_TIMES_POSTINIT = 10 (vs 30)
   - Total still 300s max, but longer per-wait since set_task_pid() can take time
3. GRACEFUL TIMEOUT (no force-post):
   - lock_postinit() returns 1 on success, 0 on timeout
   - Caller checks the return value and only unlocks if the lock was acquired
   - On timeout, skip host PID detection gracefully
   - Prevents semaphore corruption from force-posting

Why force-posting is bad:
- sem_post() increments the semaphore value
- If called when the value is already 1 (unlocked), it becomes 2
- Two processes can then acquire the lock simultaneously
- This breaks delta detection in set_task_pid()

Expected behavior after the fix:
- Processes wait up to 30s per retry (plenty of time for set_task_pid)
- Each retry computes a fresh timestamp
- On a true deadlock (300s total), detection is skipped gracefully
- No semaphore corruption

Signed-off-by: Nishit Shah <nishshah@linkedin.com>
This completely redesigns exit cleanup to guarantee that previous runs ALWAYS leave clean state, with no failure modes.

Problem with the previous approach:
- The exit handler tried to acquire the semaphore to clean up process slots
- With 8 processes exiting simultaneously, contention was high
- Some processes timed out and failed to clean up
- This left stale owner_pid values and locked semaphores
- The next run would see "owner=7388" and get stuck

Root cause: exit cleanup that CAN FAIL defeats the whole purpose!

New foolproof approach:
1. EXIT HANDLER (NO SEMAPHORE NEEDED):
   - Check atomically whether we are holding owner_pid
   - If yes: CAS to clear it, then post the semaphore
   - Mark our process slot's PID as 0 atomically
   - That's it! No semaphore acquisition, no contention, CANNOT FAIL
2. LAZY SLOT CLEANUP:
   - Dead slots (PID=0) are cleaned up by clear_proc_slot_nolock()
   - Called by init_proc_slot_withlock() when the next process starts
   - Physical removal happens later, but the slot is marked dead immediately
3. SIGKILL RECOVERY:
   - SIGKILL is the ONLY case where the exit handler doesn't run
   - On lock timeout, check whether the owner is dead
   - If dead: CAS to clear owner_pid, post the semaphore (handles SIGKILL)
   - If alive: real contention, keep waiting

Guarantees:
✅ Normal exit: owner_pid cleared, semaphore unlocked, slot marked dead
✅ Signal exit: same (SIGTERM/SIGINT/SIGHUP/SIGABRT caught)
✅ SIGKILL: detected and recovered within the first timeout
✅ Next run: always starts with clean state

Key insight: critical cleanup (owner_pid, semaphore) must not require acquiring a lock. Use atomic operations instead.

Signed-off-by: Nishit Shah <nishshah@linkedin.com>
Optimize seqlock and utilization watcher to prevent random 256MB allocation slowdowns

Root cause: random 20x slowdowns (12.734ms vs 0.586ms) for 256MB allocations when all 8 processes allocate simultaneously. Two issues:
1. Seqlock retry storm: when all 8 processes write to their slots, readers see writers active (seqlock odd) and spin in a tight loop, causing CPU contention.
2. Utilization watcher contention: the utilization_watcher thread held lock_shrreg() during slow NVML queries (nvmlDeviceGetComputeRunningProcesses, nvmlDeviceGetProcessUtilization), blocking shared memory operations.

Fixes:
1. Seqlock exponential backoff:
   - Removed the stale-data fallback (memory checks require accurate data)
   - Progressive delays: CPU pause → 1μs → 10μs → 100μs
   - Prevents tight spinning while ensuring accurate reads
2. Utilization watcher optimization:
   - Moved NVML queries OUTSIDE lock_shrreg()
   - The lock is now held only briefly to update shared memory
   - Reduces lock hold time from milliseconds to microseconds

Impact: should eliminate the random 256MB allocation slowdowns by reducing seqlock contention and utilization watcher blocking.

Signed-off-by: Nishit Shah <nishshah@linkedin.com>
Signed-off-by: Nishit Shah <nishshah@linkedin.com> Co-authored-by: Maverick123123 <yuming.wu@dynamia.ai>
Signed-off-by: Nishit Shah <nishshah@linkedin.com> Co-authored-by: Maverick123123 <yuming.wu@dynamia.ai> Signed-off-by: Maverick123123 <yuming.wu@dynamia.ai>
Contributor

[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: archlitchi, maverick123123.
Commit 1: Add seqlock for precise memory accounting (c2ae4aa)
Problem: The existing lock-free atomic reads used memory_order_relaxed for reading per-process memory counters. While individual atomic loads are safe, a reader could see a torn snapshot — e.g., total updated by one allocation but context_size not yet updated — leading to inconsistent memory accounting and potentially incorrect OOM decisions.
Solution: Introduces a per-process sequence lock (seqlock) — a _Atomic uint64_t seqlock field added to shrreg_proc_slot_t in the header.
- Writer side: increments seqlock to an odd value (signaling "write in progress"), updates the counters with memory_order_release, then increments seqlock to even (signaling "write complete"). This applies to both the fast path (cached my_slot pointer for own PID) and the slow path (linear scan for other PIDs).
- Reader side: loads seqlock before and after copying the counters; if the value changed between the two loads or is odd (writer active), it retries, spinning with exponential backoff: the first 5 retries use CPU pause instructions (pause on x86, yield on ARM), then 1μs delays, then 10μs, then 100μs after 100 retries with debug logging.
Commit 2: Critical fix — Stale semaphore timeouts causing 300s hangs (fb4f906)
Problem: Processes would hang for 300+ seconds waiting for lock_shrreg(). Root cause: get_timespec() was called once before the while loop. After the first sem_timedwait() timeout, the timestamp was already in
the past, so every subsequent call returned ETIMEDOUT immediately — the process burned through all 30 retries in milliseconds with no actual waiting, then force-posted the semaphore. sem_post() on an
already-unlocked semaphore (value=1) made it value=2, allowing two processes to enter the critical section simultaneously, corrupting delta detection in set_task_pid().
Solution:
- Moves get_timespec() inside the while loop so every retry computes a fresh absolute deadline, eliminating the cascade of immediate timeouts.
- Adds SEM_WAIT_TIME_POSTINIT=30s per wait with SEM_WAIT_RETRY_TIMES_POSTINIT=10 retries (total 300s max), since set_task_pid() with adaptive NVML polling can take several seconds.
- lock_postinit() now returns 1 on success and 0 on timeout; the caller only unlocks if it actually acquired the lock, and on timeout it skips host PID detection gracefully instead of force-posting the semaphore. This prevents semaphore corruption entirely.
Commit 3: Make exit cleanup foolproof — atomic operations only, no semaphore needed (46399f2)
Problem: The old exit_handler() tried to acquire region->sem (with a 5s timeout) to clean up its process slot. When 8 processes exited simultaneously, semaphore contention was high — some processes timed out,
leaving stale owner_pid values and locked semaphores. The next run would see a dead owner and get stuck. Exit cleanup that can fail defeats its purpose.
Solution — complete redesign of exit_handler():
- Atomically checks if this process holds owner_pid using atomic_load. If yes, uses CAS (atomic_compare_exchange_strong) to clear it to 0, then posts the semaphore to unlock it. CAS ensures only the actual
owner clears it.
- Atomically sets its process slot's PID to 0 (atomic_store_explicit with memory_order_release) and status to 0, marking the slot as dead. This is a simple, contention-free operation that cannot fail.
- Lazy slot cleanup: slots marked dead (PID=0) are physically reclaimed later by clear_proc_slot_nolock(), which also checks proc_alive() for truly dead processes. This runs when the next process acquires the lock during initialization. A limit of 10 proc_alive() checks per call avoids holding the lock too long.
- SIGKILL recovery: SIGKILL is the only case where the exit handler cannot run. On lock timeout, a waiter checks whether the recorded owner is still alive; if it is dead, the waiter uses CAS to atomically clear owner_pid and post the semaphore. Only one process performs recovery thanks to CAS.
Guarantees: Normal exit, signal exit (SIGTERM/SIGINT/etc.), and SIGKILL all result in clean state — either immediately or on the next process's lock acquisition.
Commit 4: Optimize seqlock and utilization watcher to prevent random 256MB allocation slowdowns (c67cbbe)
Problem: Random 20x performance variance observed (12.734ms vs 0.586ms for 256MB allocations) when all 8 processes allocate simultaneously. Two root causes:
1. Seqlock retry storm: with all 8 processes writing to their slots at once, readers repeatedly see the seqlock odd and spin in a tight loop, causing CPU contention.
2. The utilization_watcher thread held lock_shrreg() during slow NVML queries (nvmlDeviceGetComputeRunningProcesses, nvmlDeviceGetProcessUtilization), blocking all shared memory operations for the entire duration.
Solution:
- Seqlock backoff: removed the stale-data fallback from the reader, since memory checks require accurate data, and replaced it with progressive delays: CPU pause instructions → 1μs → 10μs → 100μs. This reduces CPU contention while ensuring accurate reads.
- Utilization watcher lock scope:
- Call nvmlDeviceGetComputeRunningProcesses() and nvmlDeviceGetProcessUtilization() without holding the lock
- Then acquire lock_shrreg() briefly only to update shared memory with the results
- Unlock immediately after updates
This changes the lock scope from "entire loop over all devices including NVML calls" to "brief update of shared memory per device", reducing lock hold time from milliseconds to microseconds.
Commit 5: Merge commit (2175e14)
Merge commit consolidating all changes into the testpr155 branch.
Architecture Summary
The PR transforms HAMi-core's multi-process GPU memory management from a design with several race conditions and failure modes into a robust system with: