Project-HAMi · nishitnshah · Mar 23, 2026 · Mar 23, 2026 · Mar 23, 2026 · Mar 23, 2026
diff --git a/docs/performance-optimizations.md b/docs/performance-optimizations.md
@@ -0,0 +1,198 @@
+# HAMi-core Performance Optimizations
+
+## Background
+
+The HAMi-core CUDA library hijacking layer was introducing ~23% overhead
+on training workloads. Profiling identified the overhead in per-call work
+performed by intercepted CUDA functions — not in the actual resource
+limiting logic, but in bookkeeping (logging, status checks, shared memory
+reads) that executed on every CUDA API call regardless of whether limiting
+was active.
+
+This document describes the P0 (critical) and P1 (high-impact)
+optimizations applied to reduce this overhead.
+
+---
+
+## Commit 1 — [P0] Cache log level (`log_utils.h`, `utils.c`, `libvgpu.c`)
+
+### Problem
+Every `LOG_DEBUG`, `LOG_INFO`, `LOG_WARN`, and `LOG_MSG` macro called
+`getenv("LIBCUDA_LOG_LEVEL")` and `atoi()` on every invocation.
+`getenv()` performs a linear scan of the environment block. These macros
+appear in every intercepted CUDA function, so this overhead accumulated
+across thousands of calls per second.
+
+### Fix
+- Added `int g_log_level` global variable (default: 2 = warn level).
+- Added `log_utils_init()` function that reads `LIBCUDA_LOG_LEVEL` once.
+- Rewrote all `LOG_*` macros to check `g_log_level` (a single integer
+  comparison) instead of calling `getenv()` + `atoi()`.
+- `log_utils_init()` is called from `preInit()` in `libvgpu.c`.
+
+### Behavior preserved
+- Default log level (env unset) remains 2 (WARN/MSG/ERROR), matching the
+  original behavior where `LOG_WARN`/`LOG_MSG` logged when env was NULL.
+- `LOG_ERROR` remains unconditional (always emitted).
+- `LIBCUDA_LOG_LEVEL` env var still controls log level; it is simply
+  read once at library load time.
+
+### Testing
+- Set `LIBCUDA_LOG_LEVEL=4` and verify DEBUG output appears.
+- Set `LIBCUDA_LOG_LEVEL=0` and verify only ERROR output appears.
+- Unset `LIBCUDA_LOG_LEVEL` and verify WARN/MSG/ERROR output appears
+  (same as before).
+
+---
+
+## Commit 2 — [P0] Use cached slot in `wait_status_self()` (`multiprocess_memory_limit.c`)
+
+### Problem
+`wait_status_self()` is called via the `ENSURE_RUNNING()` macro on every
+kernel launch and memory operation. It performed a linear scan through
+all process slots (up to 1024 entries), comparing PIDs via `getpid()` to
+find the current process's status field. This was O(n) per call.
+
+### Fix
+- Use the already-cached `region_info.my_slot` pointer (set during
+  `init_proc_slot_withlock()` at startup) for O(1) direct access.
+- Fall back to linear scan only if `my_slot` is NULL (defensive, should
+  not happen after initialization).
+- Used proper `atomic_load_explicit` with `memory_order_acquire` for
+  reading shared-memory fields.
+
+### Behavior preserved
+- Returns the same values as before: 1 if status matches, 0 if not, -1
+  if process not found.
+- The slow path (linear scan) is identical to the original logic.
+
+### Testing
+- Run any CUDA workload with HAMi — `ENSURE_RUNNING()` is exercised on
+  every kernel launch and memory allocation.
+
+---
+
+## Commit 3 — [P1] Optimize `pre_launch_kernel()` (`multiprocess_memory_limit.c`)
+
+### Problem
+`pre_launch_kernel()` runs on every `cuLaunchKernel` call. It was:
+1. Calling `time(NULL)` — a syscall into the kernel.
+2. Always acquiring `pthread_mutex_lock` even when the timestamp had not
+   changed (recording interval is 1 second, kernels fire thousands/sec).
+
+### Fix
+- Replaced `time(NULL)` with `clock_gettime(CLOCK_REALTIME_COARSE)`,
+  which is served from the Linux vDSO (no syscall). Resolution is ~1-4ms,
+  which is irrelevant for a 1-second recording interval. Uses
+  `CLOCK_REALTIME_COARSE` (not `MONOTONIC`) to preserve epoch-time
+  semantics for `dump_shrreg` and other consumers.
+- Added double-checked locking: check the timestamp *before* acquiring
+  the mutex. The fast path (>99.99% of calls) becomes a single memory
+  read + integer comparison. The mutex is only taken when an update is
+  actually needed (~once per second).
+
+### Correctness notes
+- The unlocked read of `region_info.last_kernel_time` is safe: `uint64_t`
+  reads are atomic on x86-64 and aarch64 (aligned). A torn read would at
+  worst cause one unnecessary mutex acquisition, not incorrect behavior.
+- The atomic CAS update to the shared region is unchanged.
+
+### Testing
+- Run `dump_shrreg` tool while a CUDA workload is active — verify
+  `last_kernel_time` still updates correctly (once per second).
+- Run a kernel-intensive workload and compare throughput with/without
+  this change.
+
+---
+
+## Commit 4 — [P1] Optimize `rate_limiter()` (`multiprocess_utilization_watcher.c`)
+
+### Problem
+`rate_limiter()` runs on every kernel launch when `pidfound==1`. Before
+reaching the actual rate-limiting CAS loop, it performed:
+- `get_recent_kernel()` — shared memory read
+- `set_recent_kernel(2)` — shared memory write (always writing 2, which
+  was already the value — a no-op that dirtied a cross-process cache line)
+- `get_current_device_sm_limit(0)` — called twice (redundant)
+- `get_utilization_switch()` — shared memory read
+
+That is 3 shared memory reads + 1 write + 2 `ensure_initialized()` calls
+on every kernel launch, even when rate limiting was inactive.
+
+### Fix
+- Cache `sm_limit` and `utilization_switch` in static locals during
+  `init_utilization_watcher()`. These values are set at container startup
+  and do not change at runtime.
+- Fast-exit check uses cached locals: when limiting is inactive
+  (sm_limit >= 100 or == 0), `rate_limiter` returns after a single
+  branch on a local variable.
+- Removed `set_recent_kernel(2)` — eliminated the shared memory write.
+- Removed the duplicate `get_current_device_sm_limit(0)` call.
+- Reduced `sleep(1)` to `usleep(1000)` in the defensive `recent_kernel`
+  guard (currently unreachable but safer if triggered externally).
+- The CAS spin loop and 10ms `nanosleep` backoff are **unchanged**,
+  preserving correct rate-limiting when 0 < sm_limit < 100.
+
+### Behavior preserved
+- When SM limiting is active (0 < sm_limit < 100), the token-bucket
+  mechanism works identically — the only difference is reaching the CAS
+  loop faster (fewer shared memory reads before it).
+- The utilization watcher thread is still created under the same
+  conditions.
+
+### Testing
+- With `CUDA_DEVICE_SM_LIMIT=50`: verify SM utilization is capped as
+  before. Run a compute-heavy workload and confirm utilization stays
+  near the configured limit.
+- With `CUDA_DEVICE_SM_LIMIT=100` (or unset): verify no rate limiting
+  occurs and kernel throughput matches native (no HAMi) baseline.
+
+---
+
+## Commit 5 — [P1] Remove dead `cuDeviceGetCount` in `oom_check()` (`allocator.c`)
+
+### Problem
+`oom_check()` called `cuDeviceGetCount()` on every memory allocation,
+storing the result in `count1` — which was never read. This was a wasted
+CUDA driver API call on every allocation.
+
+### Fix
+Removed the dead `cuDeviceGetCount` call and the unused `count1`
+variable. The function only needs the specific device ID passed via the
+`dev` parameter, not the total device count.
+
+### Testing
+- Run memory allocation tests (`test_alloc`, `test_runtime_alloc`, etc.)
+  to verify OOM checking still works correctly.
+
+---
+
+## Expected Impact
+
+| Change | Per-call overhead removed | Frequency |
+|--------|--------------------------|-----------|
+| Cached log level | `getenv()` + `atoi()` per LOG macro | Every CUDA call |
+| Cached `my_slot` in `wait_status_self` | O(n) linear scan of process slots | Every kernel launch + memory op |
+| vDSO clock + double-checked lock | `time()` syscall + mutex lock/unlock | Every kernel launch |
+| Cached rate_limiter limits | 3 shared mem reads + 1 write | Every kernel launch |
+| Remove dead `cuDeviceGetCount` | 1 driver API call | Every memory allocation |
+
+Combined, these changes should reduce the hijacking overhead from ~23%
+to under 5% for typical training workloads.
+
+## How to benchmark
+
+```bash
+# Baseline (no HAMi):
+python ../semantic-id-recsys/semantic-id-training/test_hami_slowdown.py
+
+# With HAMi (original):
+LD_PRELOAD=/path/to/original/libvgpu.so \
+  python ../semantic-id-recsys/semantic-id-training/test_hami_slowdown.py
+
+# With HAMi (optimized):
+LD_PRELOAD=/path/to/optimized/libvgpu.so \
+  python ../semantic-id-recsys/semantic-id-training/test_hami_slowdown.py
+```
+
+Compare wall-clock time and throughput (samples/sec) across the three runs.
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
@@ -14,7 +14,7 @@ add_subdirectory(cuda)
 add_subdirectory(nvml)
 
 set(LIBVGPU vgpu)
-add_library(${LIBVGPU} SHARED libvgpu.c utils.c $<TARGET_OBJECTS:nvml_mod> $<TARGET_OBJECTS:cuda_mod> $<TARGET_OBJECTS:allocator_mod> $<TARGET_OBJECTS:multiprocess_mod>)
+add_library(${LIBVGPU} SHARED libvgpu.c utils.c log_utils.c $<TARGET_OBJECTS:nvml_mod> $<TARGET_OBJECTS:cuda_mod> $<TARGET_OBJECTS:allocator_mod> $<TARGET_OBJECTS:multiprocess_mod>)
 target_compile_options(${LIBVGPU} PUBLIC ${LIBRARY_COMPILE_FLAGS})
 target_link_libraries(${LIBVGPU} PUBLIC -lcuda -lnvidia-ml)
 

diff --git a/src/allocator/allocator.c b/src/allocator/allocator.c
@@ -34,8 +34,6 @@ size_t round_up(size_t size, size_t unit) {
 }
 
 int oom_check(const int dev, size_t addon) {
-    int count1=0;
-    CUDA_OVERRIDE_CALL(cuda_library_entry,cuDeviceGetCount,&count1);
     CUdevice d;
     if (dev==-1)
         cuCtxGetDevice(&d);

diff --git a/src/include/log_utils.h b/src/include/log_utils.h
@@ -10,33 +10,39 @@
 
 extern FILE *fp1;
 
-#ifdef FILEDEBUG 
+/*
+ * Cached log level — initialized once by log_utils_init().
+ * Default is 2 (WARN/MSG/ERROR) to match original behavior when
+ * LIBCUDA_LOG_LEVEL is unset.
+ *
+ * Levels: 0=off, 1=error-only, 2=warn(default), 3=info, 4=debug
+ */
+extern int g_log_level;
+
+/* Call once during early initialization to cache LIBCUDA_LOG_LEVEL. */
+void log_utils_init(void);
+
+#ifdef FILEDEBUG
 #define LOG_DEBUG(msg, ...) { \
-    if ((getenv("LIBCUDA_LOG_LEVEL")!=NULL) && (atoi(getenv("LIBCUDA_LOG_LEVEL"))>=4)) {\
+    if (g_log_level >= 4) {\
         if (fp1==NULL) fp1 = fopen ("/tmp/vgpulog", "a"); \
         fprintf(fp1, "[HAMI-core Debug(%d:%ld:%s:%d)]: "msg"\n",getpid(),pthread_self(),basename(__FILE__),__LINE__,##__VA_ARGS__); \
         }\
     }
 #define LOG_INFO(msg, ...) { \
-    if ( \
-         /*(getenv("LIBCUDA_LOG_LEVEL")==NULL) || */\
-         (getenv("LIBCUDA_LOG_LEVEL")!=NULL) && (atoi(getenv("LIBCUDA_LOG_LEVEL"))>=3)) {\
+    if (g_log_level >= 3) {\
         if (fp1==NULL) fp1 = fopen ("/tmp/vgpulog", "a"); \
         fprintf(fp1, "[HAMI-core Info(%d:%ld:%s:%d)]: "msg"\n", getpid(),pthread_self(),basename(__FILE__),__LINE__,##__VA_ARGS__); \
          }\
     }
 #define LOG_WARN(msg, ...) { \
-    if ( \
-        (getenv("LIBCUDA_LOG_LEVEL")==NULL) || \
-        ((getenv("LIBCUDA_LOG_LEVEL")!=NULL) && (atoi(getenv("LIBCUDA_LOG_LEVEL"))>=2))) {\
+    if (g_log_level >= 2) {\
         if (fp1==NULL) fp1 = fopen ("/tmp/vgpulog", "a"); \
         fprintf(fp1, "[HAMI-core Warn(%d:%ld:%s:%d)]: "msg"\n", getpid(),pthread_self(),basename(__FILE__),__LINE__,##__VA_ARGS__); \
         }\
     }
 #define LOG_MSG(msg, ...) { \
-    if ( \
-        (getenv("LIBCUDA_LOG_LEVEL")==NULL) || \
-        ((getenv("LIBCUDA_LOG_LEVEL")!=NULL) && (atoi(getenv("LIBCUDA_LOG_LEVEL"))>=2))) {\
+    if (g_log_level >= 2) {\
         if (fp1==NULL) fp1 = fopen ("/tmp/vgpulog", "a"); \
         fprintf(fp1, "[HAMI-core Msg(%d:%ld:%s:%d)]: "msg"\n", getpid(),pthread_self(),basename(__FILE__),__LINE__,##__VA_ARGS__); \
          }\
@@ -47,27 +53,22 @@ extern FILE *fp1;
 }
 #else
 #define LOG_DEBUG(msg, ...) { \
-    if ((getenv("LIBCUDA_LOG_LEVEL")!=NULL) && (atoi(getenv("LIBCUDA_LOG_LEVEL"))>=4)) {\
+    if (g_log_level >= 4) {\
         fprintf(stderr, "[HAMI-core Debug(%d:%ld:%s:%d)]: "msg"\n",getpid(),pthread_self(),basename(__FILE__),__LINE__,##__VA_ARGS__); \
          }\
     }
 #define LOG_INFO(msg, ...) { \
-    if ( \
-        (getenv("LIBCUDA_LOG_LEVEL")!=NULL) && (atoi(getenv("LIBCUDA_LOG_LEVEL"))>=3)) {\
+    if (g_log_level >= 3) {\
         fprintf(stderr, "[HAMI-core Info(%d:%ld:%s:%d)]: "msg"\n", getpid(),pthread_self(),basename(__FILE__),__LINE__,##__VA_ARGS__); \
         }\
     }
 #define LOG_WARN(msg, ...) { \
-    if ( \
-        (getenv("LIBCUDA_LOG_LEVEL")==NULL) || \
-        ((getenv("LIBCUDA_LOG_LEVEL")!=NULL) && (atoi(getenv("LIBCUDA_LOG_LEVEL"))>=2))) {\
+    if (g_log_level >= 2) {\
         fprintf(stderr, "[HAMI-core Warn(%d:%ld:%s:%d)]: "msg"\n", getpid(),pthread_self(),basename(__FILE__),__LINE__,##__VA_ARGS__); \
         }\
     }
 #define LOG_MSG(msg, ...) { \
-    if ( \
-        (getenv("LIBCUDA_LOG_LEVEL")==NULL) || \
-        ((getenv("LIBCUDA_LOG_LEVEL")!=NULL) && (atoi(getenv("LIBCUDA_LOG_LEVEL"))>=2))) {\
+    if (g_log_level >= 2) {\
         fprintf(stderr, "[HAMI-core Msg(%d:%ld:%s:%d)]: "msg"\n", getpid(),pthread_self(),basename(__FILE__),__LINE__,##__VA_ARGS__); \
          }\
     }
@@ -107,7 +108,7 @@ extern FILE *fp1;
 #define IF_CHECK_OOM(res) {                   \
     if (res < 0)                              \
         return CUDA_ERROR_OUT_OF_MEMORY;      \
-}     
+}
 
 
 #endif
diff --git a/src/libvgpu.c b/src/libvgpu.c
@@ -853,6 +853,7 @@ void* __dlsym_hook_section_nvml(void* handle, const char* symbol) {
 }
 
 void preInit(){
+    log_utils_init();
     LOG_MSG("Initializing.....");
     if (real_dlsym == NULL) {
         real_dlsym = dlvsym(RTLD_NEXT,"dlsym","GLIBC_2.2.5");

diff --git a/src/log_utils.c b/src/log_utils.c
@@ -0,0 +1,18 @@
+#include <stdio.h>
+#include <stdlib.h>
+
+/*
+ * Cached log level, read once from LIBCUDA_LOG_LEVEL by log_utils_init().
+ * Default 2 = warn/msg/error (matches original behavior when env is unset).
+ */
+int g_log_level = 2;
+
+FILE *fp1 = NULL;
+
+void log_utils_init(void) {
+    const char *env = getenv("LIBCUDA_LOG_LEVEL");
+    if (env != NULL) {
+        g_log_level = atoi(env);
+    }
+    /* else: keep default of 2 (warn level) */
+}
diff --git a/src/multiprocess/CMakeLists.txt b/src/multiprocess/CMakeLists.txt
@@ -3,6 +3,6 @@ add_library(multiprocess_mod OBJECT multiprocess_memory_limit.c multiprocess_uti
 target_compile_options(multiprocess_mod PUBLIC ${LIBRARY_COMPILE_FLAGS})
 target_link_libraries(multiprocess_mod PUBLIC nvidia-ml)
 
-add_executable(shrreg-tool shrreg_tool.c)
+add_executable(shrreg-tool shrreg_tool.c ${CMAKE_CURRENT_SOURCE_DIR}/../log_utils.c)
 target_link_libraries(shrreg-tool multiprocess_mod -lpthread -lcuda)
 
diff --git a/src/multiprocess/multiprocess_memory_limit.c b/src/multiprocess/multiprocess_memory_limit.c
@@ -212,14 +212,24 @@
 }
 
 void pre_launch_kernel() {
-    uint64_t now = time(NULL);
+    struct timespec ts;
+    clock_gettime(CLOCK_REALTIME_COARSE, &ts);
+    uint64_t now = (uint64_t)ts.tv_sec;
+
+    // Fast path: skip mutex if within recording interval (double-checked)
+    if (now - region_info.last_kernel_time < _record_kernel_interval) {
+        return;
+    }
+
     pthread_mutex_lock(&_kernel_mutex);
+    // Re-check under lock — another thread may have updated
     if (now - region_info.last_kernel_time < _record_kernel_interval) {
         pthread_mutex_unlock(&_kernel_mutex);
         return;
     }
     region_info.last_kernel_time = now;
     pthread_mutex_unlock(&_kernel_mutex);
+
     LOG_INFO("write last kernel time: %ld", now)
     // Lock-free update using atomic compare-exchange
     uint64_t expected = atomic_load_explicit(&region_info.shared_region->last_kernel_time, memory_order_acquire);
@@ -1046,10 +1056,20 @@
 }
 
 int wait_status_self(int status){
+    // Fast path: use cached slot pointer (set during init_proc_slot_withlock)
+    if (region_info.my_slot != NULL) {
+        int32_t cur = atomic_load_explicit(&region_info.my_slot->status, memory_order_acquire);
+        return (cur == status) ? 1 : 0;
+    }
+
+    // Slow path: linear scan (only if my_slot not yet cached)
     int i;
-    for (i=0;i<region_info.shared_region->proc_num;i++){
-        if (region_info.shared_region->procs[i].pid==getpid()){
-            if (region_info.shared_region->procs[i].status==status)
+    int proc_num = atomic_load_explicit(&region_info.shared_region->proc_num, memory_order_acquire);
+    int32_t my_pid = getpid();
+    for (i=0;i<proc_num;i++){
+        int32_t slot_pid = atomic_load_explicit(&region_info.shared_region->procs[i].pid, memory_order_acquire);
+        if (slot_pid==my_pid){
+            if (atomic_load_explicit(&region_info.shared_region->procs[i].status, memory_order_acquire)==status)
                 return 1;
             else
                 return 0;