gb10-uma-diagnostics

Unified memory diagnostic suite for NVIDIA GB10 (DGX Spark).
Controlled measurement and system-level behavior classification.

Overview

gb10-uma-diagnostics is a controlled measurement suite for analyzing unified memory behavior on GB10 systems.

It performs targeted experiments to measure:

Memory bandwidth (CPU and GPU)
CPU–GPU contention on shared memory
Atomic coherence cost (system vs GPU scope)
Memory stall behavior under pressure
Power and clock response during load

Why This Exists

On GB10 (DGX Spark), several unified memory signals are not fully exposed through current APIs:

NVML memory clock — not exposed (returns N/A)
Nsight Systems UVM tracing — unsupported
CUPTI UVM event collection — structurally absent on hardware-coherent UMA

Fine-grained unified memory behavior must be inferred from controlled experiments. CUPTI Activity instrumentation details will be documented once GB10 validation is complete.

Approach

controlled experiment → measured response → classification

Rather than relying on internal driver state, behavior is derived from:

bandwidth response
contention patterns
latency
PSI (stall signal)
power and clock changes

Diagnostic Model

Methodology

1. Controlled Experiment

Defined memory access patterns
CPU/GPU concurrency models
Repeatable workload conditions

2. Measured Response

Bandwidth (GB/s)
Latency (ns)
PSI (/proc/pressure/memory)
Power / clocks
System behavior under load

3. Classification

Convert signals into system-level interpretation

Signal Interpretation

Memory Contention
symptom: bandwidth drop under concurrent access
interpretation: shared memory fabric contention

Memory Stall
symptom: PSI (memory) rising, especially "full"
interpretation: scheduler blocked on memory

Power Limiting
symptom: clock reduction under load, power plateau
interpretation: power or thermal constraint

Combined Effects
symptom: bandwidth drop + PSI rise + power increase
interpretation: contention driving both memory stall and power response
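The symptom-to-interpretation mapping above can be sketched as a small rule set. This is a minimal illustration and not the logic in spbm_analyzer.py: the `Signals` fields, the label strings, and all three thresholds are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds for illustration only; tune against your own baselines.
BW_DROP_PCT = 20.0      # bandwidth drop under concurrent access
PSI_FULL_PCT = 5.0      # PSI "full" avg10 treated as "rising"
CLOCK_DROP_PCT = 10.0   # clock reduction under load


@dataclass
class Signals:
    bandwidth_drop_pct: float  # drop relative to the uncontended run
    psi_full_avg10: float      # "full" avg10 from /proc/pressure/memory
    clock_drop_pct: float      # clock reduction relative to the unloaded clock
    power_plateau: bool        # power has flattened while load continues
    power_rising: bool         # power is still increasing


def classify(s: Signals) -> List[str]:
    """Return every interpretation whose symptoms are present."""
    contention = s.bandwidth_drop_pct >= BW_DROP_PCT
    stall = s.psi_full_avg10 >= PSI_FULL_PCT
    labels = []
    if contention:
        labels.append("memory_contention")
    if stall:
        labels.append("memory_stall")
    if s.clock_drop_pct >= CLOCK_DROP_PCT and s.power_plateau:
        labels.append("power_limiting")
    if contention and stall and s.power_rising:
        labels.append("combined_effects")
    return labels


print(classify(Signals(35.0, 12.0, 0.0, False, True)))
```

Returning a list rather than a single label keeps the combined case visible alongside the individual symptoms that triggered it.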

Key Principle

PSI (/proc/pressure/memory) is the most reliable observable indicator of memory stall on GB10 systems where direct UVM telemetry is unavailable. PSI reflects time stalled, not allocation size — making it suitable for detecting failure conditions before they become unrecoverable.
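For readers who want to inspect the stall signal directly, the sketch below parses the standard Linux PSI text format: a "some" line and a "full" line, each with avg10/avg60/avg300 percentages and a cumulative total in microseconds. The function names and the sample values are illustrative and are not taken from this repository.

```python
from pathlib import Path

PSI_MEMORY_PATH = Path("/proc/pressure/memory")


def parse_psi(text: str) -> dict:
    """Parse PSI text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is cumulative stall time in microseconds; avg* are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_memory_psi(path: Path = PSI_MEMORY_PATH) -> dict:
    """Read and parse the live PSI file (requires a Linux kernel with PSI enabled)."""
    return parse_psi(path.read_text())


# Made-up sample in the kernel's format, so the example runs on any machine.
SAMPLE = (
    "some avg10=1.25 avg60=0.40 avg300=0.10 total=123456\n"
    "full avg10=0.75 avg60=0.20 avg300=0.05 total=65432\n"
)
sample = parse_psi(SAMPLE)
print("full avg10:", sample["full"]["avg10"])
```

The "full" line is the one to watch: it reports the share of time in which all non-idle tasks were stalled on memory at once, whereas "some" only requires at least one stalled task.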

Constraints

No direct UVM fault stream
No memory clock via NVML
Partial profiler support

Classification is based on externally observable behavior, not internal driver state.

Tools

uma_bw — Bandwidth Probe

Measures CPU and GPU bandwidth using PTX-level cache operators for true DRAM measurement.
PTX read: ld.global.cg (L1 bypass, cached at L2 only)
PTX write: st.global.cs (streaming store with an evict-first cache policy, used to approximate a true DRAM write)

Flags:
--calibrate-peak    empirical peak BW, no hardcoded spec
--peak-from peak_calibration.json    load peak, compute efficiency%
--json-only

Build:

nvcc -O2 -std=c++17 -I./include uma_bandwidth_test.cu -o uma_bw -lcudart -lpthread
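The efficiency figure associated with --peak-from is presumably measured bandwidth divided by the calibrated peak. The sketch below shows that arithmetic under that assumption. The schema of peak_calibration.json is not documented here, so the `peak_gbs` key is hypothetical, as are the function names and sample numbers.

```python
import json


def load_peak(json_text: str, key: str = "peak_gbs") -> float:
    """Extract the calibrated peak from JSON text. The key name is an assumption."""
    return float(json.loads(json_text)[key])


def efficiency_pct(measured_gbs: float, peak_gbs: float) -> float:
    """Measured bandwidth as a percentage of the empirically calibrated peak."""
    if peak_gbs <= 0:
        raise ValueError("peak bandwidth must be positive")
    return 100.0 * measured_gbs / peak_gbs


# Illustrative numbers only, not measured results.
peak = load_peak('{"peak_gbs": 200.0}')
print(f"{efficiency_pct(160.0, peak):.1f}% of calibrated peak")
```

Calibrating the denominator empirically matters on GB10 because the driver does not report a usable peak.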

uma_contention — Contention Probe

Measures bandwidth degradation under CPU/GPU simultaneous memory access.

Modes:
--mode gpu-read
--mode gpu-write
--mode cpu-read
--mode cpu-write
--mode cpu-read-gpu-read    split buffer, parallel bandwidth
--mode cpu-write-gpu-read   same buffer, maximum contention
--mode cpu-write-gpu-write  same buffer, both writing
--mode sweep                all modes (default)
--peak-from peak_calibration.json

Build:

nvcc -O2 -std=c++17 -I./include uma_contention.cu -o uma_contention -lcudart -lpthread

uma_atomic — Coherence Probe

Measures atomic coherence cost on hardware-coherent UMA.
atom.global.gpu: GPU-scope atomic
atom.global.sys: system-scope atomic (NVLink-C2C coherence path)
SYS/GPU ratio: coherence overhead

Build:

nvcc -O2 -std=c++17 uma_atomic_test.cu -o uma_atomic -lcudart
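The SYS/GPU ratio can be read as the system-scope cost divided by the GPU-scope cost for the same atomic operation. The sketch below assumes both are available as per-operation latencies in nanoseconds. The actual output format of uma_atomic is not described here, so the inputs, function names, and sample values are hypothetical.

```python
def coherence_ratio(sys_ns_per_op: float, gpu_ns_per_op: float) -> float:
    """System-scope atomic cost relative to GPU-scope cost (1.0 means no overhead)."""
    if gpu_ns_per_op <= 0:
        raise ValueError("GPU-scope latency must be positive")
    return sys_ns_per_op / gpu_ns_per_op


def coherence_overhead_ns(sys_ns_per_op: float, gpu_ns_per_op: float) -> float:
    """Absolute extra cost per operation attributed to system-scope coherence."""
    return sys_ns_per_op - gpu_ns_per_op


# Illustrative latencies only, not measured results.
print(f"SYS/GPU ratio: {coherence_ratio(120.0, 40.0):.2f}x")
print(f"overhead: {coherence_overhead_ns(120.0, 40.0):.1f} ns/op")
```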

spbm_analyzer.py — Power + Pressure Classifier

Reads spark_hwmon sensors, nvidia-smi, and PSI live. Classifies system state in real time and writes events to events.json.

Run:

python3 spbm_analyzer.py <outdir> [sparkview_anomaly_log]

run_correlated.sh — Experiment Orchestrator

Runs all tools in a controlled phased experiment with shared SPBM telemetry and timestamp alignment.
Phase 0: pre-run check (clock, temp, swap)
Phase 1: uma_bw, default clocks
Phase 2: cooldown to baseline
Phase 3: uma_bw, capped clocks
Phase 4: cooldown to baseline
Phase 5: uma_contention sweep
Phase 6: package all outputs into timestamped zip
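The cooldown phases exist so that each measurement starts from the same thermal state. One way to express "cooldown to baseline" is a polling loop with a timeout, sketched below. This is not the logic in run_correlated.sh: the function name, tolerance, timeout, and the injected temperature reader are all assumptions made for the example.

```python
import time
from typing import Callable


def wait_for_baseline(
    read_temp_c: Callable[[], float],
    baseline_c: float,
    tolerance_c: float = 2.0,
    timeout_s: float = 300.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until temperature is within tolerance of baseline; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if read_temp_c() <= baseline_c + tolerance_c:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


# Simulated sensor readings so the example runs without hardware.
readings = iter([70.0, 60.0, 50.0, 45.0])
cooled = wait_for_baseline(lambda: next(readings), baseline_c=45.0, sleep=lambda s: None)
print("cooled to baseline:", cooled)
```

Passing the sensor reader in as a callable keeps the loop independent of where the temperature comes from, whether that is hwmon, nvidia-smi, or a test stub.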

Quick Start

1. Build

nvcc -O2 -std=c++17 -I./include uma_bandwidth_test.cu -o uma_bw -lcudart -lpthread
nvcc -O2 -std=c++17 -I./include uma_contention.cu -o uma_contention -lcudart -lpthread
nvcc -O2 -std=c++17 uma_atomic_test.cu -o uma_atomic -lcudart

On GB10 — use CUDA 13.0 explicitly:

/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 -I./include uma_bandwidth_test.cu -o uma_bw -lcudart -lpthread
/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 -I./include uma_contention.cu -o uma_contention -lcudart -lpthread

2. Calibrate peak bandwidth

./uma_bw --calibrate-peak

3. Run sparkview (recommended — in separate terminal)

cd ~/sparkview && source sparkview-venv/bin/activate && python3 main.py

4. Run full diagnostic

./run_correlated.sh

Output

Each run produces a timestamped zip containing:
uma_bw_run1.txt               default clock bandwidth
uma_bw_run2.txt               capped clock bandwidth
uma_contention_sweep.txt      full contention table
uma_bw_results.json
uma_contention_results.json
peak_calibration.json
spbm_*.txt                    raw power stream
run_guard.log                 thermal guard log
events.json                   classified events
timeline.json                 nanosecond event log
sparkview logs                if sparkview was running

Interpreting Results

Bandwidth runs

Run 1 vs Run 2 delta:
large delta → clock cap affects bandwidth (power-limited)
small delta → bandwidth is memory-bound, not clock-bound
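The Run 1 vs Run 2 comparison reduces to a relative difference. The sketch below makes that explicit. The README does not say what counts as "large", so the 10% threshold is a hypothetical placeholder, as are the function names and sample bandwidths.

```python
def run_delta_pct(default_clock_gbs: float, capped_clock_gbs: float) -> float:
    """Bandwidth lost when clocks are capped, as a percentage of the default-clock run."""
    if default_clock_gbs <= 0:
        raise ValueError("default-clock bandwidth must be positive")
    return 100.0 * (default_clock_gbs - capped_clock_gbs) / default_clock_gbs


def interpret_delta(delta_pct: float, large_threshold_pct: float = 10.0) -> str:
    """Map the delta to the two interpretations above. The threshold is an assumption."""
    if abs(delta_pct) >= large_threshold_pct:
        return "clock cap affects bandwidth (power-limited)"
    return "bandwidth is memory-bound, not clock-bound"


# Illustrative bandwidths only, not measured results.
delta = run_delta_pct(200.0, 150.0)
print(f"{delta:.1f}% -> {interpret_delta(delta)}")
```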

Contention sweep

cpu-write+gpu-read drop%:
on DISCRETE_PCIE → PCIe contention (page migration)
on GB10 UMA → LPDDR5X fabric arbitration

Events

MEMORY+POWER + CRITICAL → system approaching freeze
MEMORY only → fabric saturated, clock still healthy
POWER only → clock-limited, memory headroom remains
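The three event readings above can be written as a lookup. The schema of events.json is not documented in this README, so the sketch takes a plain set of category names and an optional severity string rather than parsing the file. Both inputs and the function name are hypothetical.

```python
from typing import Iterable, Optional


def interpret_event(categories: Iterable[str], severity: Optional[str] = None) -> str:
    """Map event categories (and severity) to the interpretations listed above."""
    cats = {c.upper() for c in categories}
    level = (severity or "").upper()
    if cats == {"MEMORY", "POWER"} and level == "CRITICAL":
        return "system approaching freeze"
    if cats == {"MEMORY"}:
        return "fabric saturated, clock still healthy"
    if cats == {"POWER"}:
        return "clock-limited, memory headroom remains"
    # Combinations the README does not describe are left uninterpreted.
    return "unclassified"


print(interpret_event(["MEMORY", "POWER"], "CRITICAL"))
print(interpret_event(["MEMORY"]))
```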

GB10 Confirmed Baselines

From community contributors (azampatti, pontostroy), CUDA 13.0, driver 580.142:
GPU read idle        161–166 GB/s
GPU write idle       115–116 GB/s
CPU read             7.6–7.7 GB/s
UMA fault latency    16.5 ns p50 (40 cycles)
COLD/WARM ratio      1.00x

Driver gaps confirmed:
NVML memory clock      N/A (use --calibrate-peak)
CUPTI UVM events       CUPTI_ERROR_NOT_READY
Peak BW from driver    0 GB/s
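A new measurement can be sanity-checked against the community bandwidth ranges above. The sketch below hardcodes those three ranges and adds a slack margin. The 5% slack, the metric keys, and the function name are assumptions for illustration.

```python
# Ranges copied from the community baselines above (GB/s).
BASELINES_GBS = {
    "gpu_read_idle": (161.0, 166.0),
    "gpu_write_idle": (115.0, 116.0),
    "cpu_read": (7.6, 7.7),
}


def compare_to_baseline(metric: str, measured_gbs: float, slack_pct: float = 5.0) -> str:
    """Return "below", "within", or "above" relative to the baseline range plus slack."""
    low, high = BASELINES_GBS[metric]
    if measured_gbs < low * (1.0 - slack_pct / 100.0):
        return "below"
    if measured_gbs > high * (1.0 + slack_pct / 100.0):
        return "above"
    return "within"


# Illustrative measurements only.
for metric, value in [("gpu_read_idle", 163.0), ("gpu_write_idle", 90.0)]:
    print(metric, value, compare_to_baseline(metric, value))
```

A "below" result on an idle system is a cue to check clocks, temperature, and background load before treating the number as a regression.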

CUDA Version Requirement

CUDA 13.0    confirmed working on GB10
CUDA 13.1    %clock64 broken on GB10; do not use
CUDA 13.2    %clock64 returns 0, overflow; do not use

PTX Forward Compatibility

All PTX instructions are generic portable primitives — no architecture-specific suffixes.

Validate before running on new hardware:

CUDA_FORCE_PTX_JIT=1 ./uma_bw
CUDA_FORCE_PTX_JIT=1 ./uma_contention --mode gpu-read
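The same validation can be scripted. CUDA_FORCE_PTX_JIT=1 tells the driver to ignore embedded binary code and JIT-compile the embedded PTX instead, so a clean exit suggests the PTX is accepted on the target GPU. The wrapper below is a sketch: the function name is hypothetical, and the probe commands are the two shown above.

```python
import os
import subprocess
import sys
from typing import List


def run_with_forced_jit(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run a command with CUDA_FORCE_PTX_JIT=1 added to the current environment."""
    env = dict(os.environ, CUDA_FORCE_PTX_JIT="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


def validate(commands: List[List[str]]) -> bool:
    """Return True only if every probe exits with status 0 under forced PTX JIT."""
    ok = True
    for cmd in commands:
        result = run_with_forced_jit(cmd)
        print(" ".join(cmd), "->", "ok" if result.returncode == 0 else "FAILED")
        ok = ok and result.returncode == 0
    return ok


# Stand-in command so the example runs without the probes built.
check = run_with_forced_jit(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_FORCE_PTX_JIT'])"]
)
print("child sees CUDA_FORCE_PTX_JIT =", check.stdout.strip())

# On a GB10 system with the probes built:
# validate([["./uma_bw"], ["./uma_contention", "--mode", "gpu-read"]])
```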

Known Limitations

UVM internal state not directly observable
Memory clock unavailable via NVML
Classification based on inference from external signals
GB10 only — not designed for discrete PCIe platforms

Related Tools

sparkview — GB10-aware GPU monitor with PSI pressure and clock state detection
nvidia-uma-fault-probe — UMA fault latency and bandwidth (forum-facing stable version)

Community

NVIDIA Developer Forums baseline thread: https://forums.developer.nvidia.com/t/gb10-hardware-baseline-first-direct-measurements-and-findings/367851

Author

parallelArchitect
Human-directed GPU engineering with AI assistance

License

MIT
