GitHub - KOKOSde/nsys-llm-explainer: Offline Nsight Systems SQLite explainer for LLM inference hotspots

nsys-llm-explainer

Current release: v0.3.3

Why this exists

Nsight Systems SQLite exports are powerful but tedious to inspect by hand when you need an answer quickly.

This tool turns trace.sqlite into a prioritized report with explicit evidence:

Prioritized findings with severity, evidence, and concrete next actions
Top CUDA kernels and launch storms
Top CPU↔GPU barriers, including blocking memcpy and CPU launcher gaps
Top NCCL ops, with overlap against non-NCCL compute kernels
Per-process breakdowns for vLLM-style multi-process traces
NVLink during NCCL when NVLink counters are present
Exact capture instructions when required counters are missing

The repo is intentionally conservative:

It only claims NCCL/NVLink correlation when the exported SQLite data supports it.
If NVLink counters are missing, it prints NVLink counters not found and tells you exactly how to re-capture.
If only NCCL kernel names are available, it degrades to kernel-name-based NCCL detection instead of pretending it saw higher-level collectives.

Why this is different

It is evidence-first: every major section includes derivation and limitation notes, so conclusions are auditable.
It is safe by default: missing counters, weak PID attribution, or low NVTX coverage are surfaced as warnings instead of hidden.
It is workflow-ready: one run produces both human-readable report.md and analysis tables/JSON for downstream dashboards.
It supports regression checks: the dashboard accepts current + baseline traces and reports a direct top-kernel delta.

New: NCCL + NVLink + Barrier analysis

The fastest way to verify the current report shape is the committed synthetic example:

Example report with all current section headers: examples/synthetic/report.md
The example is generated from the synthetic SQLite fixture in tests/test_synthetic_sqlite.py, so it is small and reproducible.
Exact visible section names: Global critical path suspects, Top NCCL ops, NVLink during NCCL, Top CPU↔GPU barriers, Per-process breakdown
If the report says NVLink counters not found, jump to NVLink counters guidance for the re-capture command.

Install

python3 -m pip install -e .

For tests:

python3 -m pip install -e .[dev]

Run

nsys-llm-explain trace.sqlite --out artifacts/run_YYYYMMDD_HHMMSS/
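The YYYYMMDD_HHMMSS placeholder in the output path is a timestamp you fill in yourself. The following is a minimal sketch of one way to build that path and the matching command line from Python. The helper names `timestamped_run_dir` and `explain_command` are hypothetical and not part of this package; only the `nsys-llm-explain ... --out ...` invocation comes from this README.

```python
import shlex
from datetime import datetime
from pathlib import Path
from typing import List, Optional


def timestamped_run_dir(root: str = "artifacts", now: Optional[datetime] = None) -> Path:
    """Return a path such as artifacts/run_20250131_142501 (nothing is created on disk)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(root) / f"run_{stamp}"


def explain_command(trace: str, out_dir: Path) -> List[str]:
    """Build the argv for the CLI call shown above; hand it to subprocess.run to execute it."""
    return ["nsys-llm-explain", trace, "--out", f"{out_dir.as_posix()}/"]


# A fixed timestamp keeps the example deterministic.
out = timestamped_run_dir(now=datetime(2025, 1, 31, 14, 25, 1))
print(shlex.join(explain_command("trace.sqlite", out)))
```

Keeping one directory per run means earlier reports are never overwritten, which is convenient when comparing a current trace against a baseline later.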

Useful flags:

--print-schema: dump the detected SQLite tables/columns first
--phase-map phases.json: optional NVTX phase grouping

Dashboard

Dashboard dependencies are included in the base install:

python3 -m pip install -e .

Launch:

python3 -m nsys_llm_explainer.dashboard --db path/to/trace.sqlite

You can also launch from exported JSON:

python3 -m nsys_llm_explainer.dashboard --db artifacts/run_YYYYMMDD_HHMMSS/report.json

The dashboard provides SQLite-backed trace visualization with a kernel waterfall, roofline analysis, NCCL overlap summaries, NVLink correlation, and historical comparison mode (current vs baseline) in one scrollable dark-theme view.

Dashboard highlights:

Drag-and-drop current and baseline .sqlite/.json files for side-by-side trend checks.
KPI banner for total GPU time, total CPU time, detected bottleneck, framework hints, and baseline delta.
Uploading a prior report.json is supported; when NVLink sidecar tables are available, they are loaded automatically.
Kernel waterfall with threshold filtering and baseline overlays.
Roofline scatter + top-50 timeline (with NCCL visibility toggle).
NCCL collective summaries with overlap, NVLink utilization timeline, stream-overlap summary, and launch-latency histogram.
Phase split and per-rank NCCL skew views for multi-rank runs.
One-click static export to dashboard_export_YYYYMMDD_HHMMSS.html.
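The baseline delta in the KPI banner can be pictured as a comparison of per-kernel totals between two runs. The sketch below shows that idea under stated assumptions: the function name `top_kernel_delta`, the dictionary input shape, and the sample numbers are all hypothetical, and the dashboard's actual computation may differ.

```python
from typing import Dict, Optional, Tuple


def top_kernel_delta(
    current: Dict[str, float], baseline: Dict[str, float]
) -> Tuple[str, float, Optional[float]]:
    """Return (top kernel in the current run, its total ms, change vs baseline in ms).

    The change is None when the baseline run has no kernel with that name.
    """
    name = max(current, key=lambda kernel: current[kernel])
    base = baseline.get(name)
    delta = None if base is None else current[name] - base
    return name, current[name], delta


# Hypothetical per-kernel totals in milliseconds.
current_ms = {"computeKernel": 2.6, "copyKernel": 1.1}
baseline_ms = {"computeKernel": 2.0, "copyKernel": 1.1}

name, total, delta = top_kernel_delta(current_ms, baseline_ms)
change = "n/a" if delta is None else f"{delta:+.3f} ms"
print(f"{name}: {total:.3f} ms ({change} vs baseline)")
```

Returning None rather than 0 for a kernel that is missing from the baseline keeps "new kernel" distinguishable from "no change".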

Dashboard screenshots

NCCL collectives and NVLink utilization: compare collective-time distribution against fabric activity to check whether communication cost and link usage line up.

Phase split and per-rank NCCL skew: identify the dominant high-level phase and spot ranks that trail peers during NCCL operations.

Roofline scatter and timeline top-50 kernels: determine whether dominant kernels are compute-bound or bandwidth-bound and where the heaviest kernels land in time.

Stream concurrency and launch latency: diagnose overlap quality, CPU launch overhead, and long-tail launch behavior.

Kernel waterfall: inspect execution ordering and quickly spot long kernels, bursty launches, and synchronization boundaries across streams.

Hugging Face Space (Gradio)

A Space-ready Gradio wrapper lives in:

Space title:

nsys-llm-explainer — Instant Nsight Trace Analyzer for Cloud LLM Inference

This wrapper accepts .sqlite or report.json, renders quick findings/summary tabs, and exposes report/CSV downloads.

API service + client

Install API/client extras:

python3 -m pip install -e .[api,client]

Run service:

nsys-llm-api --host 0.0.0.0 --port 8080

Optional API-key protection:

export NSYS_API_KEY="change-me"
nsys-llm-api --host 0.0.0.0 --port 8080

Python client and curl usage:

Production deploy kits

Container and cloud quickstarts:

Hugging Face application pack

Role-targeted resume bullets, cover letters, and execution plan:

What the report includes

The generated output directory contains:

report.md, report.json
report.json includes findings (severity + evidence + recommendations) and explicit warnings
tables/kernels.csv
tables/barriers.csv
tables/nccl_ops.csv
tables/nccl_rank_skew.csv
tables/nccl_by_pid.csv
tables/nvlink_during_nccl.csv
tables/nvlink_timeseries.csv
tables/timeline_events.csv
tables/copy_engine_events.csv
tables/launch_latency_rows.csv
tables/launch_latency_histogram.csv
tables/stream_overlap.csv
tables/phase_split.csv
tables/roofline.csv
tables/gpu_idle_gaps.csv
tables/kernels_by_pid.csv, tables/sync_by_pid.csv, tables/nvtx_by_pid.csv, tables/nvtx_ranges.csv
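Because report.json carries findings and warnings, it can be consumed by scripts such as a CI gate. The sketch below is an assumption-heavy illustration: the top-level keys `findings` and `warnings` and the `severity` field follow the description above, while the `title` field and the exact value types are hypothetical. Check a real report.json before relying on any of them.

```python
import json
from typing import Any, Dict, List


def summarize(report: Dict[str, Any]) -> List[str]:
    """Flatten findings and warnings into one printable line each."""
    lines: List[str] = []
    for finding in report.get("findings", []):
        severity = finding.get("severity", "unknown")
        title = finding.get("title", "<untitled>")  # hypothetical field name
        lines.append(f"[{severity}] {title}")
    for warning in report.get("warnings", []):
        lines.append(f"[warning] {warning}")
    return lines


# Inline stand-in; in practice use json.load on an open report.json file.
sample = json.loads(
    '{"findings": [{"severity": "high", "title": "Launch storm"}],'
    ' "warnings": ["NVLink counters not found"]}'
)
for line in summarize(sample):
    print(line)
```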

New report sections:

Global critical path suspects
Top NCCL ops
NVLink during NCCL
Top CPU↔GPU barriers
Per-process breakdown

Additional dashboard-ready metrics (stored in report.json + tables) include:

Timeline top events
Copy engine activity
Launch latency distribution
Stream overlap summary
Phase split
Roofline metrics

Capture recipes

1. Vanilla CUDA workloads

Use this to capture kernels, launch gaps, barriers, and NVTX ranges, with CUDA Graphs-aware tracing:

nsys profile \
  --trace=cuda,nvtx,osrt \
  --sample=none \
  --cpuctxsw=none \
  --cuda-trace-scope=process-tree \
  --cuda-graph-trace=node \
  -o trace \
  python your_workload.py

nsys export \
  --type sqlite \
  --output trace.sqlite \
  --force-overwrite=true \
  --lazy=false \
  trace.nsys-rep

nsys-llm-explain trace.sqlite --out artifacts/run_YYYYMMDD_HHMMSS/

Notes:

--cuda-trace-scope=process-tree keeps child processes in the trace, which matters for vLLM-style worker processes.
--cuda-graph-trace=node is recommended when the workload uses CUDA Graphs.

2. NCCL multi-process / multi-node

Use this when you want NCCL ops to survive export cleanly:

nsys profile \
  --trace=nccl,cuda,nvtx,osrt \
  --nccl-trace=all \
  --sample=none \
  --cpuctxsw=none \
  --cuda-trace-scope=process-tree \
  --cuda-graph-trace=node \
  -o trace \
  torchrun --nproc_per_node=... your_workload.py

nsys export \
  --type sqlite \
  --include-json true \
  --output trace.sqlite \
  --force-overwrite=true \
  --lazy=false \
  trace.nsys-rep

Why --include-json true matters:

Nsight Systems exports some event classes, including NVTX events with user-defined payloads, only when JSON export is included.
Recent NCCL tracing support in Nsight Systems surfaces NCCL activity as NVTX-backed events in the exported database.

This tool will still fall back to runtime API names or NCCL kernel names when those richer NCCL events are absent.
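The kernel-name fallback can be illustrated with a small classifier. This is a sketch of the general idea, not the tool's actual matching rules. It assumes NCCL device kernels carry names beginning with `ncclKernel_` or `ncclDevKernel_` followed by the collective name, which holds for many NCCL builds but is not guaranteed for every version.

```python
import re
from typing import Optional

# Assumed naming pattern: "ncclKernel_<Collective>_..." or "ncclDevKernel_<Collective>_...".
_NCCL_KERNEL = re.compile(r"^nccl(?:Dev)?Kernel_([A-Za-z]+)")


def nccl_op_from_kernel(kernel_name: str) -> Optional[str]:
    """Return the lower-cased collective name, or None for non-NCCL kernels."""
    match = _NCCL_KERNEL.match(kernel_name)
    return match.group(1).lower() if match else None


for kernel in (
    "ncclDevKernel_AllReduce_Sum_f32_RING_LL",
    "ncclKernel_Broadcast_RING_LL_Sum_int8_t",
    "computeKernel",
):
    print(kernel, "->", nccl_op_from_kernel(kernel))
```

Detection by kernel name only identifies which collective a kernel belongs to; it cannot recover message sizes or communicator information, which is why the richer NCCL events are preferred when present.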

3. NVLink counters guidance

The tool can only report NVLink during NCCL when the SQLite export contains GPU Metrics tables with NVLink-related metrics.

First, list the supported GPU metric sets on your machine:

nsys profile --gpu-metrics-devices=all --gpu-metrics-set=help

Then re-capture with a supported metric set:

sudo nsys profile \
  --trace=nccl,cuda,nvtx,osrt \
  --nccl-trace=all \
  --sample=none \
  --cpuctxsw=none \
  --cuda-trace-scope=process-tree \
  --cuda-graph-trace=node \
  --gpu-metrics-devices=all \
  --gpu-metrics-set=<supported-set> \
  --gpu-metrics-frequency=10000 \
  -o trace \
  torchrun --nproc_per_node=... your_workload.py

nsys export \
  --type sqlite \
  --include-json true \
  --output trace.sqlite \
  --force-overwrite=true \
  --lazy=false \
  trace.nsys-rep

Notes:

On Linux, GPU Metrics collection typically requires elevated permissions.
When those counters are missing, the report prints NVLink counters not found instead of inventing a correlation result.

Sample output

The snippets below are generated from the synthetic SQLite fixture used in tests, not from a real GPU capture.

CLI sample:

Wrote report to: /tmp/.../out/report.md
Top kernel: computeKernel (42.6% of kernel time, 2.6 ms, 3 calls)
Top barrier: cudaStreamSynchronize [sync_api] (0.8 ms, 1 events)
Top NCCL op: allreduce (2.0 ms total, 2.0 ms max, overlap 50.0%)
Launch storm: 5 launches over 0.005s = 909.1 launches/s; median kernel 1000.00 us
GPU idle estimate: GPU 0: 18.2% idle (1.0 ms / 5.5 ms window)
NVLink during NCCL: NVLink counters not found
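The launch-rate and idle figures in the CLI sample follow from simple interval arithmetic. The sketch below reproduces that arithmetic on hypothetical kernel intervals chosen to give the same totals; it is not the fixture's data, and it assumes the window runs from the first kernel start to the last kernel end, which the README does not state.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Union overlapping or touching intervals so busy time is not double counted."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def launch_rate_and_idle(kernels: List[Interval]) -> Tuple[float, float]:
    """Return (launches per second, idle percent) over the first-start to last-end window."""
    window_ns = max(end for _, end in kernels) - min(start for start, _ in kernels)
    busy_ns = sum(end - start for start, end in merge(kernels))
    launches_per_s = len(kernels) / (window_ns / 1e9)
    idle_pct = 100.0 * (window_ns - busy_ns) / window_ns
    return launches_per_s, idle_pct


# Hypothetical kernels: 5 launches in a 5.5 ms window with 1.0 ms of gaps.
kernels = [
    (0, 1_000_000),
    (1_000_000, 2_000_000),
    (2_500_000, 3_500_000),
    (3_500_000, 4_500_000),
    (5_000_000, 5_500_000),
]
rate, idle = launch_rate_and_idle(kernels)
print(f"{rate:.1f} launches/s, {idle:.1f}% idle")
```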

Report sample:

## Global critical path suspects
| kind | name | total_ms | count | details |
| kernel | computeKernel | 2.600 | 3 | 42.6% of kernel time |
| nccl | allreduce | 2.000 | 1 | max 2.000 ms |

## Top NCCL ops
| op_name | source | total_time_ms | max_duration_ms | count | compute_overlap_ms | compute_overlap_pct | straggler |
| allreduce | kernel | 2.000 | 2.000 | 1 | 1.000 | 50.0 | pid:111 |
| broadcast | kernel | 1.500 | 1.500 | 1 | 0.600 | 40.0 | pid:222 |

## Top CPU↔GPU barriers
| barrier_kind | api_name | total_time_ms | count | avg_duration_us | max_duration_us |
| sync_api | cudaStreamSynchronize | 0.800 | 1 | 800.00 | 800.00 |
| sync_api | cudaDeviceSynchronize | 0.700 | 1 | 700.00 | 700.00 |
| blocking_memcpy | cudaMemcpy | 0.600 | 1 | 600.00 | 600.00 |
| cpu_launcher_gap | cpu_launcher_gap | 0.200 | 1 | 200.00 | 200.00 |
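The compute_overlap_ms and compute_overlap_pct columns in the Top NCCL ops table can be read as the portion of an NCCL interval that coincides with non-NCCL compute kernels. The following sketch shows one way to compute that, assuming intervals in nanoseconds; the function names and the sample intervals are hypothetical, and the tool's own derivation notes remain the authority.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Union overlapping compute kernels so shared time is counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def compute_overlap_ns(nccl: Interval, compute: List[Interval]) -> int:
    """Nanoseconds of the NCCL interval covered by at least one compute kernel."""
    nccl_start, nccl_end = nccl
    return sum(
        max(0, min(end, nccl_end) - max(start, nccl_start))
        for start, end in merge(compute)
    )


# Hypothetical 2.0 ms NCCL op with two compute kernels that overlap each other.
nccl_op = (0, 2_000_000)
compute_kernels = [(500_000, 1_000_000), (800_000, 1_500_000)]

overlap = compute_overlap_ns(nccl_op, compute_kernels)
overlap_pct = 100.0 * overlap / (nccl_op[1] - nccl_op[0])
print(f"overlap {overlap / 1e6:.3f} ms ({overlap_pct:.1f}%)")
```

Merging the compute intervals first matters on multi-stream workloads, where summing raw per-kernel intersections could report more overlap than the NCCL op's own duration.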

When NVLink metrics are present, the synthetic fixture produces a row like:

metric_source_id: 0
metric_names: NVLink bytes received, NVLink bytes transmitted
avg_metric_during_nccl: 76.67
avg_metric_outside_nccl: 5.83
nccl_activity_correlation: 0.990
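One plausible reading of these fields is that metric samples are split by whether their timestamp falls inside an NCCL interval, and that the correlation compares the metric against a 0/1 NCCL-activity indicator. The sketch below implements that reading only as an illustration: the README does not define the formula, and the function names and sample values are hypothetical.

```python
from math import sqrt
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns)
Sample = Tuple[int, float]  # (timestamp_ns, metric value)


def mean(values: List[float]) -> Optional[float]:
    return sum(values) / len(values) if values else None


def pearson(xs: List[float], ys: List[float]) -> Optional[float]:
    """Pearson correlation, or None when either series has no variance."""
    if len(xs) < 2:
        return None
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return None if vx == 0 or vy == 0 else cov / sqrt(vx * vy)


def nvlink_vs_nccl(samples: List[Sample], nccl: List[Interval]):
    """Return (avg during NCCL, avg outside NCCL, correlation with NCCL activity)."""
    flags = [1.0 if any(s <= ts < e for s, e in nccl) else 0.0 for ts, _ in samples]
    values = [value for _, value in samples]
    during = [v for v, f in zip(values, flags) if f]
    outside = [v for v, f in zip(values, flags) if not f]
    return mean(during), mean(outside), pearson(flags, values)


# Hypothetical samples: the metric is high only while the NCCL interval is active.
samples = [(0, 5.0), (10, 80.0), (20, 70.0), (30, 6.0)]
during, outside, corr = nvlink_vs_nccl(samples, nccl=[(10, 30)])
print(during, outside, corr)
```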

Committed examples

examples/synthetic/report.md: synthetic, fixture-generated, and guaranteed to show the new NCCL/NVLink/barrier/per-process sections.
examples/a100_vllm/: historical example from a real A100 vLLM run; only the generated output is committed.

Schema compatibility

Nsight Systems SQLite schema varies by version and by capture options.

This tool probes the schema at runtime and degrades gracefully:

String table: prefers StringIds(id, value)
Kernels: prefers CUPTI_ACTIVITY_KIND_KERNEL, falls back to concurrent-kernel variants
Runtime API: prefers CUPTI_ACTIVITY_KIND_RUNTIME
NVTX: prefers NVTX_EVENTS
GPU Metrics: looks for GPU_METRICS and TARGET_INFO_GPU_METRICS
CUDA Graphs capture awareness: the README and report assume --cuda-graph-trace=node for graph-heavy workloads

If a section cannot be computed, the report says so explicitly and states the limitation instead of silently omitting the section.
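The "prefer one table, fall back to another, otherwise report the gap" pattern can be sketched with a lookup against sqlite_master. This is an illustration of the approach rather than the tool's code. The preferred table name comes from the list above; the fallback name `CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL` and the helper `first_existing_table` are assumptions made for the example.

```python
import sqlite3
from typing import Optional, Sequence

# Preferred name first; the second entry is an assumed name for a concurrent-kernel variant.
KERNEL_TABLES = ("CUPTI_ACTIVITY_KIND_KERNEL", "CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL")


def first_existing_table(conn: sqlite3.Connection, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate present in the database, or None if none exist."""
    present = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')")}
    for name in candidates:
        if name in present:
            return name
    return None


# In-memory stand-in that only has the fallback kernel table and no NVTX table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL (start INTEGER, "end" INTEGER)')

kernel_table = first_existing_table(conn, KERNEL_TABLES)
nvtx_table = first_existing_table(conn, ("NVTX_EVENTS",))
print(kernel_table or "kernel table not found")
print(nvtx_table or "NVTX table not found: report the section as unavailable")
```

Returning None instead of raising lets the caller emit an explicit "not available" note for that section while the rest of the report is still produced.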

Reproduce locally

Install:

python3 -m pip install -e .[dev]

Run on a trace:

nsys-llm-explain trace.sqlite --out artifacts/run_YYYYMMDD_HHMMSS/

Run tests:

python3 -m pytest -q

References

Primary NVIDIA documentation used for the capture/export guidance:

Nsight Systems User Guide: GPU metrics collection, --gpu-metrics-set=help, and required permissions
Nsight Systems User Guide: --cuda-graph-trace=node
Nsight Systems User Guide: --cuda-trace-scope=process-tree
Nsight Systems User Guide: NCCL tracing and --nccl-trace=all
Nsight Systems User Guide: SQLite export --include-json true
