KOKOSde/nsys-llm-explainer

nsys-llm-explainer

Current release: v0.3.3

Why this exists

Nsight Systems SQLite exports are powerful but tedious to inspect by hand when you need an answer quickly.

This tool turns trace.sqlite into a prioritized report with explicit evidence:

  • Prioritized findings with severity, evidence, and concrete next actions
  • Top CUDA kernels and launch storms
  • Top CPU↔GPU barriers, including blocking memcpy and CPU launcher gaps
  • Top NCCL ops, with overlap against non-NCCL compute kernels
  • Per-process breakdowns for vLLM-style multi-process traces
  • NVLink during NCCL when NVLink counters are present
  • Exact capture instructions when required counters are missing

The repo is intentionally conservative:

  • It only claims NCCL/NVLink correlation when the exported SQLite data supports it.
  • If NVLink counters are missing, it prints "NVLink counters not found" and tells you exactly how to re-capture.
  • If only NCCL kernel names are available, it degrades to kernel-name-based NCCL detection instead of pretending it saw higher-level collectives.

Why this is different

  • It is evidence-first: every major section includes derivation and limitation notes, so conclusions are auditable.
  • It is safe by default: missing counters, weak PID attribution, or low NVTX coverage are surfaced as warnings instead of hidden.
  • It is workflow-ready: one run produces both human-readable report.md and analysis tables/JSON for downstream dashboards.
  • It supports regression checks: the dashboard accepts current + baseline traces and reports a direct top-kernel delta.

New: NCCL + NVLink + Barrier analysis

The fastest way to verify the current report shape is the committed synthetic example:

  • Example report with all current section headers: examples/synthetic/report.md
  • The example is generated from the synthetic SQLite fixture in tests/test_synthetic_sqlite.py, so it is small and reproducible.
  • Exact visible section names: Global critical path suspects, Top NCCL ops, NVLink during NCCL, Top CPU↔GPU barriers, Per-process breakdown
  • If the report says "NVLink counters not found", jump to "NVLink counters guidance" (recipe 3 under Capture recipes) for the re-capture command.

Install

python3 -m pip install -e .

For tests:

python3 -m pip install -e .[dev]

Run

nsys-llm-explain trace.sqlite --out artifacts/run_YYYYMMDD_HHMMSS/

Useful flags:

  • --print-schema: dump the detected SQLite tables/columns first
  • --phase-map phases.json: optional NVTX phase grouping
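
For orientation, a schema dump of this kind can be approximated with the Python standard library. The sketch below is not the tool's implementation: the function name dump_schema is hypothetical, and the in-memory table only stands in for a real export.

```python
import sqlite3


def dump_schema(conn: sqlite3.Connection) -> dict:
    """Return {table_name: [column names]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # Quote the identifier so unusual table names cannot break the PRAGMA.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        schema[table] = [col[1] for col in conn.execute(f"PRAGMA table_info({quoted})")]
    return schema


if __name__ == "__main__":
    # Tiny in-memory stand-in for trace.sqlite; a real export has many more tables.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE StringIds (id INTEGER, value TEXT)")
    for table, columns in dump_schema(demo).items():
        print(f"{table}: {', '.join(columns)}")
```

Pointing sqlite3.connect at a real trace.sqlite instead of ":memory:" lists whatever tables that particular Nsight Systems version exported.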

Dashboard

Dashboard dependencies are included in the base install:

python3 -m pip install -e .

Launch:

python3 -m nsys_llm_explainer.dashboard --db path/to/trace.sqlite

You can also launch from exported JSON:

python3 -m nsys_llm_explainer.dashboard --db artifacts/run_YYYYMMDD_HHMMSS/report.json

The dashboard provides SQLite-backed trace visualization with a kernel waterfall, roofline analysis, NCCL overlap summaries, NVLink correlation, and historical comparison mode (current vs baseline) in one scrollable dark-theme view.

Dashboard highlights:

  • Drag-and-drop current and baseline .sqlite/.json files for side-by-side trend checks.
  • KPI banner for total GPU time, total CPU time, detected bottleneck, framework hints, and baseline delta.
  • Upload from prior report.json is supported; when available, NVLink sidecar tables are auto-loaded.
  • Kernel waterfall with threshold filtering and baseline overlays.
  • Roofline scatter + top-50 timeline (with NCCL visibility toggle).
  • NCCL collective summaries with overlap, NVLink utilization timeline, stream-overlap summary, and launch-latency histogram.
  • Phase split and per-rank NCCL skew views for multi-rank runs.
  • One-click static export to dashboard_export_YYYYMMDD_HHMMSS.html.

Dashboard screenshots

NCCL collectives and NVLink utilization: compare collective-time distribution against fabric activity to check whether communication cost and link usage line up.

Phase split and per-rank NCCL skew: identify the dominant high-level phase and spot ranks that trail peers during NCCL operations.

Roofline scatter and timeline top-50 kernels: determine whether dominant kernels are compute-bound or bandwidth-bound and where the heaviest kernels land in time.

Stream concurrency and launch latency: diagnose overlap quality, CPU launch overhead, and long-tail launch behavior.

Kernel waterfall: inspect execution ordering and quickly spot long kernels, bursty launches, and synchronization boundaries across streams.

Hugging Face Space (Gradio)

A Space-ready Gradio wrapper lives in:

Space title:

  • nsys-llm-explainer — Instant Nsight Trace Analyzer for Cloud LLM Inference

This wrapper accepts .sqlite or report.json, renders quick findings/summary tabs, and exposes report/CSV downloads.

API service + client

Install API/client extras:

python3 -m pip install -e .[api,client]

Run service:

nsys-llm-api --host 0.0.0.0 --port 8080

Optional API-key protection:

export NSYS_API_KEY="change-me"
nsys-llm-api --host 0.0.0.0 --port 8080

Python client and curl usage:

Production deploy kits

Container and cloud quickstarts:

Hugging Face application pack

Role-targeted resume bullets, cover letters, and execution plan:

What the report includes

The generated output directory contains:

  • report.md, report.json
  • report.json includes findings (severity + evidence + recommendations) and explicit warnings
  • tables/kernels.csv
  • tables/barriers.csv
  • tables/nccl_ops.csv
  • tables/nccl_rank_skew.csv
  • tables/nccl_by_pid.csv
  • tables/nvlink_during_nccl.csv
  • tables/nvlink_timeseries.csv
  • tables/timeline_events.csv
  • tables/copy_engine_events.csv
  • tables/launch_latency_rows.csv
  • tables/launch_latency_histogram.csv
  • tables/stream_overlap.csv
  • tables/phase_split.csv
  • tables/roofline.csv
  • tables/gpu_idle_gaps.csv
  • tables/kernels_by_pid.csv, tables/sync_by_pid.csv, tables/nvtx_by_pid.csv, tables/nvtx_ranges.csv

New report sections:

  • Global critical path suspects
  • Top NCCL ops
  • NVLink during NCCL
  • Top CPU↔GPU barriers
  • Per-process breakdown

Additional dashboard-ready metrics (stored in report.json + tables) include:

  • Timeline top events
  • Copy engine activity
  • Launch latency distribution
  • Stream overlap summary
  • Phase split
  • Roofline metrics
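
A quick way to see which of these tables a run actually produced is to walk the tables/ directory. This is a minimal sketch that assumes only the directory layout listed above; it reads column names from each file rather than assuming them, and summarize_tables is a hypothetical helper, not part of the package.

```python
import csv
from pathlib import Path


def summarize_tables(out_dir: str) -> dict:
    """Map each tables/*.csv file name to (header columns, data row count)."""
    summary = {}
    for path in sorted(Path(out_dir, "tables").glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # empty file -> empty header
            summary[path.name] = (header, sum(1 for _ in reader))
    return summary


if __name__ == "__main__":
    # Replace with the directory passed to --out.
    for name, (header, rows) in summarize_tables("artifacts/run_YYYYMMDD_HHMMSS").items():
        print(f"{name}: {rows} rows, columns = {header}")
```

A table with zero data rows usually means the corresponding section could not be computed for that trace, which is worth cross-checking against the warnings in report.json.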

Capture recipes

1. Vanilla CUDA workloads

Use this when you want kernels, launch gaps, barriers, NVTX, and CUDA-graphs-aware capture:

nsys profile \
  --trace=cuda,nvtx,osrt \
  --sample=none \
  --cpuctxsw=none \
  --cuda-trace-scope=process-tree \
  --cuda-graph-trace=node \
  -o trace \
  python your_workload.py

nsys export \
  --type sqlite \
  --output trace.sqlite \
  --force-overwrite=true \
  --lazy=false \
  trace.nsys-rep

nsys-llm-explain trace.sqlite --out artifacts/run_YYYYMMDD_HHMMSS/

Notes:

  • --cuda-trace-scope=process-tree keeps child processes in the trace, which matters for vLLM-style worker processes.
  • --cuda-graph-trace=node is recommended when the workload uses CUDA Graphs.

2. NCCL multi-process / multi-node

Use this when you want NCCL ops to survive export cleanly:

nsys profile \
  --trace=nccl,cuda,nvtx,osrt \
  --nccl-trace=all \
  --sample=none \
  --cpuctxsw=none \
  --cuda-trace-scope=process-tree \
  --cuda-graph-trace=node \
  -o trace \
  torchrun --nproc_per_node=... your_workload.py

nsys export \
  --type sqlite \
  --include-json true \
  --output trace.sqlite \
  --force-overwrite=true \
  --lazy=false \
  trace.nsys-rep

Why --include-json true matters:

  • Nsight Systems exports some event classes, including NVTX events with user-defined payloads, only when JSON export is included.
  • Recent NCCL tracing support in Nsight Systems surfaces NCCL activity as NVTX-backed events in the exported database.

This tool will still fall back to runtime API names or NCCL kernel names when those richer NCCL events are absent.
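
The kernel-name fallback can be pictured as a simple pattern match. The sketch below is an illustration, not the tool's code: the regular expression and the function name nccl_op_from_kernel are assumptions, and NCCL kernel naming differs between NCCL versions, so a real matcher would likely need more patterns.

```python
import re
from typing import Optional

# Assumed pattern: NCCL device kernels commonly start with "ncclKernel_" or
# "ncclDevKernel_" followed by the collective name (naming varies by version).
_NCCL_KERNEL = re.compile(r"^nccl(?:Dev)?Kernel_([A-Za-z]+)")


def nccl_op_from_kernel(kernel_name: str) -> Optional[str]:
    """Return a lowercase collective label, or None for non-NCCL kernels."""
    match = _NCCL_KERNEL.match(kernel_name)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    for name in (
        "ncclDevKernel_AllReduce_Sum_f32_RING_LL",
        "ncclKernel_Broadcast_RING_LL_Sum_int8_t",
        "computeKernel",
    ):
        print(name, "->", nccl_op_from_kernel(name))
```

A label recovered this way only says that an NCCL kernel ran; it carries no message sizes or communicator information, which is why the report marks such rows with a kernel source instead of presenting them as traced collectives.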

3. NVLink counters guidance

The tool can only report NVLink during NCCL when the SQLite export contains GPU Metrics tables with NVLink-related metrics.

First, list the supported GPU metric sets on your machine:

nsys profile --gpu-metrics-devices=all --gpu-metrics-set=help

Then re-capture with a supported metric set:

sudo nsys profile \
  --trace=nccl,cuda,nvtx,osrt \
  --nccl-trace=all \
  --sample=none \
  --cpuctxsw=none \
  --cuda-trace-scope=process-tree \
  --cuda-graph-trace=node \
  --gpu-metrics-devices=all \
  --gpu-metrics-set=<supported-set> \
  --gpu-metrics-frequency=10000 \
  -o trace \
  torchrun --nproc_per_node=... your_workload.py

nsys export \
  --type sqlite \
  --include-json true \
  --output trace.sqlite \
  --force-overwrite=true \
  --lazy=false \
  trace.nsys-rep

Notes:

  • On Linux, GPU Metrics collection typically requires elevated permissions.
  • When those counters are missing, the report prints "NVLink counters not found" instead of inventing a correlation result.

Sample output

The snippets below are generated from the synthetic SQLite fixture used in tests, not from a real GPU capture.

CLI sample:

Wrote report to: /tmp/.../out/report.md
Top kernel: computeKernel (42.6% of kernel time, 2.6 ms, 3 calls)
Top barrier: cudaStreamSynchronize [sync_api] (0.8 ms, 1 events)
Top NCCL op: allreduce (2.0 ms total, 2.0 ms max, overlap 50.0%)
Launch storm: 5 launches over 0.005s = 909.1 launches/s; median kernel 1000.00 us
GPU idle estimate: GPU 0: 18.2% idle (1.0 ms / 5.5 ms window)
NVLink during NCCL: NVLink counters not found
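
The launch-rate and idle figures above can be checked by hand. The sketch below assumes the launch storm uses the same 5.5 ms window shown on the GPU idle line and that the printed 0.005s is a rounded display of it; that reading is an inference from the sample numbers, not something the output states.

```python
def launches_per_second(launch_count: int, window_s: float) -> float:
    """Launch rate over the observed window."""
    return launch_count / window_s


def idle_percent(idle_ms: float, window_ms: float) -> float:
    """Share of the window during which the GPU ran no kernel."""
    return 100.0 * idle_ms / window_ms


if __name__ == "__main__":
    # Assumed window: 5.5 ms, taken from the "GPU idle estimate" line.
    print(f"{launches_per_second(5, 0.0055):.1f} launches/s")  # 909.1 launches/s
    print(f"{idle_percent(1.0, 5.5):.1f}% idle")               # 18.2% idle
```

Under that assumption both sample lines are mutually consistent: 5 / 0.0055 s ≈ 909.1 launches/s and 1.0 / 5.5 ≈ 18.2%.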

Report sample:

## Global critical path suspects
| kind | name | total_ms | count | details |
| kernel | computeKernel | 2.600 | 3 | 42.6% of kernel time |
| nccl | allreduce | 2.000 | 1 | max 2.000 ms |

## Top NCCL ops
| op_name | source | total_time_ms | max_duration_ms | count | compute_overlap_ms | compute_overlap_pct | straggler |
| allreduce | kernel | 2.000 | 2.000 | 1 | 1.000 | 50.0 | pid:111 |
| broadcast | kernel | 1.500 | 1.500 | 1 | 0.600 | 40.0 | pid:222 |

## Top CPU↔GPU barriers
| barrier_kind | api_name | total_time_ms | count | avg_duration_us | max_duration_us |
| sync_api | cudaStreamSynchronize | 0.800 | 1 | 800.00 | 800.00 |
| sync_api | cudaDeviceSynchronize | 0.700 | 1 | 700.00 | 700.00 |
| blocking_memcpy | cudaMemcpy | 0.600 | 1 | 600.00 | 600.00 |
| cpu_launcher_gap | cpu_launcher_gap | 0.200 | 1 | 200.00 | 200.00 |
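
The compute_overlap_ms and compute_overlap_pct columns in the Top NCCL ops table express how much of an NCCL operation ran concurrently with non-NCCL kernels. One plausible way to compute such a figure is interval intersection, sketched here with made-up timestamps; the tool's exact derivation may differ, and both function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals so shared time is counted once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_ms(op, compute_intervals):
    """Time (ms) during which the op interval coincides with any compute interval."""
    op_start, op_end = op
    total = 0.0
    for start, end in merge_intervals(compute_intervals):
        total += max(0.0, min(op_end, end) - max(op_start, start))
    return total


if __name__ == "__main__":
    allreduce = (0.0, 2.0)                   # illustrative 2.0 ms NCCL op
    compute = [(1.0, 1.6), (1.4, 2.5)]       # two overlapping compute kernels
    covered = overlap_ms(allreduce, compute)
    pct = 100.0 * covered / (allreduce[1] - allreduce[0])
    print(f"overlap {covered:.3f} ms ({pct:.1f}%)")  # overlap 1.000 ms (50.0%)
```

Merging the compute intervals first matters: without it, two kernels running concurrently on different streams would be counted twice and the overlap could exceed the duration of the NCCL op.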

When NVLink metrics are present, the synthetic fixture produces a row like:

metric_source_id: 0
metric_names: NVLink bytes received, NVLink bytes transmitted
avg_metric_during_nccl: 76.67
avg_metric_outside_nccl: 5.83
nccl_activity_correlation: 0.990

Committed examples

  • examples/synthetic/report.md: synthetic, fixture-generated, and guaranteed to show the new NCCL/NVLink/barrier/per-process sections.
  • examples/a100_vllm/: historical real output-only example from an A100 vLLM run.

Schema compatibility

Nsight Systems SQLite schema varies by version and by capture options.

This tool probes the schema at runtime and degrades gracefully:

  • String table: prefers StringIds(id, value)
  • Kernels: prefers CUPTI_ACTIVITY_KIND_KERNEL, falls back to concurrent-kernel variants
  • Runtime API: prefers CUPTI_ACTIVITY_KIND_RUNTIME
  • NVTX: prefers NVTX_EVENTS
  • GPU Metrics: looks for GPU_METRICS and TARGET_INFO_GPU_METRICS
  • CUDA Graphs capture awareness: the README and report assume --cuda-graph-trace=node for graph-heavy workloads

If a section cannot be computed, the report says so explicitly instead of silently omitting the limitation.
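
Runtime probing of this kind can be sketched as a lookup against sqlite_master. The table names below are the ones listed above; the CANDIDATES mapping, the section keys, and the probe function are hypothetical, and the fallback variants the tool checks are deliberately not guessed here.

```python
import sqlite3

# Preferred table per section, using only names mentioned in this README.
# Additional fallback names could be appended to each list in preference order.
CANDIDATES = {
    "strings": ["StringIds"],
    "kernels": ["CUPTI_ACTIVITY_KIND_KERNEL"],
    "runtime_api": ["CUPTI_ACTIVITY_KIND_RUNTIME"],
    "nvtx": ["NVTX_EVENTS"],
    "gpu_metrics": ["GPU_METRICS"],
}


def probe(conn: sqlite3.Connection) -> dict:
    """Return {section: first available table name, or None if none exists}."""
    present = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    return {
        section: next((name for name in names if name in present), None)
        for section, names in CANDIDATES.items()
    }


if __name__ == "__main__":
    # In-memory stand-in for an export that lacks NVTX and GPU Metrics tables.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE StringIds (id INTEGER, value TEXT)")
    conn.execute("CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INTEGER, end INTEGER)")
    for section, table in probe(conn).items():
        print(f"{section}: {table or 'not found, section will be reported as unavailable'}")
```

Returning None instead of raising keeps the decision about how to report a missing section with the caller, which fits the stated goal of saying explicitly what could not be computed.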

Reproduce locally

Install:

python3 -m pip install -e .[dev]

Run on a trace:

nsys-llm-explain trace.sqlite --out artifacts/run_YYYYMMDD_HHMMSS/

Run tests:

python3 -m pytest -q

References

Primary NVIDIA documentation used for the capture/export guidance:

  • Nsight Systems User Guide: GPU metrics collection, --gpu-metrics-set=help, and required permissions
  • Nsight Systems User Guide: --cuda-graph-trace=node
  • Nsight Systems User Guide: --cuda-trace-scope=process-tree
  • Nsight Systems User Guide: NCCL tracing and --nccl-trace=all
  • Nsight Systems User Guide: SQLite export --include-json true
