feat: add roofline analysis (RooflineAnalyzer, measure_peaks)#5

Merged
nathanhubens merged 2 commits into master from feature/roofline on Apr 21, 2026

@nathanhubens (Contributor)

Summary

Adds roofline-model measurement primitives to fasterbench. Per the measurement-only contract, this module exposes numbers and a plot; compression decisions live in fasterrecipes.

New public API

  • measure_peaks(device, *, dtype, matmul_size, bandwidth_mb, warmup, steps, allow_tf32, cache) — empirical probe of achievable peak FLOPs/s (via large matmul) and streaming bandwidth (via cache-defeating memcpy). Pins TF32 off by default for honest fp32 peaks. Records tf32_enabled and cudnn_benchmark flags in the result.
  • HardwarePeaks — dataclass with peak_flops, peak_bandwidth, ridge_point, device, dtype, and the pinned flags.
  • RooflineAnalyzer(model, sample, peaks=None) — per-layer profiler with .profile(device, warmup, steps), .summary(top), .plot(title), .results. Single-pass forward hooks measure FLOPs (analytically for Conv{1,2,3}d, ConvTranspose{1,2,3}d, Linear), bytes moved (weights + input + output per Williams 2009), and wall time. Layers outside those types get bound="undefined" with a warning — no silent zeros.
  • RooflinePoint — per-layer result with flops, bytes_moved, time_s, arithmetic_intensity, achieved_gflops, bound ("memory"/"compute"/"undefined"), utilization_pct.
  • clear_peaks_cache() — explicit cache reset.
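
To make the relationship between these fields concrete, here is a minimal sketch of the roofline arithmetic. It is not the module's code: `Peaks` and `classify` are hypothetical names, and defining utilization against the roof at the layer's own arithmetic intensity is an assumption, since the description does not say how `utilization_pct` is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peaks:
    """Hypothetical stand-in for HardwarePeaks (only the fields used here)."""
    peak_flops: float      # FLOP/s
    peak_bandwidth: float  # bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) at which the two roofs meet.
        return self.peak_flops / self.peak_bandwidth


def classify(flops: float, bytes_moved: float, time_s: float, peaks: Peaks) -> dict:
    """Place one layer on the roofline; returns plain numbers only."""
    if flops <= 0 or bytes_moved <= 0 or time_s <= 0:
        return {"bound": "undefined"}
    ai = flops / bytes_moved
    achieved = flops / time_s
    # Attainable performance at this intensity is the lower of the two roofs.
    roof = min(peaks.peak_flops, ai * peaks.peak_bandwidth)
    return {
        "arithmetic_intensity": ai,
        "achieved_gflops": achieved / 1e9,
        "bound": "memory" if ai < peaks.ridge_point else "compute",
        # Assumption: utilization is measured against the roof at this AI.
        "utilization_pct": 100.0 * achieved / roof,
    }
```

With synthetic peaks of 1e12 FLOP/s and 1e11 bytes/s the ridge point is 10 FLOP/byte, so a layer with an arithmetic intensity below 10 is classified as memory-bound.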

Deviation from the initial plan (worth flagging)

The plan called for a two-pass profile using torchprofile for FLOPs and a separate timing pass. The installed torchprofile build had a broken handler dispatch that silently returned zero FLOPs per layer, so the implementation switched to analytical FLOPs for Conv/Linear in a single pass. Trade-offs: exact math, no cross-pass drift, no external profiler dependency — at the cost of narrower layer-type coverage. Other layer types fall into the already-planned "undefined" bucket, so this was an architecturally clean forced deviation.
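
As a rough illustration of the analytical approach, the sketch below computes FLOPs from layer shapes and returns `None` with a warning for unsupported layer kinds instead of a silent zero. The function names, the string-based dispatch, the convention of 2 FLOPs per multiply-accumulate, and ignoring bias are all assumptions made for the example. It covers ordinary convolution and Linear only, and does not reproduce the hook-based implementation in the notebook.

```python
import warnings
from math import prod
from typing import Optional


def linear_flops(batch: int, in_features: int, out_features: int) -> int:
    # One multiply and one add per weight per sample (bias ignored).
    return 2 * batch * in_features * out_features


def conv_flops(out_shape, in_channels, kernel_size, groups=1) -> int:
    # out_shape is (N, C_out, *spatial); kernel_size may be 1-, 2- or 3-d.
    macs_per_output = (in_channels // groups) * prod(kernel_size)
    return 2 * prod(out_shape) * macs_per_output


def layer_flops(kind: str, **kw) -> Optional[int]:
    """Return FLOPs for supported kinds, or None (never 0) otherwise."""
    if kind == "linear":
        return linear_flops(**kw)
    if kind == "conv":
        return conv_flops(**kw)
    warnings.warn(f"no analytical FLOP rule for {kind!r}; bound is 'undefined'")
    return None
```

Returning `None` rather than `0` keeps an unsupported layer distinguishable from a layer that genuinely does no arithmetic, which is the failure mode the torchprofile path exhibited.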

Tests

  • 4 inline unit tests: basic peaks probe, cache identity, hand-computed Conv2d (flops=36864, bytes=4224, AI≈8.73), Linear stack with synthetic peaks.
  • 2 #|slow integration tests: ResNet-18 CPU with synthetic peaks, CUDA smoke test guarded by is_available().
  • Tutorial at nbs/tutorials/roofline.ipynb showing hardware peaks, per-layer profiling on ResNet-18, and AI shift when input resolution changes from 224×224 to 512×512.
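
The hand-computed Conv2d figures can be reproduced with plain arithmetic. The layer configuration below is an assumption: the description does not state the one used in the test. It is simply one configuration that yields the quoted numbers when each multiply-accumulate counts as 2 FLOPs and bytes are weights + input + output in fp32.

```python
# Assumed configuration (not stated in the PR): Conv2d(4 -> 8, 3x3 kernel,
# padding 1, no bias) on a 1x4x8x8 fp32 input, 2 FLOPs per multiply-accumulate.
n, c_in, h, w = 1, 4, 8, 8
c_out, k = 8, 3
h_out, w_out = h, w          # a 3x3 kernel with padding 1 preserves H and W
dtype_bytes = 4              # fp32

output_elems = n * c_out * h_out * w_out          # 512
flops = 2 * output_elems * c_in * k * k           # 2 * 512 * 36

weight_elems = c_out * c_in * k * k               # 288
input_elems = n * c_in * h * w                    # 256
bytes_moved = dtype_bytes * (weight_elems + input_elems + output_elems)

ai = flops / bytes_moved
print(flops, bytes_moved, round(ai, 2))  # 36864 4224 8.73
```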

Architectural boundary

  • No imports from fasterai / fasterlatency / fasterrecipes.
  • No bottlenecks() / suggest() / recommendation helpers.
  • Tutorial avoids prescriptive language; includes a callout directing users to fasterrecipes for compression decisions.

Test plan

  • nbdev-test --path nbs/analysis/roofline.ipynb passes
  • nbdev-test --path nbs/analysis/roofline.ipynb --flags slow passes
  • nbdev-test (full suite) passes
  • Tutorial executes end-to-end (jupyter nbconvert --execute)
  • No prescriptive language ("consider", "recommend", "try", "should compress") in notebook or tutorial
  • No cross-package imports (only fasterbench.core / fasterbench.profiling + torch / numpy / plotly)
  • nbdev-clean leaves a clean checkout
  • CI passes on feature/roofline
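
The cross-package import item could be checked mechanically along these lines. This is only a sketch: `forbidden_imports` is a hypothetical helper, not part of fasterbench or nbdev, and it expects exported Python source rather than the notebook JSON.

```python
import ast

# Package names taken from the "Architectural boundary" list above.
FORBIDDEN = {"fasterai", "fasterlatency", "fasterrecipes"}


def forbidden_imports(source: str) -> set:
    """Return the forbidden top-level packages imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from .core import x".
            names = [node.module]
        else:
            continue
        found |= {name.split(".")[0] for name in names} & FORBIDDEN
    return found
```

An empty result means the source stays inside the boundary; anything else names the offending packages.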

Adds measurement primitives for compute-vs-memory-bound layer analysis:

- measure_peaks(): empirical probe of peak FLOPs/s (matmul) and streaming
  bandwidth (cache-defeating memcpy). Pins TF32 off by default for honest
  fp32 peaks; caches per (device, dtype, sizes).
- HardwarePeaks: dataclass with peak_flops, peak_bandwidth, ridge_point,
  and the flags under which they were measured.
- RooflineAnalyzer: per-layer profiler with .profile() / .summary() /
  .plot(). Single-pass hooks measure FLOPs (analytical for Conv and
  Linear), bytes (weights + input + output, Williams 2009), and time.
  Classifies each layer as memory-bound or compute-bound; layers outside
  Conv/Linear land in an "undefined" bucket with a warning.
- Plotly log-log roofline with transparent background and the project
  teal palette.

Per the measurement-only contract, fasterbench exposes numbers and the
plot; compression decisions belong in fasterrecipes.

Includes:
- API notebook nbs/analysis/roofline.ipynb with inline unit tests
  (hand-computed Conv2d flops/bytes, cache test, Linear stack) and
  #|slow integration tests (ResNet-18 CPU with synthetic peaks,
  CUDA smoke test guarded by is_available()).
- Tutorial nbs/tutorials/roofline.ipynb showing hardware peaks,
  ResNet-18 profiling, and AI shift across input resolutions.
- Sidebar + index.ipynb re-exports.

fasterbench's nbdev settings use tst_flags='notest' (not 'slow'),
so #|slow cells were running in CI and failing on the torchvision
import. Rename to #|notest to match package convention.
nathanhubens merged commit 2458c51 into master on Apr 21, 2026
2 checks passed