feat: add roofline analysis (RooflineAnalyzer, measure_peaks) by nathanhubens · Pull Request #5 · FasterAI-Labs/fasterbench

nathanhubens · 2026-04-17T16:06:08Z

Summary

Adds roofline-model measurement primitives to fasterbench. Per the measurement-only contract, this module exposes numbers and a plot; compression decisions live in fasterrecipes.

New public API

measure_peaks(device, *, dtype, matmul_size, bandwidth_mb, warmup, steps, allow_tf32, cache) — empirical probe of achievable peak FLOPs/s (via large matmul) and streaming bandwidth (via cache-defeating memcpy). Pins TF32 off by default for honest fp32 peaks. Records tf32_enabled and cudnn_benchmark flags in the result.
HardwarePeaks — dataclass with peak_flops, peak_bandwidth, ridge_point, device, dtype, and the pinned flags.
RooflineAnalyzer(model, sample, peaks=None) — per-layer profiler with .profile(device, warmup, steps), .summary(top), .plot(title), .results. Single-pass forward hooks measure FLOPs (analytically for Conv{1,2,3}d, ConvTranspose{1,2,3}d, Linear), bytes moved (weights + input + output per Williams 2009), and wall time. Layers outside those types get bound="undefined" with a warning — no silent zeros.
RooflinePoint — per-layer result with flops, bytes_moved, time_s, arithmetic_intensity, achieved_gflops, bound ("memory"/"compute"/"undefined"), utilization_pct.
clear_peaks_cache() — explicit cache reset.
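The roofline bookkeeping behind these fields can be sketched in plain Python. This is an illustration only, not the fasterbench implementation: `Peaks` and `classify` are hypothetical names, and the definition of `utilization_pct` as achieved throughput relative to the roof at the layer's arithmetic intensity is an assumption (the PR does not state the formula).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peaks:
    """Hypothetical stand-in for HardwarePeaks (only the fields used here)."""
    peak_flops: float      # FLOPs/s
    peak_bandwidth: float  # bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOPs/byte) at which the two roofs meet.
        return self.peak_flops / self.peak_bandwidth


def classify(flops: float, bytes_moved: float, time_s: float, peaks: Peaks) -> dict:
    """Place one layer on the roofline and return plain numbers."""
    if flops <= 0 or bytes_moved <= 0 or time_s <= 0:
        return {"bound": "undefined"}
    ai = flops / bytes_moved                 # FLOPs per byte
    achieved = flops / time_s                # FLOPs/s actually reached
    # Attainable throughput at this intensity: the lower of the two roofs.
    roof = min(peaks.peak_flops, ai * peaks.peak_bandwidth)
    return {
        "arithmetic_intensity": ai,
        "achieved_gflops": achieved / 1e9,
        "bound": "memory" if ai < peaks.ridge_point else "compute",
        "utilization_pct": 100.0 * achieved / roof,  # assumed definition
    }


# Synthetic peaks with a ridge point of 10 FLOPs/byte.
peaks = Peaks(peak_flops=1e12, peak_bandwidth=1e11)
print(classify(flops=36864, bytes_moved=4224, time_s=1e-6, peaks=peaks)["bound"])  # memory
```

The function returns measurements and a classification only, in keeping with the measurement-only contract.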

Deviation from the initial plan (worth flagging)

The plan called for a two-pass profile: torchprofile for FLOPs, then a separate timing pass. The installed torchprofile build had a broken handler dispatch that silently returned zero FLOPs per layer, so the implementation switched to analytical FLOPs for Conv/Linear in a single pass. Trade-offs: exact math, no cross-pass drift, and no external profiler dependency, at the cost of narrower layer-type coverage. Other layer types fall into the already-planned "undefined" bucket, so the forced deviation fits the planned architecture cleanly.
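A minimal sketch of what analytical FLOP counting with an explicit "undefined" fallback can look like, for readers unfamiliar with the approach. The function names and the shape-based interface are hypothetical (the actual module reads shapes from forward hooks), and counting 2 FLOPs per multiply-accumulate with bias ignored is an assumption. ConvTranspose is left out of the sketch.

```python
import warnings
from math import prod


def conv_flops(c_in, c_out, kernel, out_spatial, groups=1):
    """FLOPs for one sample of a 1d/2d/3d convolution (2 FLOPs per MAC, bias ignored)."""
    macs_per_output = (c_in // groups) * prod(kernel)
    return 2 * macs_per_output * c_out * prod(out_spatial)


def linear_flops(in_features, out_features, batch=1):
    """FLOPs for a dense layer (2 FLOPs per MAC, bias ignored)."""
    return 2 * in_features * out_features * batch


def layer_flops(kind, **shape):
    """Return FLOPs for a supported layer kind, or None for the 'undefined' bucket."""
    if kind == "conv":
        return conv_flops(**shape)
    if kind == "linear":
        return linear_flops(**shape)
    # Warn instead of returning 0, so unsupported layers never look free.
    warnings.warn(f"no analytical FLOP formula for {kind!r}; bound is 'undefined'")
    return None
```

Returning `None` with a warning, rather than zero, mirrors the "no silent zeros" goal: a zero would be indistinguishable from the broken-profiler symptom that motivated the change.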

Tests

4 inline unit tests: basic peaks probe, cache identity, hand-computed Conv2d (flops=36864, bytes=4224, AI≈8.73), Linear stack with synthetic peaks.
2 #|slow integration tests (re-tagged #|notest by the follow-up fix commit): ResNet-18 CPU with synthetic peaks, CUDA smoke test guarded by is_available().
Tutorial at nbs/tutorials/roofline.ipynb showing hardware peaks, per-layer profiling on ResNet-18, and AI shift when input resolution changes from 224×224 to 512×512.
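The hand-computed Conv2d figures can be reproduced with a few lines of arithmetic. The PR does not state the layer configuration, so the one below is an assumption: a bias-free 3×3 convolution with 4 input channels, 8 output channels, and padding 1 on a single 4×8×8 fp32 input is one configuration consistent with flops=36864 and bytes=4224. `conv2d_roofline` is a hypothetical helper, not part of fasterbench.

```python
def conv2d_roofline(c_in, c_out, k, h, w, padding=0, stride=1, bytes_per_elem=4):
    """Return (flops, bytes_moved, arithmetic_intensity) for one sample, no bias."""
    h_out = (h + 2 * padding - k) // stride + 1
    w_out = (w + 2 * padding - k) // stride + 1
    # 2 FLOPs (multiply + add) per multiply-accumulate.
    flops = 2 * c_in * c_out * k * k * h_out * w_out
    # Bytes moved = weights + input + output, each element counted once.
    elems = c_in * c_out * k * k + c_in * h * w + c_out * h_out * w_out
    bytes_moved = elems * bytes_per_elem
    return flops, bytes_moved, flops / bytes_moved


flops, nbytes, ai = conv2d_roofline(4, 8, 3, 8, 8, padding=1)
print(flops, nbytes, round(ai, 2))  # 36864 4224 8.73

# Same layer, larger input: weight bytes stay fixed while FLOPs and
# activation bytes grow, so arithmetic intensity shifts upward.
_, _, ai_big = conv2d_roofline(4, 8, 3, 16, 16, padding=1)
print(ai_big > ai)  # True
```

The second call illustrates, on a toy layer, why arithmetic intensity depends on input resolution; it does not reproduce the tutorial's ResNet-18 numbers.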

Architectural boundary

No imports from fasterai / fasterlatency / fasterrecipes.
No bottlenecks() / suggest() / recommendation helpers.
Tutorial avoids prescriptive language; includes a callout directing users to fasterrecipes for compression decisions.

Test plan

nbdev-test --path nbs/analysis/roofline.ipynb passes
nbdev-test --path nbs/analysis/roofline.ipynb --flags slow passes
nbdev-test (full suite) passes
Tutorial executes end-to-end (jupyter nbconvert --execute)
No prescriptive language ("consider", "recommend", "try", "should compress") in notebook or tutorial
No cross-package imports (only fasterbench.core / fasterbench.profiling + torch / numpy / plotly)
nbdev-clean leaves a clean checkout
CI passes on feature/roofline

Adds measurement primitives for compute-vs-memory-bound layer analysis:

- measure_peaks(): empirical probe of peak FLOPs/s (matmul) and streaming bandwidth (cache-defeating memcpy). Pins TF32 off by default for honest fp32 peaks; caches per (device, dtype, sizes).
- HardwarePeaks: dataclass with peak_flops, peak_bandwidth, ridge_point, and the flags under which they were measured.
- RooflineAnalyzer: per-layer profiler with .profile() / .summary() / .plot(). Single-pass hooks measure FLOPs (analytical for Conv and Linear), bytes (weights + input + output, Williams 2009), and time. Classifies each layer as memory-bound or compute-bound; layers outside Conv/Linear land in an "undefined" bucket with a warning.
- Plotly log-log roofline with transparent background and the project teal palette.

Per the measurement-only contract, fasterbench exposes numbers and the plot; compression decisions belong in fasterrecipes.

Includes:

- API notebook nbs/analysis/roofline.ipynb with inline unit tests (hand-computed Conv2d flops/bytes, cache test, Linear stack) and #|slow integration tests (ResNet-18 CPU with synthetic peaks, CUDA smoke test guarded by is_available()).
- Tutorial nbs/tutorials/roofline.ipynb showing hardware peaks, ResNet-18 profiling, and AI shift across input resolutions.
- Sidebar + index.ipynb re-exports.

nathanhubens added 2 commits April 17, 2026 18:05

fix: use #|notest instead of #|slow for CI-skipped cells

d233df7

fasterbench's nbdev settings use tst_flags='notest' (not 'slow'), so #|slow cells were running in CI and failing on the torchvision import. Rename to #|notest to match package convention.

nathanhubens merged commit 2458c51 into master Apr 21, 2026
2 checks passed

nathanhubens merged 2 commits into master from feature/roofline
