feat: add roofline analysis (RooflineAnalyzer, measure_peaks) #5
Merged
Adds measurement primitives for compute-vs-memory-bound layer analysis:

- measure_peaks(): empirical probe of peak FLOPs/s (matmul) and streaming bandwidth (cache-defeating memcpy). Pins TF32 off by default for honest fp32 peaks; caches per (device, dtype, sizes).
- HardwarePeaks: dataclass with peak_flops, peak_bandwidth, ridge_point, and the flags under which they were measured.
- RooflineAnalyzer: per-layer profiler with .profile() / .summary() / .plot(). Single-pass hooks measure FLOPs (analytical for Conv and Linear), bytes (weights + input + output, Williams 2009), and time. Classifies each layer as memory-bound or compute-bound; layers outside Conv/Linear land in an "undefined" bucket with a warning.
- Plotly log-log roofline with transparent background and the project teal palette.

Per the measurement-only contract, fasterbench exposes numbers and the plot; compression decisions belong in fasterrecipes.

Includes:

- API notebook nbs/analysis/roofline.ipynb with inline unit tests (hand-computed Conv2d flops/bytes, cache test, Linear stack) and #|slow integration tests (ResNet-18 CPU with synthetic peaks, CUDA smoke test guarded by is_available()).
- Tutorial nbs/tutorials/roofline.ipynb showing hardware peaks, ResNet-18 profiling, and AI shift across input resolutions.
- Sidebar + index.ipynb re-exports.
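The hand-computed Conv2d flops/bytes check mentioned above can be reproduced along the following lines. This is a minimal sketch, not the code from this PR: the names Conv2dShape, conv2d_out_hw, conv2d_flops and conv2d_bytes are hypothetical, and it assumes one multiply-accumulate counts as 2 FLOPs, fp32 (4-byte) elements, and no bias term. The module itself may use different conventions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Conv2dShape:
    """Hypothetical container for the shape parameters of one Conv2d call."""
    batch: int
    in_channels: int
    out_channels: int
    kernel: Tuple[int, int]   # (kH, kW)
    in_hw: Tuple[int, int]    # (H, W) of the input feature map
    stride: int = 1
    padding: int = 0
    groups: int = 1


def conv2d_out_hw(s: Conv2dShape) -> Tuple[int, int]:
    """Output spatial size for dilation=1 (standard Conv2d shape formula)."""
    kh, kw = s.kernel
    h, w = s.in_hw
    out_h = (h + 2 * s.padding - kh) // s.stride + 1
    out_w = (w + 2 * s.padding - kw) // s.stride + 1
    return out_h, out_w


def conv2d_flops(s: Conv2dShape) -> int:
    """Analytical FLOPs, assuming 1 multiply-accumulate = 2 FLOPs and no bias."""
    out_h, out_w = conv2d_out_hw(s)
    kh, kw = s.kernel
    macs = (s.batch * s.out_channels * out_h * out_w
            * (s.in_channels // s.groups) * kh * kw)
    return 2 * macs


def conv2d_bytes(s: Conv2dShape, bytes_per_elem: int = 4) -> int:
    """Bytes moved as weights + input + output, each counted once."""
    out_h, out_w = conv2d_out_hw(s)
    kh, kw = s.kernel
    h, w = s.in_hw
    weights = s.out_channels * (s.in_channels // s.groups) * kh * kw
    inputs = s.batch * s.in_channels * h * w
    outputs = s.batch * s.out_channels * out_h * out_w
    return (weights + inputs + outputs) * bytes_per_elem


# ResNet-18 stem convolution: 3 -> 64 channels, 7x7 kernel, stride 2, padding 3.
stem = Conv2dShape(batch=1, in_channels=3, out_channels=64,
                   kernel=(7, 7), in_hw=(224, 224), stride=2, padding=3)
print(conv2d_flops(stem))   # 236027904
print(conv2d_bytes(stem))   # 3851008
print(conv2d_flops(stem) / conv2d_bytes(stem))  # arithmetic intensity, FLOPs/byte
```

Counting each tensor once gives a lower bound on memory traffic; real kernels may re-read data, so measured intensity can be lower than this estimate.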
fasterbench's nbdev settings use tst_flags='notest' (not 'slow'), so #|slow cells were running in CI and failing on the torchvision import. Rename to #|notest to match package convention.
Summary
Adds roofline-model measurement primitives to fasterbench. Per the measurement-only contract, this module exposes numbers and a plot; compression decisions live in fasterrecipes.
New public API
- measure_peaks(device, *, dtype, matmul_size, bandwidth_mb, warmup, steps, allow_tf32, cache): empirical probe of achievable peak FLOPs/s (via a large matmul) and streaming bandwidth (via a cache-defeating memcpy). Pins TF32 off by default for honest fp32 peaks. Records the tf32_enabled and cudnn_benchmark flags in the result.
- HardwarePeaks: dataclass with peak_flops, peak_bandwidth, ridge_point, device, dtype, and the pinned flags.
- RooflineAnalyzer(model, sample, peaks=None): per-layer profiler with .profile(device, warmup, steps), .summary(top), .plot(title), and .results. Single-pass forward hooks measure FLOPs (analytically for Conv{1,2,3}d, ConvTranspose{1,2,3}d, Linear), bytes moved (weights + input + output, per Williams 2009), and wall time. Layers outside those types get bound="undefined" with a warning; no silent zeros.
- RooflinePoint: per-layer result with flops, bytes_moved, time_s, arithmetic_intensity, achieved_gflops, bound ("memory" / "compute" / "undefined"), and utilization_pct.
- clear_peaks_cache(): explicit cache reset.

Deviation from the initial plan (worth flagging)
The plan called for a two-pass profile using torchprofile for FLOPs and a separate timing pass. The installed torchprofile build had a broken handler dispatch that silently returned zero FLOPs per layer, so the implementation switched to analytical FLOPs for Conv/Linear in a single pass. Trade-offs: exact math, no cross-pass drift, and no external profiler dependency, at the cost of narrower layer-type coverage. Other layer types fall into the already-planned "undefined" bucket, so this was an architecturally clean forced deviation.

Tests
- #|slow integration tests: ResNet-18 CPU with synthetic peaks, CUDA smoke test guarded by is_available().
- nbs/tutorials/roofline.ipynb showing hardware peaks, per-layer profiling on ResNet-18, and the arithmetic-intensity (AI) shift when input resolution changes from 224×224 to 512×512.

Architectural boundary
bottlenecks() / suggest() / recommendation helpers.

Test plan
- nbdev-test --path nbs/analysis/roofline.ipynb passes
- nbdev-test --path nbs/analysis/roofline.ipynb --flags slow passes
- nbdev-test (full suite) passes
- jupyter nbconvert --execute)
- nbdev-clean leaves a clean checkout
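For reviewers who want to sanity-check the memory/compute/undefined classification by hand, the roofline arithmetic behind the RooflinePoint fields can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation in this PR: classify_layer is a hypothetical helper, the returned dict only mirrors the field names listed under New public API, and utilization is assumed to be measured against the roofline ceiling at the layer's arithmetic intensity. The module may define it differently, for example against peak_flops.

```python
from typing import Dict, Optional, Union

Number = Union[int, float]


def classify_layer(flops: Number, bytes_moved: Number, time_s: float,
                   peak_flops: float, peak_bandwidth: float
                   ) -> Dict[str, Optional[Union[float, str]]]:
    """Hypothetical helper: place one layer on a roofline.

    flops / bytes_moved / time_s describe the layer; peak_flops (FLOPs/s) and
    peak_bandwidth (bytes/s) describe the hardware.
    """
    # Layers without a FLOP or byte count cannot be placed on the plot.
    if flops <= 0 or bytes_moved <= 0 or time_s <= 0:
        return {"arithmetic_intensity": None, "achieved_gflops": None,
                "bound": "undefined", "utilization_pct": None}

    ridge_point = peak_flops / peak_bandwidth        # FLOPs per byte
    arithmetic_intensity = flops / bytes_moved       # FLOPs per byte
    achieved = flops / time_s                        # FLOPs per second

    # Roofline ceiling at this intensity: bandwidth-limited slope, capped by peak.
    ceiling = min(peak_flops, arithmetic_intensity * peak_bandwidth)

    bound = "memory" if arithmetic_intensity < ridge_point else "compute"
    return {
        "arithmetic_intensity": arithmetic_intensity,
        "achieved_gflops": achieved / 1e9,
        "bound": bound,
        # Assumption: utilization is relative to the ceiling, not to peak_flops.
        "utilization_pct": 100.0 * achieved / ceiling,
    }


# Synthetic peaks: 100 GFLOP/s and 10 GB/s give a ridge point of 10 FLOPs/byte.
point = classify_layer(flops=2e9, bytes_moved=1e9, time_s=0.5,
                       peak_flops=1e11, peak_bandwidth=1e10)
print(point["bound"])  # memory
```

With synthetic peaks like these, a layer at 2 FLOPs/byte sits left of the ridge point and is memory-bound, while the same code labels a layer at 40 FLOPs/byte compute-bound.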