Summary
We need a system for capturing performance benchmark timings at specific commits and comparing them against the stored baseline or other snapshots.
Problem
Currently test_perf.py only asserts pass/fail against hardcoded baselines with a 2x multiplier. There's no way to:
- See actual timings vs baseline in a table
- Save a snapshot at a commit for later comparison
- Compare two arbitrary snapshots (e.g. before/after a change)
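As a rough illustration of the table the first bullet asks for, the sketch below renders measured timings next to their baselines and flags anything beyond the 2x multiplier. The function name `format_comparison`, the column layout, and the benchmark names in the usage note are assumptions made for this example; none of them come from test_perf.py.

```python
def format_comparison(timings, baselines, multiplier=2.0):
    """Render measured vs. baseline seconds as a plain-text table.

    `timings` and `baselines` map benchmark name -> seconds.
    Baselines are assumed to be positive numbers.
    """
    header = f"{'benchmark':<24}{'baseline':>10}{'actual':>10}{'ratio':>8}  status"
    rows = [header, "-" * len(header)]
    for name in sorted(baselines):
        base = baselines[name]
        actual = timings.get(name)
        if actual is None:
            # Benchmark has a baseline but was not measured in this run.
            rows.append(f"{name:<24}{base:>10.3f}{'n/a':>10}{'n/a':>8}  MISSING")
            continue
        ratio = actual / base
        # Same rule as the existing pass/fail check: fail above multiplier * baseline.
        status = "ok" if actual <= base * multiplier else "SLOW"
        rows.append(f"{name:<24}{base:>10.3f}{actual:>10.3f}{ratio:>7.2f}x  {status}")
    return "\n".join(rows)
```

Calling `print(format_comparison({"open_file": 0.12}, {"open_file": 0.1}))` with these made-up numbers would list `open_file` at a ratio of 1.20x with status `ok`. Comparing two snapshots could reuse the same function by passing one snapshot's timings as `baselines`.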
Features
poetry run python bench_snapshot.py — run benchmarks, print comparison table against baselines
poetry run python bench_snapshot.py --save — save timings as a named snapshot (keyed by git short hash)
poetry run python bench_snapshot.py --compare <commit> — compare a saved snapshot to baselines
poetry run python bench_snapshot.py --list — list saved snapshots
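One possible way to wire up the four invocations above is a small argparse parser. This is only a sketch: treating `--save`, `--compare` and `--list` as mutually exclusive is an assumption, and the helper names `build_parser` and `select_mode` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the four invocations listed above."""
    parser = argparse.ArgumentParser(prog="bench_snapshot.py")
    # Assumption: only one mode flag may be given per run.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--save", action="store_true",
                      help="save timings as a snapshot keyed by git short hash")
    mode.add_argument("--compare", metavar="COMMIT",
                      help="compare a saved snapshot to baselines")
    mode.add_argument("--list", action="store_true",
                      help="list saved snapshots")
    return parser


def select_mode(argv):
    """Map command-line arguments to one of: run, save, compare, list."""
    args = build_parser().parse_args(argv)
    if args.save:
        return "save"
    if args.compare is not None:
        return "compare"
    if args.list:
        return "list"
    # No flag: run benchmarks and print the comparison against baselines.
    return "run"
```

With no flags, `select_mode([])` falls through to the default `run` mode, which matches the first invocation in the list.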
Technical Considerations
- pytest --durations includes test setup/teardown overhead (~0.3-0.5s per test for Textual app init), so timings won't match the pure operation baselines exactly
- Ideally, instrument the tests to print actual operation timings (the elapsed variable in each test) rather than relying on --durations
- Consider a pytest plugin or conftest fixture that captures and reports the measured elapsed values
- Snapshots should be gitignored (machine-specific)
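The considerations above could be combined roughly as follows: a recorder times only the operation itself (so setup/teardown is excluded), and the collected values are written to a JSON file named after the git short hash. This is a sketch under assumptions: `TimingRecorder`, `save_snapshot`, `load_snapshot`, and the JSON layout are hypothetical, and nothing here is taken from the existing tests.

```python
import json
import subprocess
import time
from contextlib import contextmanager
from pathlib import Path


class TimingRecorder:
    """Collects per-benchmark elapsed seconds for the measured operation only."""

    def __init__(self):
        self.timings = {}

    @contextmanager
    def measure(self, name):
        # Time just the body of the `with` block, not fixture setup/teardown.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name] = time.perf_counter() - start


def current_short_hash():
    """Return the short hash of HEAD; requires running inside a git checkout."""
    result = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def save_snapshot(timings, commit, directory):
    """Write timings to <directory>/<commit>.json and return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{commit}.json"
    payload = {"commit": commit, "timings": timings}
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path


def load_snapshot(commit, directory):
    """Read back a snapshot previously written by save_snapshot."""
    return json.loads((Path(directory) / f"{commit}.json").read_text())
```

In a conftest.py, a session-scoped fixture could hand a single `TimingRecorder` to every test, and each test would wrap its timed operation in `with recorder.measure("name"):`. The snapshot directory (whatever name is chosen) is what would go into .gitignore.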