
# ExcelBench — Sprint Tracker

Single source of truth for the 7-Dimension Extension initiative. Each row tracks one self-contained sprint (one branch, one PR, one row flip). Resume cold by reading this file and the most recent [*INCOMPLETE*] marker.

Last updated: 2026-04-27 (S3 shipped)

## Status Table

| # | Dimension | Status | Sprint size | Branch | PR | Acceptance commit range |
|---|-----------|--------|-------------|--------|----|-------------------------|
| S1 | Memory honesty + Tracker bootstrap | Shipped | S (3–5 d) | `feat/perf-mem-honesty` | #28 | `50dc104..HEAD@PR#28` |
| S2 | Data shape (int/str/date/formula) | Shipped | M (1 wk) | `feat/perf-data-shape` | #31 | `373c896..cbb530f` |
| S3 | File shape (wide/tall/sparse) | Shipped | M (1 wk) | `feat/perf-file-shape` | #32 | `eb89aba..c7cf25d` |
| S4 | High-cost operations | Planned | M (1 wk) | `feat/perf-operations` | | |
| S5 | Workbook complexity perf | Planned | M (1 wk) | `feat/perf-complexity` | | |
| S6 | Cold-start / warm path | Planned | S (3–5 d) | `feat/perf-cold-start` | | |
| S7 | Round-trip fidelity (LibreOffice) | Planned | L (~2 wk) | `feat/fidelity-roundtrip` | | |

Status legend: Planned → In Progress → Shipped (or Blocked with reason).

## How to Flip a Row

When a sprint lands:

  1. Update the row's Status to Shipped.
  2. Fill in the PR column (#NN).
  3. Fill in Acceptance commit range (abc1234..def5678).
  4. Bump the Last updated line at the top of this file.
  5. Append a sprint acceptance entry (template below) to the Acceptance Notes section.
  6. Add the corresponding DEC-NNN entry to decisions.md if not already done.

If a sprint stalls, switch its status to Blocked and add a one-line reason in the row.
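For illustration, steps 1–3 of the flip could be scripted. The sketch below is entirely hypothetical (the tracker is normally edited by hand; `flip_row` and its signature are invented for this example): it flips a sprint row from Planned to Shipped and appends the PR number and commit range.

```python
import re


def flip_row(tracker_text: str, sprint: str, pr: str, commit_range: str) -> str:
    """Hypothetical helper: flip one sprint row from Planned to Shipped.

    Appends the PR number and acceptance commit range to the row,
    covering steps 1-3 of the row-flip protocol above.
    """
    pattern = re.compile(rf"^({re.escape(sprint)}\s.*?)Planned(.*)$", re.MULTILINE)

    def _sub(m: re.Match) -> str:
        return f"{m.group(1)}Shipped{m.group(2)} {pr} {commit_range}"

    new_text, n = pattern.subn(_sub, tracker_text)
    if n != 1:
        # Refuse ambiguous flips rather than silently editing the wrong row.
        raise ValueError(f"expected exactly one Planned row for {sprint}, found {n}")
    return new_text
```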

## Sprint Acceptance Template

Use this template when appending to Acceptance Notes below.

### S<N> — <Dimension> (YYYY-MM-DD)

**Branch**: `feat/...`  ·  **PR**: #NN  ·  **Commit range**: `abc1234..def5678`

**What shipped**:
- <one-line bullet per major piece>

**Verification**:
- `uv run pytest tests/`
- `uv run ruff check src/ tests/`
- `uv run mypy src/`
- `excelbench <new-subcommand> ...` ✓ (16 adapters, no crashes)
- Dashboard regenerated, results.json + history.jsonl appended.

**Decisions**: DEC-NNN logged in `decisions.md`.

**Deferred / out-of-scope**:
- <items intentionally left for follow-up>

## Acceptance Notes

### S3 — File shape (wide/tall/sparse) (2026-04-27)

**Branch**: `feat/perf-file-shape`  ·  **PR**: #32  ·  **Commit range**: `eb89aba..c7cf25d`

**What shipped**:

- 12 file-shape benchmark scenarios across wide, tall, sparse, and many-sheets categories.
- `excelbench perf-file-shape` CLI with category filtering, tier caps, on-demand fixture regeneration, and Sprint 1 memory-mode support.
- `n_sheets` / `sheet_pattern` workload fan-out so many-sheets runs exercise per-sheet overhead without duplicating dtype logic.
- `_section_file_shape` dashboard heatmaps for read/write throughput by shape category.
- Cross-command staleness guards for `data_shape_*` and `file_shape_*` manifests, fixing the Codex P1 review finding.
- 54 focused data-shape/file-shape tests covering CLI helpers, staleness detection, runner fan-out, and dashboard rendering.
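The `n_sheets` / `sheet_pattern` fan-out described above can be sketched as a small cross-product over structural parameters. This is an illustration only — the names `Workload`, `fan_out`, and the tier values are invented, not the real runner's API:

```python
from itertools import product
from typing import NamedTuple


class Workload(NamedTuple):
    scenario: str
    n_sheets: int
    sheet_pattern: str


def fan_out(
    scenario: str,
    n_sheets_tiers: tuple[int, ...] = (1, 4, 16),       # illustrative tiers
    patterns: tuple[str, ...] = ("uniform", "skewed"),  # illustrative patterns
) -> list[Workload]:
    # One workload per (sheet count, pattern) pair; dtype logic stays in the
    # shared runner, so the fan-out varies only structural parameters.
    return [Workload(scenario, n, p) for n, p in product(n_sheets_tiers, patterns)]
```

The point of the design is that many-sheets scenarios reuse the same per-sheet write path, so per-sheet overhead is measured without a second copy of the dtype branches.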

**Verification**:

- `uv run pytest tests/test_perf_file_shape.py tests/test_perf_data_shape.py -v --no-cov` ✓ (54 passed)
- `uv run ruff check src/ tests/ scripts/`
- `uv run mypy src/`
- PR #32 CI ✓: lint, test 3.11, test 3.12, benchmark, rust_smoke.

**Decisions**: DEC-020 logged in `decisions.md`.

**Deferred / out-of-scope**:

- Full 16+ adapter run at the 1M tier remains a bench-machine task.
- Cross-product of file shape × dtype remains deferred until dashboard data shows that interaction is worth the matrix cost.

### S2 — Data shape (int/str/date/formula) (2026-04-27)

**Branch**: `feat/perf-data-shape`  ·  **PR**: #31  ·  **Commit range**: `373c896..cbb530f`

**What shipped**:

- 10 dtypes × 4 cell-count tiers (1k/10k/100k/1m) data-shape benchmark matrix. Dtypes: int, float, string_short, string_long, boolean, date, datetime, formula_simple, formula_cross_sheet, mixed_realistic.
- `excelbench perf-shape` CLI subcommand with on-demand fixture regen, staleness detection, and `--types`/`--rows` filtering; inherits Sprint 1's `--memory-mode` plumbing.
- `scripts/generate_throughput_fixtures.py` extended with `generate_data_shape_scenarios()` and the `--shape-only`/`--include-1m` flags. Generator and runner content stay in lockstep via a shared 1-based offset convention (fixed in PR review).
- `_run_workload_write` extended with 6 new value_type branches (date, datetime, boolean, formula_simple, formula_cross_sheet, mixed_realistic), plus float coverage via the existing number path.
- `_section_data_shape` dashboard helper — per-dtype log-normalized heatmaps for read and write at the largest available tier; the tooltip shows the full per-tier curve.
- mixed_realistic ratio (60/30/5/3/2 string/int/date/formula/blank) documented in `fixtures/synthetic_calibration/sample_set.md`.
- 31 new tests in `tests/test_perf_data_shape.py` covering all value_type branches, CLI helpers, dashboard rendering, and end-to-end perf_shape invocation.
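The documented mixed_realistic ratio can be illustrated with a small sampler. This is a sketch under assumptions — the real generator lives in `scripts/generate_throughput_fixtures.py` and its internals (function names, cell payloads) may differ; only the 60/30/5/3/2 weights come from the source:

```python
import random
from datetime import date

# Documented mixed_realistic ratio: 60% string, 30% int, 5% date,
# 3% formula, 2% blank (fixtures/synthetic_calibration/sample_set.md).
WEIGHTS = {"string": 60, "int": 30, "date": 5, "formula": 3, "blank": 2}


def mixed_realistic_cells(n: int, seed: int = 0) -> list[object]:
    """Hypothetical sampler: draw n cell values following the documented ratio."""
    rng = random.Random(seed)  # seeded so fixture content is reproducible
    kinds = rng.choices(list(WEIGHTS), weights=list(WEIGHTS.values()), k=n)
    out: list[object] = []
    for i, kind in enumerate(kinds):
        if kind == "string":
            out.append(f"item-{i}")
        elif kind == "int":
            out.append(i)
        elif kind == "date":
            out.append(date(2026, 1, 1 + i % 28))
        elif kind == "formula":
            out.append(f"=SUM(A1:A{i + 1})")
        else:
            out.append(None)  # blank cell
    return out
```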

**Verification**:

- `uv run pytest tests/` ✓ (1171 passed, 32 skipped, 6 xfailed)
- `uv run ruff check src/ tests/ scripts/`
- `uv run mypy src/`
- Local coverage 67.64%; Linux CI ~65.9% (gate: 65%).
- All 5 CI jobs green: lint, test 3.11/3.12, benchmark, rust_smoke.
- 9 Copilot inline review comments addressed (tier cap mismatch, generator/runner offset for boolean/date/datetime, README wording, scenario-vs-row count, isinstance form), with reply threads.

**Decisions**: DEC-019 logged in `decisions.md`.

**Deferred / out-of-scope**:

- High-cost operations (append_rows, iter_rows_values, modify_one_cell, cell.font access) — Sprint 4.
- File shape (wide/tall/sparse/many-sheets) — Sprint 3.
- Cold-import cost per dtype — Sprint 6 (subprocess isolation needed).
- Calibration of mixed_realistic against a corpus larger than the 50-file sample — flagged in DEC-019; revisit if dashboard data suggests the ratio is wrong.

### S1 — Memory honesty + Tracker bootstrap (2026-04-27)

**Branch**: `feat/perf-mem-honesty`  ·  **PR**: #28  ·  **Commit range**: `50dc104..HEAD` (final range fills in on merge)

**What shipped**:

- `TRACKER.md` (this file) — 7-row sprint table, row-flip protocol, acceptance template.
- `src/excelbench/perf/memory.py` — three-mode memory harness (`getrusage` / `tracemalloc` / `time` via a `/usr/bin/time -l` subprocess, plus an `all` composite). `MemoryProbe` context manager for the in-process modes; `parse_time_l_stderr` cross-platform parser (macOS BSD `time` + GNU `time -l`).
- `PerfOpResult` extended with `rss_kb_via_time` and `python_heap_peak_kb` fields (the existing `rss_peak_mb` is preserved — backwards-compatible).
- `src/excelbench/perf/_iter_subprocess.py` — internal subprocess entrypoint that runs one iteration per invocation; the parent wraps it under `/usr/bin/time -l`.
- `excelbench perf --memory-mode={getrusage,tracemalloc,time,all}` CLI flag.
- The HTML dashboard renders dual RSS (MB) cells — getrusage / `time -l` — with a tooltip explaining divergence whenever any entry has a `time -l` measurement.
- DEC-018 documents why the three modes coexist and what each is honest about.

**Verification** (run on macOS 25.2, Python 3.13):

- `uv run pytest tests/` ✓ (1140 passed, 32 skipped, 6 xfailed)
- `uv run ruff check src/excelbench/perf/ src/excelbench/cli.py src/excelbench/results/html_dashboard.py`
- `uv run mypy src/excelbench/perf/` ✓ no issues
- `excelbench perf --memory-mode=all --feature cell_values --adapter wolfxl --adapter openpyxl --warmup 1 --iters 2`:
  - All three fields populated as expected.
  - Python-heap honesty signal landed: openpyxl uses 16× (read) and 227× (write) more Python heap than wolfxl on the same workload, confirming wolfxl pushes allocations into Rust.
  - `time -l`/getrusage ratio ~0.97× on small fixtures (subprocess startup dominates); expected to diverge meaningfully once Sprint 2 lands ≥1M-cell fixtures.

**Decisions**: DEC-018 logged in `decisions.md`.

**Deferred / out-of-scope**:

- Tracemalloc reset semantics across nested probes — the current code calls `reset_peak()` when a probe re-enters an already-traced context. Revisit if any caller starts tracemalloc outside the probe.
- `time -l` subprocess support on Windows — skipped silently (no `/usr/bin/time`). Sprint 6 (cold-start) will set the precedent for cross-platform subprocess handling.
- Visualizing the `time -l`/getrusage divergence as a dedicated chart — a single dual cell with tooltip is sufficient until S2 ships larger fixtures that make the gap visible.
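The `reset_peak()` nesting behavior flagged above can be illustrated with a minimal probe sketch. This is assumed behavior, not the real `MemoryProbe`: the name `heap_peak_probe` and the dict-result shape are invented; only the start-vs-reset decision reflects the approach described in the deferred note.

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def heap_peak_probe():
    """Minimal sketch of a nested-safe Python-heap peak probe."""
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    else:
        # Re-entered an already-traced context: reset the peak counter so the
        # inner probe reports its own peak. Side effect (the documented caveat):
        # this also clobbers the outer probe's peak reading.
        tracemalloc.reset_peak()
    result: dict[str, int] = {}
    try:
        yield result
    finally:
        _, peak = tracemalloc.get_traced_memory()
        result["python_heap_peak_kb"] = peak // 1024
        if started_here:
            tracemalloc.stop()  # only the outermost probe stops tracing
```

Allocating inside the probe then reading `result["python_heap_peak_kb"]` after the `with` block shows the inner probe's own peak, not the outer trace's high-water mark.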

## Reference

- Plan: see the wolfxl session that produced this tracker (multi-sprint roadmap).
- Architecture: `architecture.md`
- Decisions: `decisions.md`
- Key seams: `src/excelbench/perf/runner.py`, `src/excelbench/harness/adapters/base.py`, `src/excelbench/results/html_dashboard.py`.