perf(backend): clarify backend contract and make DuckDB native-fast

# Backend Contract And Native Backend Performance Plan

## Summary

- Branch: perf/backend-contract-duckdb-native.
- GitHub issue: create one new umbrella issue in marco0560/codira, milestone Phase 1, labels enhancement, P2, area:backend, area:core, area:query, type:architecture.
- Confirmed decisions: full backend contract redesign, DuckDB-native writer inside the existing duckdb plugin name, strict parity targets, and pyarrow allowed for typed bulk ingestion.
- Non-goal: do not implement issue #20; embeddings stay in the deterministic backend until the optional vector backend work opens.

## Implementation Ledger

1. Create the branch and GitHub issue first.
    - Title: perf(backend): clarify backend contract and make DuckDB native-fast
    - Body: use this plan as the issue body.
    - Record the created issue number in a new execution ledger: docs/process/issue-<N>-backend-performance-execution.md.
2. Maintain the ledger throughout the branch.
    - Track phases, status, decisions, changed contract surfaces, benchmark commands, benchmark artifact paths, unresolved risks, and final merge evidence.
    - Update it after each phase, not only at the end.
3. Merge only after the final benchmark campaign, docs alignment, codira audit, full validation, and issue acceptance checklist are complete.

## Key Changes

1. Redesign the backend contract before optimizing implementations.
    - Replace hidden SQLite-shaped assumptions with an explicit index-session contract in src/codira/contracts.py.
    - Define separate expectations for write sessions and frequent read/query operations.
    - Core expectations:
        - one active backend per repository instance;
        - deterministic query-equivalent results across first-party backends;
        - full index may rebuild or replace storage wholesale;
        - incremental index receives changed, deleted, and reused path sets;
        - failed files must not leave visible partial rows, but per-file DB rollback/savepoints are not required;
        - ctx, calls, sym, symlist, and audit must use a cheap read path and must not trigger schema repair, derived-index rebuilds, or writer setup.
    - Remove compatibility requirements for previous backend storage versions; bump schema/contract version and rebuild/fail fast as needed.
2. Refactor core indexing around the new session flow.
    - Keep scanning, analyzer selection, and index planning in core.
    - Move backend mutation into begin_index_session(...) and a write-session object.
    - The session owns loading previous embeddings, preparing full/incremental storage, persisting analyzed files in batches, rebuilding derived indexes, writing runtime inventory, commit, abort, and close.
    - Preserve the current index report semantics: successful files count as indexed, failed files are reported, and failed files are excluded from committed backend state.
3. Rebuild DuckDB persistence as a native writer.
    - Add a DuckDB writer module that builds typed Arrow batches per logical table and bulk-loads them into DuckDB staging tables.
    - Remove the hot write path dependency on DB-API-style execute, executemany, cursor adapters, and per-row lastrowid.
    - Allocate internal IDs in batches: deterministically for full rebuilds, and by bounded range allocation for incremental runs.
    - Full index: write to a temporary DuckDB database, create indexes after bulk load, checkpoint, measure size, then atomically replace the active index.
    - Incremental index: write changed files into staging tables, delete changed/deleted paths, swap staged rows in one run-scoped transaction, then rebuild only required derived state.
    - Keep embeddings as typed binary Arrow data for now; do not introduce vector search or ANN behavior.
4. Fix DuckDB warm-readiness and query paths.
    - Add a read-only/cheap-open path for frequent commands.
    - Move schema repair and migration checks out of normal query opens.
    - Ensure that an unchanged codira index run checks metadata, file hashes, analyzer inventory, and runtime inventory before any mutation setup begins.
    - Re-measure ctx, sym, calls, symlist, and audit only after full and warm index are fixed, then optimize remaining query-specific regressions.
5. Optimize SQLite as the control backend.
    - Port SQLite to the new index-session contract without changing operator behavior.
    - Keep SQLite savepoints only where they are still useful for incremental row-oriented writes; do not expose them as a contract requirement.
    - Optimize the observed regressions in ctx, warm index, symlist, audit, sym, and calls.
    - Focus on backend factory caching, connection reuse, cheaper readiness checks, fewer repeated backend calls from query producers, and batched reads for context assembly.
    - Add SQLite regression budgets so future backend-agnostic core work cannot silently slow critical commands.
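
The session flow in items 1 and 2 can be sketched with a toy in-memory backend. This is a minimal illustration, not the contract in src/codira/contracts.py: only begin_index_session comes from this plan, while AnalyzedFile, IndexWriteSession, InMemoryBackend, InMemorySession, and every method signature are hypothetical. The sketch assumes that a changed file which fails analysis loses its old rows, which is one reading of "excluded from committed backend state".

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class AnalyzedFile:
    """Hypothetical stand-in for one analyzer result."""

    path: str
    symbols: tuple[str, ...]
    failed: bool = False


class IndexWriteSession(Protocol):
    """Hypothetical shape of the write-session object."""

    def persist_batch(self, files: list[AnalyzedFile]) -> None: ...
    def commit(self) -> None: ...
    def abort(self) -> None: ...
    def close(self) -> None: ...


class InMemorySession:
    """Stages rows privately and publishes them only on commit."""

    def __init__(self, backend: "InMemoryBackend", reused: set[str]) -> None:
        self._backend = backend
        self._reused = reused
        self._staged: dict[str, tuple[str, ...]] = {}
        self.failed: list[str] = []

    def persist_batch(self, files: list[AnalyzedFile]) -> None:
        for analyzed in files:
            if analyzed.failed:
                # Failed files are reported but never staged.
                self.failed.append(analyzed.path)
            else:
                self._staged[analyzed.path] = analyzed.symbols

    def commit(self) -> None:
        # Keep rows for reused paths, drop changed and deleted paths,
        # then publish the staged rows in a single assignment.
        kept = {p: s for p, s in self._backend.rows.items() if p in self._reused}
        kept.update(self._staged)
        self._backend.rows = kept

    def abort(self) -> None:
        self._staged.clear()

    def close(self) -> None:
        self._staged = {}


@dataclass
class InMemoryBackend:
    """Toy validation backend: committed rows are keyed by file path."""

    rows: dict[str, tuple[str, ...]] = field(default_factory=dict)

    def begin_index_session(
        self, *, changed: set[str], deleted: set[str], reused: set[str]
    ) -> InMemorySession:
        # changed and deleted are accepted for symmetry with the plan;
        # this toy only needs the reused set to decide what survives.
        return InMemorySession(self, reused)


backend = InMemoryBackend(rows={"a.py": ("old_a",), "b.py": ("b",), "c.py": ("c",)})
session = backend.begin_index_session(
    changed={"a.py", "bad.py"}, deleted={"c.py"}, reused={"b.py"}
)
session.persist_batch(
    [AnalyzedFile("a.py", ("new_a",)), AnalyzedFile("bad.py", (), failed=True)]
)
session.commit()
session.close()
```

Because nothing becomes visible until commit, the toy backend needs no per-file rollback, which matches the expectation that savepoints are not a contract requirement.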

## Documentation And Tests

- Update docs/architecture/storage-backends.md, docs/plugins/backends.md, docs/process/performance-benchmarking.md, and the new execution ledger.
- Add or update contract tests using SQLite, DuckDB, and the in-memory validation backend.
- Required scenarios:
    - full index with successful files and one failing file leaves no partial rows for the failed file;
    - incremental change/delete/reuse behavior is identical across first-party backends;
    - unchanged warm index does not enter writer setup;
    - frequent query commands are read-only and do not perform repair/rebuild work;
    - DuckDB bulk writer does not use per-row executemany/lastrowid paths;
    - SQLite and DuckDB query outputs match for ctx, sym, calls, symlist, and audit.
- Run docstring enforcement after edits:
    - uv run codira index
    - uv run codira audit --json
    - fix all reported docstring issues using NumPy style
    - rerun uv run codira index and uv run codira audit --json
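
The "frequent query commands are read-only" scenario could be checked along these lines for the SQLite control backend. This is a sketch using only the standard library: open_read_only and the symbols table are hypothetical names, not the codira schema or API. It relies on SQLite's mode=ro URI parameter, which makes the connection reject writes.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file without write capability (hypothetical helper)."""
    # mode=ro asks SQLite to refuse any write on this connection.
    return sqlite3.connect(f"{db_path.as_uri()}?mode=ro", uri=True)


with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "index.db"

    # Writer side: build a tiny stand-in index.
    writer = sqlite3.connect(db_path)
    writer.execute("CREATE TABLE symbols (name TEXT)")
    writer.execute("INSERT INTO symbols VALUES ('main')")
    writer.commit()
    writer.close()

    # Query side: reads succeed, any repair-style write must fail.
    reader = open_read_only(db_path)
    names = [row[0] for row in reader.execute("SELECT name FROM symbols")]
    try:
        reader.execute("INSERT INTO symbols VALUES ('repair')")
        write_rejected = False
    except sqlite3.OperationalError:
        write_rejected = True
    finally:
        reader.close()
```

A DuckDB variant of the same check would presumably open the database with the read_only flag of duckdb.connect; that is left out here to keep the example free of third-party dependencies.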

## Performance And Merge Gates

- Establish same-branch pre-change baselines if the branch begins before implementation; otherwise compare against the saved .artifacts/analysis/2026-05-19-measurement-campaign-analysis.md and new post-change artifacts.
- Run paired SQLite and DuckDB campaigns on the short manifest first, then the broader benchmark manifest before merge.
- Required acceptance:
    - DuckDB full index mean within 25% of SQLite or within 5 seconds, whichever is more forgiving;
    - DuckDB warm index, ctx, sym, calls, symlist, and audit within 10% of SQLite or within 250 ms, whichever is more forgiving;
    - DuckDB index size no more than 2x SQLite on each short-campaign repo, with the stretch goal of 1.5x;
    - SQLite critical-command regression no more than 3% or 100 ms versus same-branch baseline;
    - no failed pre-commit run --all-files;
    - no failed pytest -q.
- If DuckDB cannot meet the strict gates after the native Arrow writer and cheap-readiness work, stop the branch before merge and update the GitHub issue with measured blocker evidence.
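
The "whichever is more forgiving" rule in the DuckDB timing gates amounts to taking the larger of the relative and absolute limits. A minimal sketch, assuming timings in seconds and sizes in bytes; within_gate, size_gate, and all numbers below are hypothetical illustrations, not measured results:

```python
def within_gate(candidate: float, baseline: float, *, rel: float, abs_slack: float) -> bool:
    """Return True if candidate is within the more forgiving of the two limits."""
    # The larger limit is the more forgiving one.
    limit = max(baseline * (1.0 + rel), baseline + abs_slack)
    return candidate <= limit


def size_gate(duckdb_bytes: int, sqlite_bytes: int, *, factor: float = 2.0) -> bool:
    """Return True if the DuckDB index is at most factor times the SQLite index."""
    return duckdb_bytes <= factor * sqlite_bytes


# Full index: 25% or 5 s. With a 10 s baseline the absolute limit (15 s) wins.
full_index_ok = within_gate(12.0, 10.0, rel=0.25, abs_slack=5.0)

# Same gate, but 16 s exceeds both 12.5 s and 15 s.
full_index_too_slow = within_gate(16.0, 10.0, rel=0.25, abs_slack=5.0)

# Warm commands: 10% or 250 ms. With a 0.3 s baseline the limit is 0.55 s.
ctx_ok = within_gate(0.5, 0.3, rel=0.10, abs_slack=0.25)

# Index size: 2x hard gate, 1.5x stretch goal.
size_ok = size_gate(180, 100)
size_stretch_ok = size_gate(180, 100, factor=1.5)
```

For small baselines the absolute slack dominates, and for large baselines the percentage does, so the gate stays meaningful across both short and broad manifests.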

