perf(backend): clarify backend contract and make DuckDB native-fast #30

@marco0560

Backend Contract And Native Backend Performance Plan

Summary

  • Branch: perf/backend-contract-duckdb-native.
  • GitHub issue: create one new umbrella issue in marco0560/codira, milestone Phase 1, labels enhancement, P2, area:backend, area:core, area:query, type:architecture.
  • Confirmed decisions: full backend contract redesign, DuckDB-native writer inside the existing duckdb plugin name, strict parity targets, and pyarrow allowed for typed bulk ingestion.
  • Non-goal: do not implement issue #20 (Feature: Introduce Optional Vector Database Backend for Semantic Retrieval); embeddings stay in the deterministic backend until the optional vector backend work opens.

Implementation Ledger

  1. Create the branch and GitHub issue first.
    • Title: perf(backend): clarify backend contract and make DuckDB native-fast
    • Body: use this plan as the issue body.
    • Record the created issue number in a new execution ledger: docs/process/issue--backend-performance-execution.md.
  2. Maintain the ledger throughout the branch.
    • Track phases, status, decisions, changed contract surfaces, benchmark commands, benchmark artifact paths, unresolved risks, and final merge evidence.
    • Update it after each phase, not only at the end.
  3. Merge only after the final benchmark campaign, docs alignment, codira audit, full validation, and issue acceptance checklist are complete.

Key Changes

  1. Redesign the backend contract before optimizing implementations.
    • Replace hidden SQLite-shaped assumptions with an explicit index-session contract in src/codira/contracts.py.
    • Define separate expectations for write sessions and frequent read/query operations.
    • Core expectations:
      • one active backend per repository instance;
      • deterministic query-equivalent results across first-party backends;
      • full index may rebuild or replace storage wholesale;
      • incremental index receives changed, deleted, and reused path sets;
      • failed files must not leave visible partial rows, but per-file DB rollback/savepoints are not required;
      • ctx, calls, sym, symlist, and audit must use a cheap read path and must not trigger schema repair, derived-index rebuilds, or writer setup.
    • Remove compatibility requirements for previous backend storage versions; bump schema/contract version and rebuild/fail fast as needed.
  2. Refactor core indexing around the new session flow.
    • Keep scanning, analyzer selection, and index planning in core.
    • Move backend mutation into begin_index_session(...) and a write-session object.
    • The session owns loading previous embeddings, preparing full/incremental storage, persisting analyzed files in batches, rebuilding derived indexes, writing runtime inventory, and handling commit, abort, and close.
    • Preserve the current index report semantics: successful files count as indexed, failed files are reported, and failed files are excluded from committed backend state.
  3. Rebuild DuckDB persistence as a native writer.
    • Add a DuckDB writer module that builds typed Arrow batches per logical table and bulk-loads them into DuckDB staging tables.
    • Remove the hot write path dependency on DB-API-style execute, executemany, cursor adapters, and per-row lastrowid.
    • Allocate internal IDs in batches, deterministically for full rebuilds and by bounded range allocation for incrementals.
    • Full index: write to a temporary DuckDB database, create indexes after bulk load, checkpoint, measure size, then atomically replace the active index.
    • Incremental index: write changed files into staging tables, delete changed/deleted paths, swap staged rows in one run-scoped transaction, then rebuild only required derived state.
    • Keep embeddings as typed binary Arrow data for now; do not introduce vector search or ANN behavior.
  4. Fix DuckDB warm-readiness and query paths.
    • Add a read-only/cheap-open path for frequent commands.
    • Move schema repair and migration checks out of normal query opens.
    • Ensure that a codira index run on an unchanged repository checks metadata, file hashes, analyzer inventory, and runtime inventory before any mutation setup.
    • Re-measure ctx, sym, calls, symlist, and audit only after full and warm index are fixed, then optimize the remaining query-specific regressions.
  5. Optimize SQLite as the control backend.
    • Port SQLite to the new index-session contract without changing operator behavior.
    • Keep SQLite savepoints only where they are still useful for incremental row-oriented writes; do not expose them as a contract requirement.
    • Optimize the observed regressions in ctx, warm index, symlist, audit, sym, and calls.
    • Focus on backend factory caching, connection reuse, cheaper readiness checks, fewer repeated backend calls from query producers, and batched reads for context assembly.
    • Add SQLite regression budgets so future backend-agnostic core work cannot silently slow critical commands.
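
Illustrative sketch of the write-session idea from changes 1 and 2, not the planned contents of src/codira/contracts.py. The names AnalyzedFile, IndexWriteSession, InMemorySession, and every method signature are assumptions made for this example. It also assumes that a changed file whose analysis fails is simply never handed to the session, so commit drops its old rows and publishes nothing new for it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class AnalyzedFile:
    """Hypothetical analyzer output for one successfully analyzed file."""

    path: str
    rows: tuple[str, ...]


class IndexWriteSession(Protocol):
    """Hypothetical write-session surface; method names are illustrative."""

    def persist_batch(self, files: list[AnalyzedFile]) -> None: ...

    def commit(self) -> None: ...

    def abort(self) -> None: ...


@dataclass
class InMemorySession:
    """Stages rows privately and publishes them only on commit.

    No per-file rollback or savepoint is needed: nothing is visible
    to readers until the whole run commits.
    """

    visible: dict[str, tuple[str, ...]]
    changed: frozenset[str] = frozenset()
    deleted: frozenset[str] = frozenset()
    _staged: dict[str, tuple[str, ...]] = field(default_factory=dict)

    def persist_batch(self, files: list[AnalyzedFile]) -> None:
        # Only successfully analyzed files are passed in by the caller.
        for analyzed in files:
            self._staged[analyzed.path] = analyzed.rows

    def commit(self) -> None:
        # Drop old rows for changed and deleted paths, then publish the
        # staged rows. Reused paths are left untouched.
        for path in self.changed | self.deleted:
            self.visible.pop(path, None)
        self.visible.update(self._staged)
        self._staged.clear()

    def abort(self) -> None:
        # Discard staged rows; visible state is unchanged.
        self._staged.clear()
```

A real backend would replace the dictionaries with staging tables and a run-scoped transaction; the point here is only that visibility changes once, at commit.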

Documentation And Tests

  • Update docs/architecture/storage-backends.md, docs/plugins/backends.md, docs/process/performance-benchmarking.md, and the new execution ledger.
  • Add or update contract tests using SQLite, DuckDB, and the in-memory validation backend.
  • Required scenarios:
    • full index with successful files and one failing file leaves no partial rows for the failed file;
    • incremental change/delete/reuse behavior is identical across first-party backends;
    • unchanged warm index does not enter writer setup;
    • frequent query commands are read-only and do not perform repair/rebuild work;
    • DuckDB bulk writer does not use per-row executemany/lastrowid paths;
    • SQLite and DuckDB query outputs match for ctx, sym, calls, symlist, and audit.
  • Run docstring enforcement after edits:
    • uv run codira index
    • uv run codira audit --json
    • fix all reported docstring issues using NumPy style
    • rerun uv run codira index and uv run codira audit --json
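
One possible way to test the "frequent query commands are read-only" scenario for the SQLite control backend is to open the index through a read-only URI, so any repair or rebuild attempt fails loudly instead of silently mutating the file. The sketch below uses only the standard sqlite3 module; the symbols table and the open_read_only helper are hypothetical and do not reflect codira's schema or API.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file so that any write attempt fails."""
    return sqlite3.connect(f"file:{db_path.as_posix()}?mode=ro", uri=True)


def demo() -> tuple[int, bool]:
    """Return (row count read, whether a write attempt was rejected)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "index.db"

        # Build a tiny stand-in index with a normal read-write connection.
        writer = sqlite3.connect(db_path)
        writer.execute("CREATE TABLE symbols (name TEXT)")
        writer.execute("INSERT INTO symbols VALUES ('main')")
        writer.commit()
        writer.close()

        reader = open_read_only(db_path)
        try:
            count = reader.execute("SELECT COUNT(*) FROM symbols").fetchone()[0]
            try:
                # Stand-in for schema repair on the query path.
                reader.execute("CREATE TABLE repair (x)")
                write_rejected = False
            except sqlite3.OperationalError:
                write_rejected = True
        finally:
            reader.close()
        return count, write_rejected
```

DuckDB's Python client accepts a read_only flag on duckdb.connect, which could serve the same purpose for the DuckDB side of the contract test.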

Performance And Merge Gates

  • Establish same-branch pre-change baselines if the branch begins before implementation; otherwise compare against the saved .artifacts/analysis/2026-05-19-measurement-campaign-analysis.md and new post-change artifacts.
  • Run paired SQLite and DuckDB campaigns on the short manifest first, then the broader benchmark manifest before merge.
  • Required acceptance:
    • DuckDB full index mean within 25% of SQLite or within 5 seconds, whichever is more forgiving;
    • DuckDB warm index, ctx, sym, calls, symlist, and audit within 10% of SQLite or within 250 ms, whichever is more forgiving;
    • DuckDB index size no more than 2x SQLite on each short-campaign repo, with the stretch goal of 1.5x;
    • SQLite critical-command regression no more than 3% or 100 ms versus same-branch baseline;
    • no failed pre-commit run --all-files;
    • no failed pytest -q.
  • If DuckDB cannot meet the strict gates after the native Arrow writer and cheap-readiness work, stop the branch before merge and update the GitHub issue with measured blocker evidence.
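
The "whichever is more forgiving" gates above reduce to a single comparison: the candidate passes if it stays within the larger of the relative and absolute allowances over the baseline. A minimal sketch, with hypothetical function names and example timings that are not measured results:

```python
def within_gate(candidate: float, baseline: float, *, rel: float, abs_slack: float) -> bool:
    """Return True if candidate is within rel (fraction) or abs_slack of baseline.

    The more forgiving of the two allowances applies. Both values must
    use the same unit (for example, seconds).
    """
    allowed = baseline + max(baseline * rel, abs_slack)
    return candidate <= allowed


def size_within_gate(duckdb_bytes: int, sqlite_bytes: int, *, factor: float = 2.0) -> bool:
    """Return True if the DuckDB index is at most factor times the SQLite index."""
    return duckdb_bytes <= factor * sqlite_bytes


# Full index gate: 25% or 5 seconds. With an 8 s baseline the absolute
# allowance (5 s) is larger than the relative one (2 s), so 12 s passes.
full_index_ok = within_gate(12.0, 8.0, rel=0.25, abs_slack=5.0)

# With a 40 s baseline the relative allowance (10 s) is larger, so the
# limit is 50 s and 52 s fails.
slow_full_index_ok = within_gate(52.0, 40.0, rel=0.25, abs_slack=5.0)

# Query gate: 10% or 250 ms, expressed in seconds.
ctx_ok = within_gate(0.45, 0.30, rel=0.10, abs_slack=0.25)
```

The SQLite regression gate (3% or 100 ms) is left out of the sketch because the plan does not say which of the two bounds applies when they disagree.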
