Feature: Introduce Optional Vector Database Backend for Semantic Retrieval #20

@marco0560

Status

  • Type: architecture / feature
  • Priority: medium (strategic, not immediate)
  • Version target: 2.0.0+
  • Design status: needs-design

Summary

Introduce an optional vector database backend to support semantic retrieval (embeddings-based search) in Codira.

The vector backend must be:

  • strictly non-authoritative
  • fully optional
  • architecturally separated from the deterministic backend

This feature is intended to support:

  • semantic search
  • documentation retrieval ranking
  • cross-language retrieval

without compromising Codira’s core guarantees of:

  • determinism
  • reproducibility
  • exactness

Motivation

Codira is evolving toward:

These introduce:

  • large volumes of text (especially from docs)
  • natural-language queries
  • need for semantic similarity

However:

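embeddings and ANN search are approximate: results can vary with the embedding model, index parameters, and build order, so they cannot be treated as deterministic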

Therefore:

semantic retrieval must be isolated from the deterministic core

Problem

Current state:

  • single backend stores:
    • deterministic data (symbols, metadata, edges)
    • embeddings (if enabled)

This creates:

  • mixed concerns (exact + approximate in same layer)
  • backend contract ambiguity
  • scalability limits
  • difficulty introducing specialized vector search

Proposed Direction

1. Split Backend Model

Introduce two distinct backend roles:

DeterministicBackend (authoritative)
    - symbols
    - edges
    - metadata
    - coverage data
    - doc structure (IDs, locations)

VectorBackend (optional, derived)
    - embeddings
    - similarity index
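
As a rough illustration of the split, the two roles could be expressed as separate Python protocols. This is only a sketch: the method names (`lookup_symbol`, `upsert`, `search`) and the `InMemoryVectorBackend` class are hypothetical and not part of any existing Codira API.

```python
from __future__ import annotations

import math
from typing import Protocol, Sequence


class DeterministicBackend(Protocol):
    """Authoritative role: symbols, edges, metadata, coverage, doc structure."""

    def lookup_symbol(self, name: str) -> list[str]:
        """Return the object IDs whose symbol name matches exactly."""
        ...


class VectorBackend(Protocol):
    """Optional, derived role: embeddings and a similarity index."""

    def upsert(self, object_id: str, vector: Sequence[float]) -> None: ...

    def search(self, vector: Sequence[float], k: int) -> list[tuple[str, float]]: ...


class InMemoryVectorBackend:
    """Toy brute-force implementation, used here only to exercise the protocol."""

    def __init__(self) -> None:
        self._vectors: dict[str, tuple[float, ...]] = {}

    def upsert(self, object_id: str, vector: Sequence[float]) -> None:
        self._vectors[object_id] = tuple(vector)

    def search(self, vector: Sequence[float], k: int) -> list[tuple[str, float]]:
        def cosine(a: Sequence[float], b: Sequence[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        scored = [(oid, cosine(vector, v)) for oid, v in self._vectors.items()]
        # Sort by descending score, then by ID, so ties resolve the same way every run.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:k]
```

Keeping the two protocols unrelated (no shared base class) makes it harder for deterministic code paths to depend on the vector role by accident.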

2. Strict Authority Boundary

Deterministic backend:

  • single source of truth
  • required for all operations

Vector backend:

  • derived cache
  • fully disposable
  • never required for correctness

3. Channel Mapping

symbol   → DeterministicBackend
docs     → DeterministicBackend (+ optional semantic ranking)
semantic → VectorBackend
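
The mapping above could be encoded as a small routing function. This is a sketch under assumptions: the `Channel` enum, the `backends_for` name, and the choice to return an empty tuple for the semantic channel when no vector backend is configured are all hypothetical.

```python
from __future__ import annotations

from enum import Enum


class Channel(Enum):
    SYMBOL = "symbol"
    DOCS = "docs"
    SEMANTIC = "semantic"


def backends_for(channel: Channel, vector_available: bool) -> tuple[str, ...]:
    """Return the backend roles consulted for a channel, in priority order."""
    if channel is Channel.SYMBOL:
        return ("deterministic",)
    if channel is Channel.DOCS:
        # Docs are always served deterministically; semantic ranking is an add-on.
        return ("deterministic", "vector") if vector_available else ("deterministic",)
    # Semantic channel: served only when a vector backend is configured.
    return ("vector",) if vector_available else ()
```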

4. Hard Constraints

C1 — Optionality

Codira MUST function fully without a vector backend.

C2 — Deterministic precedence

Semantic results MUST NOT override:

  • exact symbol matches
  • structural retrieval results
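
One way to read C1 and C2 together is a merge step in which semantic hits can only be appended after deterministic results, never reordered ahead of them. The function below is a minimal sketch; `merge_results` and its signature are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def merge_results(
    deterministic: list[str],
    semantic: Optional[list[tuple[str, float]]],
) -> list[str]:
    """Combine result IDs so deterministic matches always come first."""
    merged = list(deterministic)
    if not semantic:
        # C1: no vector backend (or no hits) leaves the result untouched.
        return merged
    seen = set(merged)
    # C2: semantic hits are only appended; ties are broken by ID for stable output.
    for object_id, _score in sorted(semantic, key=lambda pair: (-pair[1], pair[0])):
        if object_id not in seen:
            merged.append(object_id)
            seen.add(object_id)
    return merged
```

With this shape, removing the vector backend changes only the tail of the result list.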

C3 — Identity binding

Each embedding MUST include:

{
  object_id,
  file_hash,
  analyzer_version,
  embedding_model_id
}
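
The identity record could be modelled as a frozen dataclass, so that an embedding counts as stale as soon as any of the four fields changes. The field names come from C3; the class name and the `is_stale` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingIdentity:
    object_id: str
    file_hash: str
    analyzer_version: str
    embedding_model_id: str


def is_stale(stored: EmbeddingIdentity, current: EmbeddingIdentity) -> bool:
    """An embedding is stale if any identity field differs from the current state."""
    return stored != current
```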

C4 — Rebuild semantics

Vector backend MUST be rebuildable from deterministic data:

rm vector_db → recompute → valid state
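
A sketch of that rebuild cycle, assuming the deterministic backend can enumerate `object_id → text` records. `toy_embed` is a placeholder that stands in for a real embedding model, and the `index.json` layout is invented for illustration.

```python
from __future__ import annotations

import hashlib
import json
import shutil
from pathlib import Path


def toy_embed(text: str) -> list[float]:
    """Placeholder embedding: NOT a real model, just a stable function of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:4]]


def rebuild_vector_store(records: dict[str, str], store_dir: Path) -> dict[str, list[float]]:
    """Drop the derived store and recompute it from deterministic records."""
    if store_dir.exists():
        shutil.rmtree(store_dir)  # the "rm vector_db" step
    store_dir.mkdir(parents=True)
    # Iterate in sorted order so the rebuilt file does not depend on input order.
    index = {oid: toy_embed(text) for oid, text in sorted(records.items())}
    (store_dir / "index.json").write_text(json.dumps(index, sort_keys=True))
    return index
```

Because the store is derived entirely from its inputs, running the function twice on the same records yields the same index.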

C5 — No cross-backend transactions

  • no distributed consistency guarantees
  • eventual consistency acceptable

Relationship with Documentation Channel (#3)

The documentation channel introduces:

  • large volumes of text (Markdown sections, module docs)
  • unstructured content
  • semantic-heavy queries

Impact:

  • increases indexed units significantly
  • increases embedding footprint
  • increases reliance on semantic ranking

Conclusion:

The docs channel is the primary driver for introducing a vector backend.

When This Becomes Necessary

Indicative thresholds:

Metric                            Threshold
Indexed units (symbols + docs)    > 100k
Embedding storage                 > 500 MB
Query profile                     semantic-heavy
Repo type                         monorepo / multi-language

Below these thresholds, a unified backend may remain acceptable.
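
The two quantitative thresholds could be checked with a helper like the one below. It is only a sketch: `RepoStats` and `vector_backend_recommended` are hypothetical names, 500 MB is read as 500 × 10^6 bytes, and treating either threshold alone as sufficient is an assumption the issue does not state. The qualitative rows (query profile, repo type) are left out.

```python
from dataclasses import dataclass

INDEXED_UNITS_THRESHOLD = 100_000
EMBEDDING_BYTES_THRESHOLD = 500 * 10**6  # 500 MB, decimal megabytes assumed


@dataclass
class RepoStats:
    indexed_units: int    # symbols + docs
    embedding_bytes: int  # total size of stored embeddings


def vector_backend_recommended(stats: RepoStats) -> bool:
    """True if either indicative threshold is exceeded (assumed OR semantics)."""
    return (
        stats.indexed_units > INDEXED_UNITS_THRESHOLD
        or stats.embedding_bytes > EMBEDDING_BYTES_THRESHOLD
    )
```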

Candidate Vector Databases

The following systems are viable options for Codira.

1. FAISS

  • Type: library (in-process)
  • Storage: memory / disk (manual)
  • Language: C++ / Python bindings

Pros

  • extremely fast
  • no external service
  • deterministic build possible
  • ideal for local workstation

Cons

  • no persistence layer out-of-the-box
  • no query API (must wrap)
  • limited filtering / metadata support

Use case

  • local, single-user Codira instance
  • offline workflows

Workstation availability

  • excellent (pip / conda)

2. SQLite + vector extension (e.g. sqlite-vss)

  • Type: embedded DB
  • Storage: file-based

Pros

  • single-file persistence
  • easy integration
  • no additional service
  • fits current SQLite model

Cons

  • limited performance at scale
  • limited ANN capabilities
  • immature ecosystem

Use case

  • small to medium repositories
  • minimal setup environments

Workstation availability

  • good (requires extension build/install)

3. DuckDB + vector extensions

  • Type: embedded analytical DB

Pros

  • strong columnar performance
  • good for hybrid workloads
  • better scaling than SQLite

Cons

  • vector support still evolving
  • not purpose-built for ANN

Use case

  • medium-scale local setups

Workstation availability

  • excellent (pip install duckdb)

4. Qdrant

  • Type: dedicated vector DB (service)
  • Storage: disk + memory

Pros

  • production-grade ANN
  • filtering + payload support
  • REST + gRPC API
  • good local deployment (Docker)

Cons

  • requires running a service
  • more complex setup
  • external dependency

Use case

  • large repositories
  • multi-project setups
  • semantic-heavy workflows

Workstation availability

  • very good (Docker / binary)

5. pgvector (PostgreSQL extension)

  • Type: extension to PostgreSQL

Pros

  • integrates with relational data
  • transactional
  • mature ecosystem

Cons

  • requires PostgreSQL
  • heavier than needed for local use
  • ANN performance below that of specialized vector DBs

Use case

  • shared environments
  • multi-user setups

Workstation availability

  • good (requires PostgreSQL install)

6. Chroma

  • Type: developer-focused vector DB

Pros

  • simple API
  • Python-native
  • fast setup

Cons

  • evolving rapidly
  • weaker guarantees
  • less control over internals

Use case

  • prototyping
  • experimentation

Workstation availability

  • excellent (pip install)

Non-Goals

  • making semantic search authoritative
  • replacing deterministic retrieval
  • introducing heuristics into core logic
  • requiring vector DB for core functionality
  • supporting all vector DBs equally

Open Design Questions

  1. Backend interface:

    • separate VectorBackend protocol?
    • or extend existing backend abstraction?
  2. Storage model:

    • embeddings per symbol?
    • embeddings per chunk (docs)?
    • hybrid?
  3. Chunking policy:

    • fixed size vs structure-based (docs sections)
  4. Index lifecycle:

    • synchronous vs async embedding generation?
    • partial rebuild handling?
  5. Ranking integration:

    • how semantic scores combine with deterministic channels?
  6. CLI surface:

    • --semantic
    • --no-semantic
    • explain output for embeddings
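
For the CLI surface, Python's `argparse.BooleanOptionalAction` (Python 3.9+) generates a `--semantic` / `--no-semantic` pair from a single argument. The sketch below assumes a `codira` program name and a positional `query` argument, both hypothetical, and defaults to semantic retrieval being off so that it stays opt-in.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="codira")  # program name is an assumption
    parser.add_argument("query", help="search query (hypothetical positional)")
    parser.add_argument(
        "--semantic",
        action=argparse.BooleanOptionalAction,  # also registers --no-semantic
        default=False,                          # opt-in: off unless requested
        help="enable semantic retrieval via the optional vector backend",
    )
    return parser
```

Example: `build_parser().parse_args(["parse config", "--semantic"]).semantic` evaluates to `True`.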

Suggested Implementation Plan

Phase 1 — Design (required)

  • define VectorBackend interface
  • define identity model
  • define embedding pipeline

Phase 2 — Minimal implementation

  • FAISS-based backend (local, no service)
  • opt-in CLI flag
  • integration into ctx

Phase 3 — Docs channel integration

  • semantic ranking for documentation results

Phase 4 — Additional backends

  • Qdrant or pgvector

Acceptance Criteria

  • Codira works identically without vector backend
  • vector backend is fully optional
  • embeddings are reproducible and versioned
  • semantic results never override deterministic matches
  • vector backend can be deleted and rebuilt without side effects
  • explain/JSON output includes semantic provenance
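
As one possible shape for the last criterion, each result in the JSON output could carry a provenance block only when it came from the semantic channel. The field names (`channel`, `provenance`, `score`) and the `explain_result` function are hypothetical; `embedding_model_id` is taken from C3.

```python
from __future__ import annotations

import json
from typing import Optional


def explain_result(
    object_id: str,
    channel: str,
    score: Optional[float] = None,
    embedding_model_id: Optional[str] = None,
) -> str:
    """Serialize one result; semantic hits carry their provenance."""
    entry: dict[str, object] = {"object_id": object_id, "channel": channel}
    if channel == "semantic":
        entry["provenance"] = {
            "score": score,
            "embedding_model_id": embedding_model_id,
        }
    # sort_keys keeps the serialized output stable across runs.
    return json.dumps(entry, sort_keys=True)
```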

Key Principle

Semantic retrieval is a performance and usability enhancement, not a correctness mechanism.
