Feature: Introduce Optional Vector Database Backend for Semantic Retrieval
Status
- Type: architecture / feature
- Priority: medium (strategic, not immediate)
- Version target: 2.0.0+
- Design status: needs-design
Summary
Introduce an optional vector database backend to support semantic retrieval (embeddings-based search) in Codira.
The vector backend must be:
- strictly non-authoritative
- fully optional
- architecturally separated from the deterministic backend
This feature is intended to support:
- semantic search
- documentation retrieval ranking
- cross-language retrieval
without compromising Codira’s core guarantees of:
- determinism
- reproducibility
- exactness
Motivation
Codira is evolving toward:
- multiple retrieval channels (symbol, docs, semantic, …)
These introduce:
- large volumes of text (especially from docs)
- natural-language queries
- need for semantic similarity
However:
embeddings and ANN search are approximate and, in general, not reproducible bit-for-bit across models, hardware, or index builds
Therefore:
semantic retrieval must be isolated from the deterministic core
Problem
Current state:
- single backend stores:
  - deterministic data (symbols, metadata, edges)
  - embeddings (if enabled)
This creates:
- mixed concerns (exact + approximate in same layer)
- backend contract ambiguity
- scalability limits
- difficulty introducing specialized vector search
Proposed Direction
1. Split Backend Model
Introduce two distinct backend roles:
DeterministicBackend (authoritative)
- symbols
- edges
- metadata
- coverage data
- doc structure (IDs, locations)
VectorBackend (optional, derived)
- embeddings
- similarity index
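The two roles could be expressed as separate structural interfaces. The following is a minimal sketch, not a settled design: the method names (`lookup_symbol`, `upsert`, `search`) and the `InMemoryVectorBackend` stand-in are assumptions made for illustration, and whether a separate protocol is the right shape remains an open design question.

```python
import math
from typing import Protocol, Sequence


class DeterministicBackend(Protocol):
    """Authoritative role: symbols, edges, metadata, coverage, doc structure."""

    def lookup_symbol(self, name: str) -> list[str]:
        """Return object IDs for an exact symbol-name match (hypothetical method)."""
        ...


class VectorBackend(Protocol):
    """Optional, derived role: embeddings and a similarity index."""

    def upsert(self, object_id: str, vector: Sequence[float]) -> None:
        ...

    def search(self, vector: Sequence[float], k: int) -> list[tuple[str, float]]:
        ...


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorBackend:
    """Brute-force stand-in used only to exercise the VectorBackend shape."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def upsert(self, object_id: str, vector: Sequence[float]) -> None:
        self._vectors[object_id] = list(vector)

    def search(self, vector: Sequence[float], k: int) -> list[tuple[str, float]]:
        scored = [(oid, _cosine(vector, v)) for oid, v in self._vectors.items()]
        # Ties are broken by object ID so the ordering is stable across runs.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:k]
```

Keeping the vector role free of any symbol or metadata methods makes the separation visible in the type signatures: nothing that needs exact answers can accidentally depend on it.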
2. Strict Authority Boundary
Deterministic backend:
- single source of truth
- required for all operations
Vector backend:
- derived cache
- fully disposable
- never required for correctness
3. Channel Mapping
symbol → DeterministicBackend
docs → DeterministicBackend (+ optional semantic ranking)
semantic → VectorBackend
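The mapping above can be written down as a small lookup table. This is an illustrative sketch only: `BackendRole`, `CHANNEL_BACKEND`, and `active_channels` are hypothetical names, and the optional semantic ranking of docs results is deliberately left out.

```python
from enum import Enum


class BackendRole(Enum):
    DETERMINISTIC = "deterministic"
    VECTOR = "vector"


# Channel names come from the mapping above; the dict itself is an assumption.
CHANNEL_BACKEND = {
    "symbol": BackendRole.DETERMINISTIC,
    "docs": BackendRole.DETERMINISTIC,
    "semantic": BackendRole.VECTOR,
}


def active_channels(vector_available: bool) -> list[str]:
    """Return the channels that can be served in the current configuration."""
    return [
        channel
        for channel, role in CHANNEL_BACKEND.items()
        if role is BackendRole.DETERMINISTIC or vector_available
    ]
```

With no vector backend configured, only the channels served by the deterministic backend remain, which matches the optionality requirement.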
4. Hard Constraints
C1 — Optionality
Codira MUST function fully without a vector backend.
C2 — Deterministic precedence
Semantic results MUST NOT override:
- exact symbol matches
- structural retrieval results
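One possible way to honor this constraint is to treat semantic hits as strictly additive: they are appended after deterministic hits and dropped when they refer to an object already returned. The sketch below assumes a hypothetical `Hit` record and `merge_results` function; how scores should actually combine is still an open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    object_id: str
    channel: str
    score: float


def merge_results(deterministic: list[Hit], semantic: list[Hit]) -> list[Hit]:
    """Append semantic hits after deterministic ones without reordering them."""
    seen = {hit.object_id for hit in deterministic}
    merged = list(deterministic)
    # Sort semantic hits by descending score, then by ID for a stable order.
    for hit in sorted(semantic, key=lambda h: (-h.score, h.object_id)):
        if hit.object_id not in seen:
            seen.add(hit.object_id)
            merged.append(hit)
    return merged
```

Under this policy a semantic hit can never displace or reorder an exact or structural result, regardless of its similarity score.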
C3 — Identity binding
Each embedding MUST include:
{
object_id,
file_hash,
analyzer_version,
embedding_model_id
}
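The identity tuple could be modeled as an immutable record, so that a stored embedding is treated as stale whenever any field differs from the current state. The field names are taken from the list above; `EmbeddingIdentity` and `is_stale` are hypothetical names used for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingIdentity:
    object_id: str
    file_hash: str
    analyzer_version: str
    embedding_model_id: str


def is_stale(stored: EmbeddingIdentity, current: EmbeddingIdentity) -> bool:
    """An embedding is stale if any identity field no longer matches."""
    return stored != current
```

For example, changing only `embedding_model_id` is enough to invalidate the stored vector, even when the file content is unchanged.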
C4 — Rebuild semantics
Vector backend MUST be rebuildable from deterministic data:
rm vector_db → recompute → valid state
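The delete-and-recompute cycle can be illustrated with a toy store that is rebuilt purely from deterministic inputs. Everything here is an assumption made for the sketch: the `vectors.json` layout, the `rebuild_vector_store` function, and `toy_embed`, which stands in for a real embedding model.

```python
import json
import shutil
from pathlib import Path
from typing import Callable


def toy_embed(text: str) -> list[float]:
    """Deterministic placeholder for an embedding model (not a real one)."""
    return [float(len(text)), float(text.count(" "))]


def rebuild_vector_store(
    store_dir: Path,
    documents: dict[str, str],
    embed: Callable[[str], list[float]],
) -> None:
    """Drop the store entirely and recompute it from deterministic data."""
    if store_dir.exists():
        shutil.rmtree(store_dir)
    store_dir.mkdir(parents=True)
    vectors = {oid: embed(text) for oid, text in sorted(documents.items())}
    (store_dir / "vectors.json").write_text(json.dumps(vectors, sort_keys=True))
```

Because the rebuild reads nothing from the previous store, deleting the directory at any time leaves no state that cannot be recovered.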
C5 — No cross-backend transactions
- no distributed consistency guarantees
- eventual consistency acceptable
Relationship with Documentation Channel (#3)
The documentation channel introduces:
- large volumes of text (Markdown sections, module docs)
- unstructured content
- semantic-heavy queries
Impact:
- increases indexed units significantly
- increases embedding footprint
- increases reliance on semantic ranking
Conclusion:
The docs channel is the primary driver for introducing a vector backend.
When This Becomes Necessary
Indicative thresholds:
| Metric | Threshold |
|---|---|
| Indexed units (symbols + docs) | > 100k |
| Embedding storage | > 500 MB |
| Query profile | semantic-heavy |
| Repo type | monorepo / multi-language |
Below these thresholds, a unified backend may remain acceptable.
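As a rough illustration, the thresholds could be checked against a repository profile. The sketch reports which thresholds are exceeded rather than making a decision, since the proposal does not say how the metrics combine. `RepoProfile` and `exceeded_thresholds` are hypothetical names, and 500 MB is interpreted here as 500 × 10^6 bytes, which is an assumption.

```python
from dataclasses import dataclass

INDEXED_UNITS_THRESHOLD = 100_000
# Assumption: "500 MB" is read as decimal megabytes.
EMBEDDING_BYTES_THRESHOLD = 500 * 10**6


@dataclass(frozen=True)
class RepoProfile:
    indexed_units: int
    embedding_bytes: int
    semantic_heavy: bool
    monorepo_or_multi_language: bool


def exceeded_thresholds(profile: RepoProfile) -> list[str]:
    """List the indicative thresholds that this profile crosses."""
    checks = [
        ("indexed_units", profile.indexed_units > INDEXED_UNITS_THRESHOLD),
        ("embedding_storage", profile.embedding_bytes > EMBEDDING_BYTES_THRESHOLD),
        ("query_profile", profile.semantic_heavy),
        ("repo_type", profile.monorepo_or_multi_language),
    ]
    return [name for name, crossed in checks if crossed]
```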
Candidate Vector Databases
The following systems are viable options for Codira.
1. FAISS
- Type: library (in-process)
- Storage: memory / disk (manual)
- Language: C++ / Python bindings
Pros
- extremely fast
- no external service
- deterministic build possible
- ideal for local workstation
Cons
- no persistence layer out-of-the-box
- no query API (must wrap)
- limited filtering / metadata support
Use case
- local, single-user Codira instance
- offline workflows
Workstation availability
2. SQLite + vector extension (e.g. sqlite-vss)
- Type: embedded DB
- Storage: file-based
Pros
- single-file persistence
- easy integration
- no additional service
- fits current SQLite model
Cons
- limited performance at scale
- limited ANN capabilities
- immature ecosystem
Use case
- small to medium repositories
- minimal setup environments
Workstation availability
- good (requires extension build/install)
3. DuckDB + vector extensions
- Type: embedded analytical DB
Pros
- strong columnar performance
- good for hybrid workloads
- better scaling than SQLite
Cons
- vector support still evolving
- not purpose-built for ANN
Use case
- medium-scale local setups
Workstation availability
- excellent (pip install duckdb)
4. Qdrant
- Type: dedicated vector DB (service)
- Storage: disk + memory
Pros
- production-grade ANN
- filtering + payload support
- REST + gRPC API
- good local deployment (Docker)
Cons
- requires running a service
- more complex setup
- external dependency
Use case
- large repositories
- multi-project setups
- semantic-heavy workflows
Workstation availability
- very good (Docker / binary)
5. pgvector (PostgreSQL extension)
- Type: extension to PostgreSQL
Pros
- integrates with relational data
- transactional
- mature ecosystem
Cons
- requires PostgreSQL
- heavier than needed for local use
- ANN performance below that of specialized vector DBs
Use case
- shared environments
- multi-user setups
Workstation availability
- good (requires PostgreSQL install)
6. Chroma
- Type: developer-focused vector DB
Pros
- simple API
- Python-native
- fast setup
Cons
- evolving rapidly
- weaker guarantees
- less control over internals
Use case
- prototyping
- experimentation
Workstation availability
Non-Goals
- making semantic search authoritative
- replacing deterministic retrieval
- introducing heuristics into core logic
- requiring vector DB for core functionality
- supporting all vector DBs equally
Open Design Questions
- Backend interface:
  - separate VectorBackend protocol?
  - or extend existing backend abstraction?
- Storage model:
  - embeddings per symbol?
  - embeddings per chunk (docs)?
  - hybrid?
- Chunking policy:
  - fixed size vs structure-based (docs sections)
- Index lifecycle:
  - synchronous vs async embedding generation?
  - partial rebuild handling?
- Ranking integration:
  - how semantic scores combine with deterministic channels?
- CLI surface:
  - --semantic
  - --no-semantic
  - explain output for embeddings
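For the CLI question, one option is a single paired flag. The sketch below uses Python's standard `argparse.BooleanOptionalAction` (Python 3.9+), which generates both `--semantic` and `--no-semantic` from one argument. The program name, the `query` positional, and the default of off are assumptions, not part of the proposal.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "codira" as the program name and "query" are placeholders for this sketch.
    parser = argparse.ArgumentParser(prog="codira")
    parser.add_argument("query", help="search query")
    parser.add_argument(
        "--semantic",
        action=argparse.BooleanOptionalAction,
        default=False,  # assumption: semantic retrieval is opt-in
        help="enable or disable semantic retrieval",
    )
    return parser
```

Parsing `["--semantic", "parse config"]` sets `semantic` to `True`, while omitting the flag or passing `--no-semantic` leaves it `False`.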
Suggested Implementation Plan
Phase 1 — Design (required)
- define VectorBackend interface
- define identity model
- define embedding pipeline
Phase 2 — Minimal implementation
- FAISS-based backend (local, no service)
- opt-in CLI flag
- integration into ctx
Phase 3 — Docs channel integration
- semantic ranking for documentation results
Phase 4 — Additional backends
Acceptance Criteria
- Codira works identically without vector backend
- vector backend is fully optional
- embeddings are reproducible and versioned
- semantic results never override deterministic matches
- vector backend can be deleted and rebuilt without side effects
- explain/JSON output includes semantic provenance
Key Principle
Semantic retrieval is a performance and usability enhancement, not a correctness mechanism.