# Feature: Introduce Optional Vector Database Backend for Semantic Retrieval

## Status
- Type: architecture / feature
- Priority: medium (strategic, not immediate)
- Version target: 2.0.0+
- Design status: **needs-design**

## Summary

Introduce an **optional vector database backend** to support semantic retrieval (embeddings-based search) in Codira.

The vector backend must be:
- **strictly non-authoritative**
- **fully optional**
- **architecturally separated** from the deterministic backend

This feature is intended to support:
- semantic search
- documentation retrieval ranking
- cross-language retrieval

without compromising Codira’s core guarantees of:
- determinism
- reproducibility
- exactness

## Motivation

Codira is evolving toward:

- multi-channel retrieval (`symbol`, `docs`, `semantic`, …)
- multi-language indexing
- documentation-aware queries (#3)
- richer context building

These introduce:

- large volumes of text (especially from docs)
- natural-language queries
- need for semantic similarity

However:

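> embeddings and ANN search are approximate: results can vary with the embedding model, index parameters, and build order, so they cannot carry the exactness guarantees of the deterministic core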

Therefore:

> semantic retrieval must be isolated from the deterministic core

## Problem

Current state:

- single backend stores:
  - deterministic data (symbols, metadata, edges)
  - embeddings (if enabled)

This creates:

- mixed concerns (exact + approximate in same layer)
- backend contract ambiguity
- scalability limits
- difficulty introducing specialized vector search

## Proposed Direction

### 1. Split Backend Model

Introduce two distinct backend roles:

```
DeterministicBackend (authoritative)
    - symbols
    - edges
    - metadata
    - coverage data
    - doc structure (IDs, locations)

VectorBackend (optional, derived)
    - embeddings
    - similarity index
```
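
One way to picture the split is as two separate protocols, sketched below. All names here (`Hit`, `lookup_symbol`, `upsert`, `search`, `InMemoryVectorBackend`) are hypothetical and not part of any existing Codira interface; the brute-force cosine search only stands in for a real similarity index.

```python
import math
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Hit:
    """A single similarity result (hypothetical shape)."""
    object_id: str
    score: float


class DeterministicBackend(Protocol):
    """Authoritative role: symbols, edges, metadata, coverage, doc structure."""

    def lookup_symbol(self, name: str) -> list[str]: ...


class VectorBackend(Protocol):
    """Optional, derived role: embeddings and a similarity index."""

    def upsert(self, object_id: str, vector: Sequence[float]) -> None: ...

    def search(self, query: Sequence[float], k: int) -> list[Hit]: ...


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorBackend:
    """Exact brute-force search; a placeholder for an ANN index."""

    def __init__(self) -> None:
        self._vectors: dict[str, tuple[float, ...]] = {}

    def upsert(self, object_id: str, vector: Sequence[float]) -> None:
        self._vectors[object_id] = tuple(vector)

    def search(self, query: Sequence[float], k: int) -> list[Hit]:
        hits = [Hit(oid, _cosine(query, vec)) for oid, vec in self._vectors.items()]
        # Sort by score, then by ID, so ties are broken the same way every run.
        hits.sort(key=lambda h: (-h.score, h.object_id))
        return hits[:k]
```

Keeping the two protocols disjoint means a vector implementation cannot accidentally be asked for authoritative data.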

### 2. Strict Authority Boundary

#### Deterministic backend:
- single source of truth
- required for all operations

#### Vector backend:
- derived cache
- fully disposable
- never required for correctness

### 3. Channel Mapping

```
symbol   → DeterministicBackend
docs     → DeterministicBackend (+ optional semantic ranking)
semantic → VectorBackend
```
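
The mapping above could be expressed as a small dispatcher. This is a sketch under assumptions: the `route` function and its handler signatures are hypothetical, and returning an empty list for the `semantic` channel when no vector backend is configured is one possible policy, not a settled decision.

```python
from typing import Callable, Optional

# A handler takes a query string and returns object IDs (hypothetical shape).
Handler = Callable[[str], list[str]]


def route(
    channel: str,
    query: str,
    deterministic: Handler,
    semantic: Optional[Handler] = None,
) -> list[str]:
    """Dispatch a query to the backend role that owns the channel."""
    if channel == "symbol":
        return deterministic(query)
    if channel == "docs":
        results = deterministic(query)
        if semantic is not None:
            # Semantic ranking may only reorder deterministic results;
            # it never adds or removes entries.
            order = {oid: i for i, oid in enumerate(semantic(query))}
            results = sorted(results, key=lambda oid: order.get(oid, len(order)))
        return results
    if channel == "semantic":
        # Assumption: with no vector backend, the channel is simply empty.
        return semantic(query) if semantic is not None else []
    raise ValueError(f"unknown channel: {channel}")
```

Because `sorted` is stable, documents the semantic handler does not mention keep their deterministic order at the end of the list.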

### 4. Hard Constraints

#### C1 — Optionality
Codira MUST function fully without a vector backend.

#### C2 — Deterministic precedence
Semantic results MUST NOT override:
- exact symbol matches
- structural retrieval results
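
A minimal sketch of one way to enforce C2 is a tiered merge in which semantic hits can only fill positions after the deterministic ones. The function name `merge_results` and the three-tier ordering are assumptions made for illustration; the actual ranking integration is still an open design question.

```python
def merge_results(
    exact: list[str],
    structural: list[str],
    semantic: list[str],
    limit: int,
) -> list[str]:
    """Concatenate tiers in precedence order, dropping duplicates."""
    merged: list[str] = []
    seen: set[str] = set()
    for tier in (exact, structural, semantic):
        for object_id in tier:
            if object_id not in seen:
                seen.add(object_id)
                merged.append(object_id)
    # Truncation happens last, so semantic hits are the first to be cut.
    return merged[:limit]
```

With this shape, a semantic hit can never displace an exact or structural match, regardless of its similarity score.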

#### C3 — Identity binding
Each embedding MUST include:

```
{
  object_id,
  file_hash,
  analyzer_version,
  embedding_model_id
}
```
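
The four fields can be carried as an immutable record and compared to detect stale embeddings. The field names come from the list above; the class name `EmbeddingIdentity`, the `is_stale` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EmbeddingIdentity:
    object_id: str
    file_hash: str
    analyzer_version: str
    embedding_model_id: str


def is_stale(stored: EmbeddingIdentity, current: EmbeddingIdentity) -> bool:
    """An embedding is stale if any identity field differs from the current state."""
    return stored != current


# Usage: a changed file hash invalidates the stored embedding.
stored = EmbeddingIdentity("sym:pkg.mod.func", "abc123", "1.4.0", "model-x")
current = replace(stored, file_hash="def456")
```

Here `is_stale(stored, current)` is true because the file hash changed, while `is_stale(stored, stored)` is false.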

#### C4 — Rebuild semantics

Vector backend MUST be rebuildable from deterministic data:

```
rm vector_db → recompute → valid state
```
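
The rebuild property can be sketched as a pure function from deterministic data to an index. Everything here is illustrative: `rebuild_vector_index` is a hypothetical name, the index is a plain dict, and `toy_embed` is a stand-in for a real embedding model.

```python
from typing import Callable, Iterable


def rebuild_vector_index(
    units: Iterable[tuple[str, str]],
    embed: Callable[[str], list[float]],
) -> dict[str, list[float]]:
    """Recompute every embedding from (object_id, text) pairs."""
    index: dict[str, list[float]] = {}
    # Sorting fixes the build order, so two rebuilds insert in the same sequence.
    for object_id, text in sorted(units):
        index[object_id] = embed(text)
    return index


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding: character count and space count."""
    return [float(len(text)), float(text.count(" "))]


units = [("doc:2", "vector backend"), ("doc:1", "deterministic core")]
first = rebuild_vector_index(units, toy_embed)
second = rebuild_vector_index(reversed(units), toy_embed)  # simulates "rm, then recompute"
```

In this sketch `first == second`, which is the property C4 asks for: discarding the index and recomputing it from the deterministic data yields an equivalent state.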

#### C5 — No cross-backend transactions
- no distributed consistency guarantees
- eventual consistency acceptable

## Relationship with Documentation Channel (#3)

The documentation channel introduces:

- large volumes of text (Markdown sections, module docs)
- unstructured content
- semantic-heavy queries

Impact:

- increases indexed units significantly
- increases embedding footprint
- increases reliance on semantic ranking

Conclusion:

> The docs channel is the primary driver for introducing a vector backend.

## When This Becomes Necessary

Indicative thresholds:

| Metric | Threshold |
|------|--------|
| Indexed units (symbols + docs) | > 100k |
| Embedding storage | > 500 MB |
| Query profile | semantic-heavy |
| Repo type | monorepo / multi-language |

Below these thresholds, a unified backend may remain acceptable.
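
To see how the first two thresholds relate, the raw embedding footprint can be estimated as units × dimensions × bytes per value. The dimensions used below (384 and 1536) and the float32 storage assumption are illustrative only; the proposal does not fix an embedding model.

```python
def embedding_storage_mb(units: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in MB (10**6 bytes), excluding index overhead."""
    return units * dim * bytes_per_value / 1_000_000


small = embedding_storage_mb(100_000, 384)    # 153.6 MB
large = embedding_storage_mb(100_000, 1536)   # 614.4 MB
```

Under these assumptions, 100k units stay below 500 MB with a 384-dimensional model but exceed it with a 1536-dimensional one, so the two thresholds do not necessarily trip together.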

## Candidate Vector Databases

The following systems are viable options for Codira.

### 1. FAISS

- Type: library (in-process)
- Storage: memory / disk (manual)
- Language: C++ / Python bindings

**Pros**
- extremely fast
- no external service
- deterministic build possible
- ideal for local workstation

**Cons**
- no managed persistence layer (indexes must be saved and loaded explicitly)
- no service-level query API (in-process calls only; must be wrapped)
- limited filtering / metadata support

**Use case**
- local, single-user Codira instance
- offline workflows

**Workstation availability**
- excellent (pip / conda)

---

### 2. SQLite + vector extension (e.g. sqlite-vss)

- Type: embedded DB
- Storage: file-based

**Pros**
- single-file persistence
- easy integration
- no additional service
- fits current SQLite model

**Cons**
- limited performance at scale
- limited ANN capabilities
- immature ecosystem

**Use case**
- small to medium repositories
- minimal setup environments

**Workstation availability**
- good (requires extension build/install)

---

### 3. DuckDB + vector extensions

- Type: embedded analytical DB

**Pros**
- strong columnar performance
- good for hybrid workloads
- better scaling than SQLite

**Cons**
- vector support still evolving
- not purpose-built for ANN

**Use case**
- medium-scale local setups

**Workstation availability**
- excellent (pip install duckdb)

---

### 4. Qdrant

- Type: dedicated vector DB (service)
- Storage: disk + memory

**Pros**
- production-grade ANN
- filtering + payload support
- REST + gRPC API
- good local deployment (Docker)

**Cons**
- requires running a service
- more complex setup
- external dependency

**Use case**
- large repositories
- multi-project setups
- semantic-heavy workflows

**Workstation availability**
- very good (Docker / binary)

---

### 5. pgvector (PostgreSQL extension)

- Type: extension to PostgreSQL

**Pros**
- integrates with relational data
- transactional
- mature ecosystem

**Cons**
- requires PostgreSQL
- heavier than needed for local use
- ANN performance generally below that of specialized vector DBs

**Use case**
- shared environments
- multi-user setups

**Workstation availability**
- good (requires PostgreSQL install)

---

### 6. Chroma

- Type: developer-focused vector DB

**Pros**
- simple API
- Python-native
- fast setup

**Cons**
- evolving rapidly
- weaker guarantees
- less control over internals

**Use case**
- prototyping
- experimentation

**Workstation availability**
- excellent (pip install)

---

## Non-Goals

- making semantic search authoritative
- replacing deterministic retrieval
- introducing heuristics into core logic
- requiring vector DB for core functionality
- supporting all vector DBs equally

---

## Open Design Questions

1. Backend interface:
   - separate `VectorBackend` protocol?
   - or extend existing backend abstraction?

2. Storage model:
   - embeddings per symbol?
   - embeddings per chunk (docs)?
   - hybrid?

3. Chunking policy:
   - fixed size vs structure-based (docs sections)

4. Index lifecycle:
   - synchronous vs async embedding generation?
   - partial rebuild handling?

5. Ranking integration:
   - how semantic scores combine with deterministic channels?

6. CLI surface:
   - `--semantic`
   - `--no-semantic`
   - explain output for embeddings
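
For the CLI question, one possible shape for the paired flags is sketched below, assuming a Python `argparse`-based command line (an assumption; the proposal does not specify the CLI framework). The program name `codira` and the opt-in default of `False` are also assumptions, the latter chosen to match the opt-in flag in Phase 2.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="codira")
    # BooleanOptionalAction (Python 3.9+) generates both --semantic and --no-semantic.
    parser.add_argument(
        "--semantic",
        action=argparse.BooleanOptionalAction,
        default=False,
        help="enable the optional semantic channel",
    )
    return parser


args = build_parser().parse_args(["--semantic"])
```

With no flag given, `semantic` stays `False`, so behaviour without a vector backend remains the default.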

## Suggested Implementation Plan

### Phase 1 — Design (required)
- define VectorBackend interface
- define identity model
- define embedding pipeline

### Phase 2 — Minimal implementation
- FAISS-based backend (local, no service)
- opt-in CLI flag
- integration into `ctx`

### Phase 3 — Docs channel integration
- semantic ranking for documentation results

### Phase 4 — Additional backends
- Qdrant or pgvector

## Acceptance Criteria

- with no vector backend configured, Codira works fully and all deterministic channels return identical results
- vector backend is fully optional
- embeddings are reproducible and versioned
- semantic results never override deterministic matches
- vector backend can be deleted and rebuilt without side effects
- explain/JSON output includes semantic provenance

---

## Key Principle

> Semantic retrieval is a performance and usability enhancement, not a correctness mechanism.


Feature: Introduce Optional Vector Database Backend for Semantic Retrieval #20
