Production-ready Retrieval-Augmented Generation with ChromaDB, OpenAI embeddings, and hybrid search.
This is a standalone extraction from my production portfolio site. See it in action at danmonteiro.com.
You're building a RAG system but:
- Chunking is an afterthought — splitting on token count loses semantic meaning
- Pure vector search isn't enough — it misses obvious keyword matches
- No query understanding — "ML" doesn't find "machine learning" content
- Scattered implementation — embedding, storage, and retrieval are disconnected
RAG Pipeline provides:
- Semantic chunking — split documents by meaning, not arbitrary limits
- Hybrid search — vector similarity + keyword boosting
- Query expansion — automatic synonym handling (ML ↔ machine learning)
- Unified pipeline — embed, store, and retrieve in one clean API
```ts
import { Pipeline } from 'rag-pipeline';

const pipeline = new Pipeline();
await pipeline.initialize();

// Index documents
await pipeline.indexDocuments([
  { id: '1', title: 'ML Guide', content: '...', source: 'docs' }
]);

// Query with hybrid search
const result = await pipeline.query({
  question: "How does machine learning work?"
});

console.log(result.context); // Formatted context for LLM
console.log(result.chunks);  // Retrieved chunks with scores
```

From production usage:
| Metric | Vector Only | Hybrid Search |
|---|---|---|
| Recall@5 | 72% | 89% |
| Exact match boost | 0% | +20% score |
| Synonym coverage | None | Automatic |
Pure vector search has blind spots:

- **Exact matches matter**: If someone asks about "ChromaDB" and you have a doc titled "ChromaDB Setup Guide", that should rank higher—even if its embedding similarity is close to that of other database docs.
- **Acronyms are tricky**: "ML" and "machine learning" have different embeddings but mean the same thing. Query expansion catches this.
- **Precision vs recall tradeoff**: Vector search optimizes for semantic similarity; keyword boosting adds precision for obvious matches (sketched below).
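A minimal sketch of how a hybrid score can combine the two signals. The 0.2 weight mirrors the +20% exact-match boost in the table above, but the weighting and substring matching here are assumptions, not the library's internals:

```ts
// Illustrative hybrid scorer: vector similarity plus a keyword boost.
// The 0.2 weight and whole-string matching are assumptions, not the
// library's actual implementation.
function hybridScore(
  vectorSimilarity: number, // e.g. cosine similarity from ChromaDB
  queryTerms: string[],     // query terms after expansion (ml -> machine learning)
  chunkText: string,
): number {
  const text = chunkText.toLowerCase();
  const matched = queryTerms.filter((t) => text.includes(t.toLowerCase()));
  const keywordBoost = 0.2 * (matched.length / Math.max(queryTerms.length, 1));
  return vectorSimilarity + keywordBoost;
}

// A doc titled "ChromaDB Setup Guide" gets the full boost for this query
console.log(hybridScore(0.71, ['chromadb', 'setup'], 'ChromaDB Setup Guide'));
// -> 0.91
```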
The chunking flow:

```
Document → Paragraphs → Sentences (if needed) → Chunks
              ↓
  Respect semantic boundaries
              ↓
  Configurable overlap
```
The chunking service:
- Splits on paragraph boundaries first (semantic units)
- Only breaks paragraphs if they exceed max size
- Preserves metadata for source attribution
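A simplified sketch of that paragraph-first strategy (illustrative only; it omits the overlap, metadata, and sentence-level splitting the real service provides):

```ts
// Paragraph-first chunking: accumulate whole paragraphs until the word
// budget would be exceeded, then start a new chunk. Simplified sketch:
// no overlap, and oversized paragraphs are kept whole here rather than
// split by sentence as the real service does.
function chunkByParagraphs(content: string, maxWords = 500): string[] {
  const paragraphs = content.split(/\n\s*\n/);
  const chunks: string[] = [];
  let current: string[] = [];
  let count = 0;

  for (const para of paragraphs) {
    const words = para.trim().split(/\s+/).length;
    if (count + words > maxWords && current.length > 0) {
      chunks.push(current.join('\n\n'));
      current = [];
      count = 0;
    }
    current.push(para.trim());
    count += words;
  }
  if (current.length > 0) chunks.push(current.join('\n\n'));
  return chunks;
}
```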
The overall architecture:

```
┌─────────────────────────────────────────────────────────┐
│                        Pipeline                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  │
│  │  Chunking   │  │ Embeddings  │  │   Vector DB     │  │
│  │  Service    │──│  Service    │──│   (ChromaDB)    │  │
│  └─────────────┘  └─────────────┘  └─────────────────┘  │
│         ↓                ↓                  ↓           │
│   ┌─────────────────────────────────────────────────┐   │
│   │                 RAG Service                     │   │
│   │  • Query expansion • Hybrid search • Context    │   │
│   └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘
```
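In code, the top row of the diagram (Chunking → Embeddings → Vector DB) composes like this. A sketch using the service APIs documented below; the `content` field on chunks is an assumption, and `Pipeline.indexDocuments()` normally does this wiring for you:

```ts
import { ChunkingService, EmbeddingService, VectorDatabase } from 'rag-pipeline';

const chunker = new ChunkingService();
const embedder = new EmbeddingService({ model: 'text-embedding-3-large' });
const db = new VectorDatabase({ host: 'http://localhost:8000' });
await db.initialize();

// Chunking → Embeddings → Vector DB, mirroring the diagram
const chunks = chunker.chunkDocument({
  id: '1',
  title: 'ML Guide',
  content: '...',
  source: 'docs',
});
const vectors = await embedder.embedBatch(chunks.map((c) => c.content));
await db.upsertChunks(chunks, vectors);
```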
Install the package:

```bash
npm install rag-pipeline
```

Start ChromaDB:

```bash
# Using Docker
docker run -p 8000:8000 chromadb/chroma

# Or install locally
pip install chromadb
chroma run
```

Set the environment:

```bash
export OPENAI_API_KEY="sk-..."
export CHROMA_URL="http://localhost:8000"  # Optional, default
```

Then index and query:

```ts
import { Pipeline } from 'rag-pipeline';
const pipeline = new Pipeline();
await pipeline.initialize();

// Index your documents
await pipeline.indexDocuments([
  {
    id: 'doc-1',
    title: 'Introduction to RAG',
    content: 'Retrieval-Augmented Generation combines...',
    source: 'tutorials',
    tags: ['rag', 'llm', 'ai'],
  },
]);

// Query
const result = await pipeline.query({
  question: 'What is RAG?',
});

// Use the context with your LLM
const prompt = `Context:\n${result.context}\n\nQuestion: What is RAG?`;
```

`Pipeline` is the main entry point, combining all components.
```ts
const pipeline = new Pipeline({
  vectorDB: { host: 'http://localhost:8000' },
  embeddings: { model: 'text-embedding-3-large' },
  chunking: { maxWordsPerChunk: 500 },
  rag: { topK: 5, threshold: 0.5 },
});
```

| Method | Description |
|---|---|
| `initialize()` | Connect to ChromaDB |
| `indexDocument(doc)` | Index a single document |
| `indexDocuments(docs)` | Index multiple documents |
| `query(query)` | Retrieve relevant chunks |
| `deleteBySource(source)` | Delete all docs from a source |
| `clearAll()` | Clear all indexed data |
| `getStats()` | Get collection statistics |
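For example, re-indexing a source can combine `deleteBySource()` with `indexDocuments()`. The shape of the object returned by `getStats()` isn't documented here, so the sketch just logs it:

```ts
// Replace everything previously indexed from the 'tutorials' source
await pipeline.deleteBySource('tutorials');
await pipeline.indexDocuments([
  { id: 'doc-1', title: 'Updated RAG Guide', content: '...', source: 'tutorials' },
]);

// Inspect the collection
const stats = await pipeline.getStats();
console.log(stats);
```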
`ChunkingService` performs semantic document chunking.

```ts
import { ChunkingService } from 'rag-pipeline';

const chunker = new ChunkingService({
  maxWordsPerChunk: 500,
  minWordsPerChunk: 50,
  overlapWords: 50,
});

const chunks = chunker.chunkDocument({
  id: '1',
  title: 'My Doc',
  content: '...',
  source: 'docs',
});
```

`EmbeddingService` provides OpenAI embeddings with optional dimension reduction.
```ts
import { EmbeddingService } from 'rag-pipeline';

const embeddings = new EmbeddingService({
  model: 'text-embedding-3-large',
  dimensions: 1024, // Optional reduction
});

const vector = await embeddings.embed('Hello world');
const vectors = await embeddings.embedBatch(['Hello', 'World']);
```

`VectorDatabase` is a ChromaDB wrapper with a clean interface.
```ts
import { VectorDatabase } from 'rag-pipeline';

const db = new VectorDatabase({
  host: 'http://localhost:8000',
  collectionName: 'my_collection',
  distanceMetric: 'cosine',
});

await db.initialize();
await db.upsertChunks(chunks, embeddings);
const results = await db.similaritySearch(queryEmbedding, 5);
```

| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | Yes | - | OpenAI API key |
| `CHROMA_URL` | No | `http://localhost:8000` | ChromaDB URL |
| `CHROMA_COLLECTION` | No | `knowledge_base` | Collection name |
| `EMBEDDING_MODEL` | No | `text-embedding-3-large` | Embedding model |
| `EMBEDDING_DIMENSIONS` | No | Native | Reduced dimensions |
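Constructor options cover the same settings as these variables; the sketch below assumes the environment supplies defaults and explicit options take precedence (an assumption, since precedence isn't spelled out here):

```ts
import { Pipeline } from 'rag-pipeline';

// Uses OPENAI_API_KEY plus the table's defaults for everything else
const fromEnv = new Pipeline();

// Explicit options, assumed to take precedence over the environment
const explicit = new Pipeline({
  vectorDB: { host: 'http://chroma.internal:8000' }, // hypothetical host
  embeddings: { model: 'text-embedding-3-large' },
});
```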
```ts
const pipeline = new Pipeline({
  rag: {
    topK: 5,                     // Chunks to retrieve
    threshold: 0.5,              // Minimum similarity
    enableQueryExpansion: true,  // Synonym handling
    enableHybridSearch: true,    // Keyword boosting
    synonymMap: {                // Custom synonyms
      'k8s': ['kubernetes', 'Kubernetes'],
    },
  },
});
```

```
rag-pipeline/
├── src/
│   ├── index.ts        # Main exports + Pipeline class
│   ├── vector-db.ts    # ChromaDB integration
│   ├── embeddings.ts   # OpenAI embeddings
│   ├── chunking.ts     # Semantic chunking
│   └── rag.ts          # RAG service (retrieval logic)
├── examples/
│   └── basic-usage.ts
├── docs/
│   └── architecture.md
└── README.md
```
```ts
import { ChunkingService, DocumentInput } from 'rag-pipeline';

const chunker = new ChunkingService();

// Chunk by markdown sections
const chunks = chunker.chunkBySections({
  id: '1',
  title: 'Guide',
  content: '## Section 1\n...\n## Section 2\n...',
  source: 'docs',
});

// Create an overview chunk for a document
const doc: DocumentInput = {
  id: '2',
  title: 'Guide',
  content: '...',
  source: 'docs',
  category: 'AI', // field assumed to exist on DocumentInput
  tags: ['rag', 'llm'],
};
const overview = chunker.createOverviewChunk(doc, [
  { value: doc.title },
  { label: 'Category', value: doc.category },
  { label: 'Tags', value: doc.tags?.join(', ') },
]);
```

```ts
const pipeline = new Pipeline();
await pipeline.initialize();

// Get individual services
const embeddings = pipeline.getEmbeddingService();
const vectorDB = pipeline.getVectorDB();
const chunking = pipeline.getChunkingService();
const rag = pipeline.getRAGService();

// Use directly
const vector = await embeddings.embed('test');
```

```ts
// Filter by source
const bySource = await pipeline.query({
  question: 'How to deploy?',
  filters: { source: 'tutorials' },
});

// Filter by category
const byCategory = await pipeline.query({
  question: 'What is RAG?',
  filters: { category: 'AI' },
});
```

This repo is one layer in a broader approach to context continuity — giving AI systems the right context at the right time.
| Layer | Role | This Repo |
|---|---|---|
| Intra-session | Short-term memory | — |
| Document-scoped | Injected content | — |
| Retrieved | Long-term semantic memory | rag-pipeline |
| Progressive | Staged responses | — |
RAG provides the persistent memory layer — documents indexed once, retrieved on demand. Combined with session caching and document injection, it creates seamless context continuity for users.
Related repos:
- chatbot-widget — Session cache, Research Mode, conversation export
- mcp-rag-server — This pipeline as MCP tools
- ai-orchestrator — Complexity-based model routing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feat/add-new-feature`)
- Make changes with semantic commits
- Open a PR with a clear description
MIT License - see LICENSE for details.
Built with Claude Code.
Co-Authored-By: Claude <noreply@anthropic.com>