A vector-database-powered semantic search system for WSO2 documentation. This project uses PostgreSQL with the pgvector extension and Ollama embeddings to enable natural language queries over technical documentation.
- Semantic Search: Ask questions in natural language and get relevant documentation chunks
- Local Embeddings: Uses Ollama with the `nomic-embed-text` model for privacy and zero API costs
- Vector Similarity: Leverages PostgreSQL's pgvector extension for efficient similarity search
- Document Chunking: Automatically splits documents into searchable chunks with overlap
- Retry Logic: Handles model loading/unloading with exponential backoff
- Docker Support: Easy PostgreSQL setup with pgvector extension
- Multiple Formats: Supports `.txt` and `.md` files
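The "chunks with overlap" behavior above can be sketched as follows. This is a hypothetical illustration, not the actual `makeChunks` implementation in `FolderToPgvectorIngestor.java`, though the real code uses the same idea (1200-character chunks with 200-character overlap):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of fixed-size chunking with overlap; the actual
// makeChunks in FolderToPgvectorIngestor.java may differ in detail.
public class Chunker {
    public static List<String> makeChunks(String text, int size, int overlap) {
        List<String> chunks = new ArrayList<>();
        // Advance by (size - overlap) so consecutive chunks share context;
        // guard against overlap >= size, which would loop forever.
        int step = Math.max(1, size - overlap);
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break; // last chunk reached
        }
        return chunks;
    }
}
```

The overlap means a sentence cut at a chunk boundary still appears whole in the next chunk, which helps similarity search find it.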
- Java 17+
- Maven 3.6+
- Docker & Docker Compose
- Ollama installed and running
- PostgreSQL client tools (optional, for CLI access)
┌─────────────────────────────┐
│  Documentation              │
│  (.md, .txt)                │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  FolderToPgvectorIngestor   │
│  - Reads files              │
│  - Chunks text (1200/200)   │
│  - Generates embeddings     │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  Ollama API                 │
│  (nomic-embed-text model)   │
│  - 768-dimensional vectors  │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  PostgreSQL + pgvector      │
│  - documents table          │
│  - chunks table             │
│  - Vector similarity index  │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  QueryPgvectorChunks        │
│  - Natural language query   │
│  - Vector similarity search │
│  - Top-K results            │
└─────────────────────────────┘
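The query path above can be sketched in Java. This is a hypothetical illustration (the class, method names, and SQL are made up, and the actual `QueryPgvectorChunks` code may differ); it assumes the embedding is passed as a pgvector text literal and cast with `?::vector`, and uses `<=>`, pgvector's cosine-distance operator, which matches the `vector_cosine_ops` index this project creates:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Top-K similarity search.
public class QuerySketch {
    public static final String SQL =
        "SELECT d.title, d.file_name, c.content, " +
        "c.embedding <=> ?::vector AS distance " +
        "FROM chunks c JOIN documents d ON d.id = c.document_id " +
        "ORDER BY distance LIMIT ?";

    // Format a float[] embedding as a pgvector literal like "[0.1,0.2]".
    public static String toPgVector(float[] v) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < v.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(v[i]);
        }
        return sb.append(']').toString();
    }

    // Run the query and return matching chunk contents, best match first.
    public static List<String> topK(Connection conn, float[] embedding, int k)
            throws SQLException {
        List<String> results = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            ps.setString(1, toPgVector(embedding));
            ps.setInt(2, k);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) results.add(rs.getString("content"));
            }
        }
        return results;
    }
}
```

Smaller distance means a closer semantic match, which is why the example output below reports results ranked by distance.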
git clone https://github.com/BechtelCanDoIt/wso2-docs-kb.git
cd wso2-docs-kb

# Start PostgreSQL with pgvector
docker-compose up -d
# The database will be automatically initialized with the required schema

# Install Ollama (see https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the embedding model
ollama pull nomic-embed-text

Create a `.env` file in the project root:
# PostgreSQL Configuration
PG_JDBC_URL=jdbc:postgresql://localhost:5432/knowledge_base
PG_USER=kb_user
PG_PASS=kb_password
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=nomic-embed-text
# Query Configuration
QUERY_TOP_K=5
# Ingestion Configuration (optional)
ROOT_DIR=docs

cd code
mvn clean package
cd ..

# Ingest all .txt and .md files from a directory
./read_documents.sh docs/your-documentation-folder
# Example output:
# Processing docs/admin-overview.md
# Inserted 5 chunks for admin-overview.md

# Ask a question
./ask_question.sh "How do I manage users in WSO2?"
# Example output:
# Rank 1
# Distance 0.373
# Title Manage Users and Roles
# File managing-users.md
# Content Admin users can log in to...

CREATE TABLE documents (
id uuid PRIMARY KEY,
title text,
file_name text,
file_path text,
tags text[],
created_at timestamptz DEFAULT now()
);

CREATE TABLE chunks (
id uuid PRIMARY KEY,
document_id uuid REFERENCES documents(id) ON DELETE CASCADE,
chunk_index integer,
heading_path text,
content text,
embedding vector(768) -- nomic-embed-text dimension
);
CREATE INDEX chunks_embedding_idx ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

Edit FolderToPgvectorIngestor.java:
List<String> chunks = makeChunks(text,
1200, // chunk size (characters)
200 // overlap (characters)
);

Edit .env to change the number of results:

QUERY_TOP_K=10  # Return top 10 results instead of 5

Edit the embed() method in both Java files:
int maxRetries = 3; // Number of retries
int retryDelayMs = 2000; // Initial delay in ms
// Exponential backoff: 2s → 4s → 8s

After bulk ingestion, optimize your database:

psql -U kb_user -d knowledge_base

-- Update statistics and optimize indexes
VACUUM ANALYZE chunks;
VACUUM ANALYZE documents;
REINDEX TABLE chunks;
REINDEX TABLE documents;
-- Check database stats
SELECT
COUNT(*) as total_chunks,
pg_size_pretty(pg_total_relation_size('chunks')) as total_size
FROM chunks;

To clear all ingested data (the CASCADE also removes the associated chunks):

TRUNCATE TABLE documents CASCADE;

Problem: Ollama unloads the model between requests, causing 500 errors.
Solution 1: Set Ollama to keep models loaded
export OLLAMA_KEEP_ALIVE=-1

or

export OLLAMA_KEEP_ALIVE=10m

# Restart Ollama

Problem: Cannot connect to Ollama server.
Solutions:
- Ensure Ollama is running: `ps aux | grep ollama`
- Check that the URL in `.env` matches your Ollama instance
- Test connectivity: `curl http://localhost:11434/api/tags`
Problem: Embeddings API returns empty array.
Cause: Using the wrong API field name (`input` instead of `prompt`).
Solution: This is fixed in the provided code. Ensure you're using the latest version.
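The retry-with-backoff behavior described in the configuration section, together with the correct `prompt` field, can be sketched as follows. This is a hypothetical illustration (the class, helper names, and hand-rolled JSON are made up), not the actual embed() implementation:

```java
// Hypothetical sketch of retrying an Ollama embeddings call with
// exponential backoff; the real embed() in the two Java files may differ.
public class EmbedSketch {
    interface Call<T> { T run() throws Exception; }

    // Build the /api/embeddings request body. The field must be
    // "prompt" -- "input" yields an empty embedding array.
    public static String requestBody(String model, String text) {
        // Naive escaping of backslashes and quotes only, for illustration.
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"model\":\"" + model + "\",\"prompt\":\"" + escaped + "\"}";
    }

    // Retry a call that may fail transiently (e.g. model still loading),
    // doubling the delay each attempt: 2s -> 4s -> 8s with the defaults.
    public static <T> T withBackoff(Call<T> call, int maxRetries, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.run();
            } catch (Exception e) {
                if (attempt >= maxRetries) throw e; // out of retries
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```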
- Run `VACUUM ANALYZE` after bulk inserts
- Consider increasing the `lists` parameter in the ivfflat index for larger datasets
- Monitor query performance: typical queries should take under 100 ms
-- For larger datasets (100k+ chunks), increase lists
DROP INDEX chunks_embedding_idx;
CREATE INDEX chunks_embedding_idx ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000); -- Adjust based on dataset size

- Hybrid Search: Combine vector similarity with keyword search
- Metadata Filtering: Filter by tags, file names, or dates
- Reranking: Post-process results to prioritize exact matches
- Web UI: Simple web interface for queries
- Batch Queries: Process multiple questions at once
- Document Updates: Handle document versioning and updates
- Multiple Models: Support different embedding models
- RAG Integration: Add LLM-based answer generation
- API Server: REST API for queries
Why isn't exact keyword match always rank #1?
Vector search finds semantic similarity, not keyword matches. A document about "user management and administration" might rank higher than one titled "admin-overview" because the content is semantically closer to your query.
For better exact matches, consider:
- More descriptive queries: "What is the administration overview for WSO2?"
- Hybrid search (vector + keyword)
- Metadata filtering on file names
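One simple way to implement the hybrid option is reciprocal rank fusion (RRF), which merges a vector-ranked list and a keyword-ranked list without having to normalize their score scales. A hedged sketch, not part of this project's code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reciprocal rank fusion: each document earns
// 1 / (k + rank) from every list it appears in; highest total wins.
public class HybridRank {
    public static List<String> fuse(List<String> vectorRanked,
                                    List<String> keywordRanked,
                                    int k) {
        Map<String, Double> score = new HashMap<>();
        for (int i = 0; i < vectorRanked.size(); i++)
            score.merge(vectorRanked.get(i), 1.0 / (k + i + 1), Double::sum);
        for (int i = 0; i < keywordRanked.size(); i++)
            score.merge(keywordRanked.get(i), 1.0 / (k + i + 1), Double::sum);
        List<String> out = new ArrayList<>(score.keySet());
        out.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
        return out;
    }
}
```

A document that ranks well in both lists (such as an exact keyword match that is also semantically close) rises to the top, which directly addresses the FAQ above.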
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
Apache 2 License
- pgvector - Postgres extension for vector similarity search
- Ollama - Local LLM and embedding server
- nomic-embed-text - High-quality embedding model
- WSO2 - Documentation source
For issues and questions:
- Open an issue on GitHub
- Check existing issues for solutions
- Review the Troubleshooting section above
Yes, most of this was generated using AI, since this project is about learning a new skill. Is this the best way? Probably not. Is it production ready? NOPE! The intent is simply to understand the basic concepts of chunking documents and inserting them into a vector database.