WSO2 Documentation Knowledge Base

A vector database-powered semantic search system for WSO2 documentation. This project uses PostgreSQL with the pgvector extension and Ollama embeddings to enable natural language queries over technical documentation.

🚀 Features

  • Semantic Search: Ask questions in natural language and get relevant documentation chunks
  • Local Embeddings: Uses Ollama with the nomic-embed-text model for privacy and no API costs
  • Vector Similarity: Leverages PostgreSQL pgvector for efficient similarity search
  • Document Chunking: Automatically splits documents into searchable chunks with overlap
  • Retry Logic: Handles model loading/unloading with exponential backoff
  • Docker Support: Easy PostgreSQL setup with the pgvector extension
  • Multiple Formats: Supports .txt and .md files

📋 Prerequisites

  • Java 17+
  • Maven 3.6+
  • Docker & Docker Compose
  • Ollama installed and running
  • PostgreSQL client tools (optional, for CLI access)

🏗️ Architecture

┌─────────────────┐
│  Documentation  │
│   (.md, .txt)   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────┐
│  FolderToPgvectorIngestor   │
│  - Reads files              │
│  - Chunks text (1200/200)   │
│  - Generates embeddings     │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│      Ollama API             │
│  (nomic-embed-text model)   │
│  - 768-dimensional vectors  │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│   PostgreSQL + pgvector     │
│  - documents table          │
│  - chunks table             │
│  - Vector similarity index  │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│   QueryPgvectorChunks       │
│  - Natural language query   │
│  - Vector similarity search │
│  - Top-K results            │
└─────────────────────────────┘

🛠️ Installation

1. Clone the Repository

git clone https://github.com/BechtelCanDoIt/wso2-docs-kb.git
cd wso2-docs-kb

2. Set Up PostgreSQL with pgvector

# Start PostgreSQL with pgvector
docker-compose up -d

# The database will be automatically initialized with the required schema

3. Install Ollama and Pull the Model

# Install Ollama (see https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the embedding model
ollama pull nomic-embed-text

4. Configure Environment Variables

Create a .env file in the project root:

# PostgreSQL Configuration
PG_JDBC_URL=jdbc:postgresql://localhost:5432/knowledge_base
PG_USER=kb_user
PG_PASS=kb_password

# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=nomic-embed-text

# Query Configuration
QUERY_TOP_K=5

# Ingestion Configuration (optional)
ROOT_DIR=docs
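
As a rough illustration of how these settings might be read on the Java side, here is a minimal sketch. It assumes the shell scripts export the .env values as environment variables, which this README does not confirm; the class and method names (KbConfigSketch, get, topK) are hypothetical, and the fallback of 5 simply mirrors the example value above.

```java
import java.util.Map;

// Hypothetical helper, not the project's actual configuration code.
final class KbConfigSketch {

    /** Returns the trimmed value for key, or fallback when it is missing or blank. */
    static String get(Map<String, String> env, String key, String fallback) {
        String value = env.get(key);
        return (value == null || value.isBlank()) ? fallback : value.trim();
    }

    /** Parses QUERY_TOP_K, falling back to 5 (the value shown in the example .env). */
    static int topK(Map<String, String> env) {
        return Integer.parseInt(get(env, "QUERY_TOP_K", "5"));
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        System.out.println("JDBC URL: " + get(env, "PG_JDBC_URL", "jdbc:postgresql://localhost:5432/knowledge_base"));
        System.out.println("Top K:    " + topK(env));
    }
}
```

Passing the environment in as a Map keeps the helper easy to exercise without touching the real process environment.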

5. Build the Java Project

cd code
mvn clean package
cd ..

📚 Usage

Ingest Documentation

# Ingest all .txt and .md files from a directory
./read_documents.sh docs/your-documentation-folder

# Example output:
# Processing docs/admin-overview.md
# Inserted 5 chunks for admin-overview.md
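
The file selection step behind ingestion could look roughly like the sketch below. It is not taken from FolderToPgvectorIngestor: the names DocFinderSketch, isSupported, and findDocs are hypothetical, and matching extensions case-insensitively is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of selecting .txt and .md files under a root folder.
final class DocFinderSketch {

    /** True for file names ending in .md or .txt (case-insensitive by assumption). */
    static boolean isSupported(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".md") || lower.endsWith(".txt");
    }

    /** Walks root recursively and returns the supported regular files in sorted order. */
    static List<Path> findDocs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> isSupported(p.getFileName().toString()))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : "docs");
        findDocs(root).forEach(p -> System.out.println("Processing " + p));
    }
}
```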

Query the Knowledge Base

# Ask a question
./ask_question.sh "How do I manage users in WSO2?"

# Example output:
# Rank      1
# Distance  0.373
# Title     Manage Users and Roles
# File      managing-users.md
# Content   Admin users can log in to...

🗄️ Database Schema

Documents Table

CREATE TABLE documents (
  id         uuid PRIMARY KEY,
  title      text,
  file_name  text,
  file_path  text,
  tags       text[],
  created_at timestamptz DEFAULT now()
);

Chunks Table

CREATE TABLE chunks (
  id           uuid PRIMARY KEY,
  document_id  uuid REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index  integer,
  heading_path text,
  content      text,
  embedding    vector(768)  -- nomic-embed-text dimension
);

CREATE INDEX chunks_embedding_idx ON chunks 
USING ivfflat (embedding vector_cosine_ops) 
WITH (lists = 100);
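
Given this schema, a top-K lookup might be expressed as in the sketch below. This is an assumption about the shape of the query, not the actual code in QueryPgvectorChunks: it uses pgvector's cosine distance operator (<=>) to match the vector_cosine_ops index, passes the query embedding as a text literal cast with ?::vector, and the names TopKQuerySketch, toVectorLiteral, and prepareTopK are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch of a cosine-distance top-K query against the chunks table.
final class TopKQuerySketch {

    // Smaller distance means a closer match, so results are ordered ascending.
    static final String TOP_K_SQL =
        "SELECT d.title, d.file_name, c.content, "
        + "c.embedding <=> ?::vector AS distance "
        + "FROM chunks c JOIN documents d ON d.id = c.document_id "
        + "ORDER BY distance LIMIT ?";

    /** Formats an embedding in pgvector's text form, for example [0.1,0.2,0.3]. */
    static String toVectorLiteral(float[] embedding) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < embedding.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(embedding[i]);
        }
        return sb.append(']').toString();
    }

    /** Binds the query embedding and the result limit to the statement. */
    static PreparedStatement prepareTopK(Connection conn, float[] queryEmbedding, int topK)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(TOP_K_SQL);
        ps.setString(1, toVectorLiteral(queryEmbedding)); // cast to vector inside the SQL
        ps.setInt(2, topK);
        return ps;
    }
}
```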

⚙️ Configuration

Chunking Parameters

Edit FolderToPgvectorIngestor.java:

List<String> chunks = makeChunks(text, 
    1200,  // chunk size (characters)
    200    // overlap (characters)
);
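
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. The project's real makeChunks may behave differently (for example, it could split on headings or sentence boundaries); this version is a plain sliding window over characters, and the class name ChunkSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the implementation in FolderToPgvectorIngestor.java.
final class ChunkSketch {

    /** Splits text into windows of at most size characters; neighbours share overlap characters. */
    static List<String> makeChunks(String text, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        List<String> chunks = new ArrayList<>();
        int step = size - overlap; // how far the window advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Small numbers make the overlap visible; the README defaults are 1200/200.
        System.out.println(makeChunks("abcdefghij", 4, 1)); // prints [abcd, defg, ghij]
    }
}
```

With the defaults of 1200 and 200, each window starts 1000 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.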

Search Parameters

Edit .env to change the number of results:

# Return the top 10 results instead of 5
QUERY_TOP_K=10

Retry Configuration

Edit the embed() method in both Java files:

int maxRetries = 3;        // Number of retries
int retryDelayMs = 2000;   // Initial delay in ms
// Exponential backoff: 2s → 4s → 8s
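
The retry behaviour these settings describe can be sketched as a small generic wrapper. This is an illustration under assumptions, not the body of the project's embed() method: the names RetrySketch and withRetry are hypothetical, and retrying on every exception type is a simplification.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of retry with exponential backoff.
final class RetrySketch {

    /** Runs call, retrying up to maxRetries times and doubling the delay after each failure. */
    static <T> T withRetry(Callable<T> call, int maxRetries, long initialDelayMs) throws Exception {
        long delayMs = initialDelayMs;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: surface the last error
                }
                Thread.sleep(delayMs); // 2000, 4000, 8000 ms with the values shown above
                delayMs *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, then succeeds; short delays keep the demo quick.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("model not loaded yet");
            }
            return "ok after " + calls[0] + " calls";
        }, 3, 10);
        System.out.println(result); // prints: ok after 3 calls
    }
}
```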

🔧 Database Maintenance

After bulk ingestion, optimize your database:

# Open a psql session (host as in PG_JDBC_URL), then run the SQL below
psql -h localhost -U kb_user -d knowledge_base
-- Update statistics and optimize indexes
VACUUM ANALYZE chunks;
VACUUM ANALYZE documents;
REINDEX TABLE chunks;
REINDEX TABLE documents;

-- Check database stats
SELECT 
    COUNT(*) as total_chunks,
    pg_size_pretty(pg_total_relation_size('chunks')) as total_size
FROM chunks;

Empty Test Data

TRUNCATE TABLE documents CASCADE;

🐛 Troubleshooting

Model Keeps Unloading

Problem: Ollama unloads the model between requests, causing 500 errors.

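Solution: Set Ollama to keep models loaded

# Keep models loaded indefinitely
export OLLAMA_KEEP_ALIVE=-1

# Or keep them loaded for 10 minutes after the last request
export OLLAMA_KEEP_ALIVE=10m

# Restart Ollama so the new value takes effect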

Connection Refused to Ollama

Problem: Cannot connect to Ollama server.

Solutions:

  • Ensure Ollama is running: ps aux | grep ollama
  • Check the URL in .env matches your Ollama instance
  • Test connectivity: curl http://localhost:11434/api/tags

Empty Embeddings (dimension 0)

Problem: The embeddings API returns an empty array.

Cause: The request used the wrong field name (input instead of prompt).

Solution: This is fixed in the provided code. Ensure you're using the latest version.
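
For reference, a request that uses the prompt field might be built as in the sketch below. It assumes the code targets Ollama's /api/embeddings endpoint, which is the one that takes a prompt field; the README does not name the endpoint, and the class and method names here are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch of building an Ollama embeddings request with the prompt field.
final class EmbedRequestSketch {

    /** Escapes a string for use inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds the JSON body; note the field is "prompt", not "input". */
    static String embeddingsBody(String model, String text) {
        return "{\"model\":\"" + jsonEscape(model) + "\",\"prompt\":\"" + jsonEscape(text) + "\"}";
    }

    /** Builds a POST request against baseUrl + /api/embeddings (assumed endpoint). */
    static HttpRequest buildRequest(String baseUrl, String model, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/embeddings"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(embeddingsBody(model, text)))
                .build();
    }

    public static void main(String[] args) {
        System.out.println(embeddingsBody("nomic-embed-text", "How do I manage users?"));
    }
}
```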

📊 Performance Considerations

For 10,000+ Chunks

  • Run VACUUM ANALYZE after bulk inserts
  • Consider increasing the lists parameter in the ivfflat index for larger datasets
  • Monitor query performance: typical queries should take under 100 ms

Vector Index Tuning

-- For larger datasets, increase lists. pgvector's guidance is about rows / 1000
-- for up to 1M rows, so lists = 1000 suits roughly 1M chunks.
DROP INDEX chunks_embedding_idx;
CREATE INDEX chunks_embedding_idx ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);  -- Adjust based on dataset size
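
As a starting point for choosing these values, pgvector's documentation suggests about rows / 1000 lists for up to 1M rows and sqrt(rows) above that, with ivfflat.probes near sqrt(lists) at query time. The sketch below just encodes that rule of thumb; the class and method names are hypothetical, and the results are starting values to measure against, not tuned settings.

```java
// Hypothetical helper encoding pgvector's rule-of-thumb starting values for ivfflat.
final class IvfflatTuningSketch {

    /** rows / 1000 for up to 1M rows, sqrt(rows) beyond that, never below 1. */
    static int suggestedLists(long rows) {
        if (rows <= 1_000_000L) {
            return (int) Math.max(1L, rows / 1000);
        }
        return (int) Math.round(Math.sqrt((double) rows));
    }

    /** sqrt(lists) as a starting value for SET ivfflat.probes, never below 1. */
    static int suggestedProbes(int lists) {
        return (int) Math.max(1L, Math.round(Math.sqrt((double) lists)));
    }

    public static void main(String[] args) {
        long rows = 100_000L;
        int lists = suggestedLists(rows);
        System.out.println("rows=" + rows + " lists=" + lists + " probes=" + suggestedProbes(lists));
    }
}
```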

🔮 Possible Future Enhancements

  • Hybrid Search: Combine vector similarity with keyword search
  • Metadata Filtering: Filter by tags, file names, or dates
  • Reranking: Post-process results to prioritize exact matches
  • Web UI: Simple web interface for queries
  • Batch Queries: Process multiple questions at once
  • Document Updates: Handle document versioning and updates
  • Multiple Models: Support different embedding models
  • RAG Integration: Add LLM-based answer generation
  • API Server: REST API for queries

📝 Understanding Vector Search

Why isn't an exact keyword match always rank #1?

Vector search finds semantic similarity, not keyword matches. A document about "user management and administration" might rank higher than one titled "admin-overview" because the content is semantically closer to your query.

For better exact matches, consider:

  • More descriptive queries: "What is the administration overview for WSO2?"
  • Hybrid search (vector + keyword)
  • Metadata filtering on file names
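
To see what the ranking measures, here is a minimal sketch of cosine distance, assuming the Distance value in the query output is the cosine distance used by the vector_cosine_ops index (1 minus cosine similarity). The class name is hypothetical, and real embeddings have 768 dimensions rather than the two used here.

```java
// Hypothetical sketch of the cosine distance behind the similarity ranking.
final class CosineDistanceSketch {

    /** Returns 1 - cos(angle between a and b): 0 for the same direction, 2 for opposite. */
    static double cosineDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {1, 0};
        // Only direction matters, not length or shared keywords.
        System.out.println(cosineDistance(query, new double[]{3, 0}));  // same direction
        System.out.println(cosineDistance(query, new double[]{0, 1}));  // orthogonal
        System.out.println(cosineDistance(query, new double[]{-1, 0})); // opposite
    }
}
```

Because the score depends only on the direction of the embedding vectors, a chunk whose overall meaning is close to the query can outrank one that merely repeats a keyword from it.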

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

📄 License

Apache License 2.0

🙏 Acknowledgments

  • pgvector - Postgres extension for vector similarity search
  • Ollama - Local LLM and embedding server
  • nomic-embed-text - High-quality embedding model
  • WSO2 - Documentation source

📞 Support

For issues and questions:

  • Open an issue on GitHub
  • Check existing issues for solutions
  • Review the Troubleshooting section above

AI Note

Yes, most of this was generated using AI, since this is for learning a new skill. Is this the best way? Probably not. Is this production ready? NOPE! The intention is just to understand the basic concept of chunking text and inserting it into a vector database.

About

A LEARNING lab example of chunking documents into a vector extension for PostgreSQL. THIS IS NOT FOR PRODUCTION!!!
