A vector-database-powered semantic search system for WSO2 documentation. This project uses PostgreSQL with the pgvector extension and Ollama embeddings to enable natural language queries over technical documentation.
- Semantic Search: Ask questions in natural language and get relevant documentation chunks
- Local Embeddings: Uses Ollama with the `nomic-embed-text` model for privacy and zero API costs
- Vector Similarity: Leverages PostgreSQL's pgvector extension for efficient similarity search
- Document Chunking: Automatically splits documents into searchable chunks with overlap
- Retry Logic: Handles model loading/unloading with exponential backoff
- Docker Support: Easy PostgreSQL setup with pgvector extension
- Multiple Formats: Supports `.txt` and `.md` files
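The "chunks with overlap" behavior above can be sketched as follows. This is a hypothetical illustration, not the actual `makeChunks` implementation in `FolderToPgvectorIngestor.java`, though the real code uses the same idea (1200-character chunks with 200-character overlap):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of fixed-size chunking with overlap; the actual
// makeChunks in FolderToPgvectorIngestor.java may differ in detail.
public class Chunker {
    public static List<String> makeChunks(String text, int size, int overlap) {
        List<String> chunks = new ArrayList<>();
        // Advance by (size - overlap) so consecutive chunks share context;
        // guard against overlap >= size, which would loop forever.
        int step = Math.max(1, size - overlap);
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break; // last chunk reached
        }
        return chunks;
    }
}
```

The overlap means a sentence cut at a chunk boundary still appears whole in the next chunk, which helps similarity search find it.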
- Java 17+
- Maven 3.6+
- Docker & Docker Compose
- Ollama installed and running
- PostgreSQL client tools (optional, for CLI access)
┌─────────────────────────────┐
│  Documentation              │
│  (.md, .txt)                │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  FolderToPgvectorIngestor   │
│  - Reads files              │
│  - Chunks text (1200/200)   │
│  - Generates embeddings     │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  Ollama API                 │
│  (nomic-embed-text model)   │
│  - 768-dimensional vectors  │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  PostgreSQL + pgvector      │
│  - documents table          │
│  - chunks table             │
│  - Vector similarity index  │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  QueryPgvectorChunks        │
│  - Natural language query   │
│  - Vector similarity search │
│  - Top-K results            │
└─────────────────────────────┘
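The query path above can be sketched in Java. This is a hypothetical illustration (the class, method names, and SQL are made up, and the actual `QueryPgvectorChunks` code may differ); it assumes the embedding is passed as a pgvector text literal and cast with `?::vector`, and uses `<=>`, pgvector's cosine-distance operator, which matches the `vector_cosine_ops` index this project creates:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Top-K similarity search.
public class QuerySketch {
    public static final String SQL =
        "SELECT d.title, d.file_name, c.content, " +
        "c.embedding <=> ?::vector AS distance " +
        "FROM chunks c JOIN documents d ON d.id = c.document_id " +
        "ORDER BY distance LIMIT ?";

    // Format a float[] embedding as a pgvector literal like "[0.1,0.2]".
    public static String toPgVector(float[] v) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < v.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(v[i]);
        }
        return sb.append(']').toString();
    }

    // Run the query and return matching chunk contents, best match first.
    public static List<String> topK(Connection conn, float[] embedding, int k)
            throws SQLException {
        List<String> results = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            ps.setString(1, toPgVector(embedding));
            ps.setInt(2, k);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) results.add(rs.getString("content"));
            }
        }
        return results;
    }
}
```

Smaller distance means a closer semantic match, which is why the example output below reports results ranked by distance.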
git clone https://github.com/BechtelCanDoIt/wso2-docs-kb.git
cd wso2-docs-kb

# Start PostgreSQL with pgvector
docker-compose up -d
# The database will be automatically initialized with the required schema

# Install Ollama (see https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the embedding model
ollama pull nomic-embed-text

Create a `.env` file in the project root:
# PostgreSQL Configuration
PG_JDBC_URL=jdbc:postgresql://localhost:5432/knowledge_base
PG_USER=kb_user
PG_PASS=kb_password
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=nomic-embed-text
# Query Configuration
QUERY_TOP_K=5
# Ingestion Configuration (optional)
ROOT_DIR=docs

cd code
mvn clean package
cd ..

# Ingest all .txt and .md files from a directory
./read_documents.sh docs/your-documentation-folder
# Example output:
# Processing docs/admin-overview.md
# Inserted 5 chunks for admin-overview.md

# Ask a question
./ask_question.sh "How do I manage users in WSO2?"
# Example output:
# Rank 1
# Distance 0.373
# Title Manage Users and Roles
# File managing-users.md
# Content Admin users can log in to...

CREATE TABLE documents (
id uuid PRIMARY KEY,
title text,
file_name text,
file_path text,
tags text[],
created_at timestamptz DEFAULT now()
);

CREATE TABLE chunks (
id uuid PRIMARY KEY,
document_id uuid REFERENCES documents(id) ON DELETE CASCADE,
chunk_index integer,
heading_path text,
content text,
embedding vector(768) -- nomic-embed-text dimension
);
CREATE INDEX chunks_embedding_idx ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

Edit FolderToPgvectorIngestor.java:
List<String> chunks = makeChunks(text,
1200, // chunk size (characters)
200 // overlap (characters)
);

Edit .env to change the number of results:

QUERY_TOP_K=10  # Return top 10 results instead of 5

Edit the embed() method in both Java files:
int maxRetries = 3; // Number of retries
int retryDelayMs = 2000; // Initial delay in ms
// Exponential backoff: 2s → 4s → 8s

After bulk ingestion, optimize your database:

psql -U kb_user -d knowledge_base

-- Update statistics and optimize indexes
VACUUM ANALYZE chunks;
VACUUM ANALYZE documents;
REINDEX TABLE chunks;
REINDEX TABLE documents;
-- Check database stats
SELECT
COUNT(*) as total_chunks,
pg_size_pretty(pg_total_relation_size('chunks')) as total_size
FROM chunks;

To clear all ingested data (the CASCADE also removes the associated chunks):

TRUNCATE TABLE documents CASCADE;

Problem: Ollama unloads the model between requests, causing 500 errors.
Solution 1: Set Ollama to keep models loaded
export OLLAMA_KEEP_ALIVE=-1

or

export OLLAMA_KEEP_ALIVE=10m

# Restart Ollama

Problem: Cannot connect to Ollama server.
Solutions:
- Ensure Ollama is running: `ps aux | grep ollama`
- Check that the URL in `.env` matches your Ollama instance
- Test connectivity: `curl http://localhost:11434/api/tags`
Problem: Embeddings API returns empty array.
Cause: Using the wrong API field name (`input` instead of `prompt`).
Solution: This is fixed in the provided code. Ensure you're using the latest version.
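The retry-with-backoff behavior described in the configuration section, together with the correct `prompt` field, can be sketched as follows. This is a hypothetical illustration (the class, helper names, and hand-rolled JSON are made up), not the actual embed() implementation:

```java
// Hypothetical sketch of retrying an Ollama embeddings call with
// exponential backoff; the real embed() in the two Java files may differ.
public class EmbedSketch {
    interface Call<T> { T run() throws Exception; }

    // Build the /api/embeddings request body. The field must be
    // "prompt" -- "input" yields an empty embedding array.
    public static String requestBody(String model, String text) {
        // Naive escaping of backslashes and quotes only, for illustration.
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"model\":\"" + model + "\",\"prompt\":\"" + escaped + "\"}";
    }

    // Retry a call that may fail transiently (e.g. model still loading),
    // doubling the delay each attempt: 2s -> 4s -> 8s with the defaults.
    public static <T> T withBackoff(Call<T> call, int maxRetries, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.run();
            } catch (Exception e) {
                if (attempt >= maxRetries) throw e; // out of retries
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```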
- Run `VACUUM ANALYZE` after bulk inserts
- Consider increasing the `lists` parameter in the ivfflat index for larger datasets
- Monitor query performance: typical queries should take under 100 ms
-- For larger datasets (100k+ chunks), increase lists
DROP INDEX chunks_embedding_idx;
CREATE INDEX chunks_embedding_idx ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000); -- Adjust based on dataset size

- Hybrid Search: Combine vector similarity with keyword search
- Metadata Filtering: Filter by tags, file names, or dates
- Reranking: Post-process results to prioritize exact matches
- Web UI: Simple web interface for queries
- Batch Queries: Process multiple questions at once
- Document Updates: Handle document versioning and updates
- Multiple Models: Support different embedding models
- RAG Integration: Add LLM-based answer generation
- API Server: REST API for queries
Why isn't exact keyword match always rank #1?
Vector search finds semantic similarity, not keyword matches. A document about "user management and administration" might rank higher than one titled "admin-overview" because the content is semantically closer to your query.
For better exact matches, consider:
- More descriptive queries: "What is the administration overview for WSO2?"
- Hybrid search (vector + keyword)
- Metadata filtering on file names
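One simple way to implement the hybrid option is reciprocal rank fusion (RRF), which merges a vector-ranked list and a keyword-ranked list without having to normalize their score scales. A hedged sketch, not part of this project's code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reciprocal rank fusion: each document earns
// 1 / (k + rank) from every list it appears in; highest total wins.
public class HybridRank {
    public static List<String> fuse(List<String> vectorRanked,
                                    List<String> keywordRanked,
                                    int k) {
        Map<String, Double> score = new HashMap<>();
        for (int i = 0; i < vectorRanked.size(); i++)
            score.merge(vectorRanked.get(i), 1.0 / (k + i + 1), Double::sum);
        for (int i = 0; i < keywordRanked.size(); i++)
            score.merge(keywordRanked.get(i), 1.0 / (k + i + 1), Double::sum);
        List<String> out = new ArrayList<>(score.keySet());
        out.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
        return out;
    }
}
```

A document that ranks well in both lists (such as an exact keyword match that is also semantically close) rises to the top, which directly addresses the FAQ above.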
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
Apache 2 License
- pgvector - Postgres extension for vector similarity search
- Ollama - Local LLM and embedding server
- nomic-embed-text - High-quality embedding model
- WSO2 - Documentation source
For issues and questions:
- Open an issue on GitHub
- Check existing issues for solutions
- Review the Troubleshooting section above
Yes, most of this was generated using AI, since this project is about learning a new skill. Is this the best way? Probably not. Is it production ready? NOPE! The intent is simply to understand the basic concepts of chunking documents and inserting them into a vector database.