Executive Summary
System Architecture
Technology Stack
Features
Installation
Running the Application
Document Ingestion Pipeline
Chunking Strategy
Embedding Generation
Vector Storage & Retrieval
LLM Reasoning & Response Generation
Security Framework
Multi-Language Support
Evaluation Metrics
API Documentation
Performance Optimization
Project Structure
Troubleshooting
Intellecta is a production-grade, offline-capable Retrieval-Augmented Generation (RAG) system designed for secure, on-premise document analysis and intelligent question-answering. Built entirely on open-source technologies, it enables organizations to deploy AI-powered document intelligence without cloud dependencies, ensuring data sovereignty and compliance with air-gapped security requirements.
| Capability | Description |
| --- | --- |
| Document Intelligence | Process PDF, DOCX, CSV, Excel, and more |
| Semantic Search | Find relevant information using AI embeddings |
| AI-Powered Q&A | Get intelligent answers grounded in your documents |
| Security Controls | 5-level security clearance system |
| Multi-Language | English, Korean, Vietnamese support |
| Offline Operation | No cloud dependencies, air-gapped ready |
USER INTERFACE (React + TypeScript + Vite)
  Dashboard | Query/Response | Doc Ingestion | History & Logs
        │
        ▼
REST API LAYER (FastAPI + Python)
  /query | /ingest | /documents | /security/auto-detect
        │
        ▼
INGESTION PIPELINE
  Document Parser (PDF, CSV, DOCX) → Text Chunking (512 tokens) → E5 Embedding (1024-dim) → Vector Storage (pgvector)
RAG ORCHESTRATOR
  Query Embedding → Vector Search (pgvector) → Context Build → LLM Reasoning (LLaMA 3 8B) → Translation (Mistral 7B)
SECURITY FRAMEWORK
  Pattern Detection (SSN, Salary, etc.) → Clearance Levels (PUBLIC → TOP_SECRET)
        │
        ▼
DATA LAYER
  PostgreSQL + pgvector | Document Registry (JSON) | Query History (JSON)
        │
        ▼
LLM INFERENCE LAYER (Ollama Runtime)
  LLaMA 3 8B (4.6 GB): Reasoning, Answer Generation
  Mistral 7B (4.1 GB): Translation (Quality Mode), Refinement
User Query → Embedding → Vector Search → Context Assembly → LLM Reasoning → Response
Cross-cutting steps: Security Check (on the query), Chunk Filtering (after vector search), Translation (after LLM reasoning)
Backend

| Component | Technology | Version | Purpose |
| --- | --- | --- | --- |
| Framework | FastAPI | 0.104+ | REST API, async support |
| Language | Python | 3.11+ | Core programming |
| Database | PostgreSQL | 15+ | Relational storage |
| Vector DB | pgvector | 0.5+ | Similarity search |
| LLM Runtime | Ollama | 0.1+ | Local model inference |
| Embeddings | sentence-transformers | 2.2+ | Text embeddings |
Frontend

| Component | Technology | Version | Purpose |
| --- | --- | --- | --- |
| Framework | React | 18+ | UI components |
| Build Tool | Vite | 5+ | Fast development |
| Language | TypeScript | 5+ | Type safety |
| Styling | Tailwind CSS | 3+ | Utility-first CSS |
| Components | shadcn/ui | latest | UI component library |
| Charts | Recharts | 2+ | Data visualization |
AI Models

| Model | Parameters | Size | License | Purpose |
| --- | --- | --- | --- | --- |
| LLaMA 3 8B | 8 Billion | 4.6 GB | Meta Open | Reasoning, Generation |
| Mistral 7B | 7 Billion | 4.1 GB | Apache 2.0 | Translation, Refinement |
| E5-large-v2 | 335 Million | 1.3 GB | MIT | Text Embeddings |
Dual LLM Mode Switcher
Toggle between Fast and Quality modes directly from the UI:
Fast Mode: Uses LLaMA 3 8B for all tasks (~30-60s per query)
Quality Mode: Uses LLaMA 3 8B + Mistral 7B for better translations (~40-90s per query)
Dual Security Checking
Security is enforced at two levels:
Query Analysis: Scans query text for sensitive keywords
Document Analysis: Scans retrieved content for sensitive patterns
Effective Level: Uses the higher of the query and document security levels
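The two-level check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual detector: the two regular expressions and their mapping to levels are hypothetical, chosen only because this README lists SSN as RESTRICTED content and financial data as CONFIDENTIAL. The function names `detect_level` and `effective_level` are also assumptions.

```python
import re

# Numeric values follow the clearance table in this README.
LEVELS = {"PUBLIC": 1, "INTERNAL": 2, "CONFIDENTIAL": 3, "RESTRICTED": 4, "TOP_SECRET": 5}

# HYPOTHETICAL patterns for illustration only; the real rules may differ.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "RESTRICTED"),              # SSN-like number
    (re.compile(r"\bsalar(y|ies)\b", re.IGNORECASE), "CONFIDENTIAL"),  # salary keyword
]


def detect_level(text: str) -> str:
    """Return the highest level whose pattern matches, or PUBLIC if none match."""
    level = "PUBLIC"
    for pattern, candidate in PATTERNS:
        if pattern.search(text) and LEVELS[candidate] > LEVELS[level]:
            level = candidate
    return level


def effective_level(query: str, chunks: list[str]) -> str:
    """Take the higher of the query level and the most sensitive retrieved chunk."""
    levels = [detect_level(query)] + [detect_level(chunk) for chunk in chunks]
    return max(levels, key=LEVELS.__getitem__)
```

A harmless query can therefore still be escalated by what retrieval returns: `effective_level("plant capacity", ["SSN 123-45-6789 on file"])` yields `"RESTRICTED"`.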
Multi-Language Support
English: Native support
Korean: Full translation pipeline
Vietnamese: Full translation pipeline
Accuracy, Precision, Efficiency, Throughput scores
High-quality chunk ratio
Retrieval and generation timing
Filter queries to specific documents
Multi-select document picker
Auto-detect security level from content
Persistent history with timestamps
Replay previous queries
Delete individual entries
System status monitoring
Performance charts
Document statistics
Downloadable reports (Markdown format)
Python 3.11+
Node.js 18+
PostgreSQL 15+ with pgvector extension
Ollama for local LLM inference
1. Clone the Repository
git clone https://github.com/Mansoryq/Capestone.git
cd Capestone
2. Install Ollama and Models
# Install Ollama (macOS)
brew install ollama
# Start Ollama service
ollama serve
# Pull required models (in another terminal)
ollama pull llama3:8b
ollama pull mistral:latest
3. Set Up PostgreSQL with pgvector
# Using Docker (recommended)
docker run -d --name pgvector \
-e POSTGRES_PASSWORD=postgres \
-e POSTGRES_DB=energy_ai \
-p 5432:5432 \
ankane/pgvector
# Create extension
psql -h localhost -U postgres -d energy_ai -c "CREATE EXTENSION IF NOT EXISTS vector;"
4. Set Up the Backend
cd backend
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
5. Set Up the Frontend
cd Global_Capstone_Frontend
# Install dependencies
npm install
# or
bun install
Option 1: Fast Mode (Recommended for Development)
# Terminal 1: Backend
cd backend
./start_fast.sh
# or manually:
# FAST_MODE=true python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload
# Terminal 2: Frontend
cd Global_Capstone_Frontend
npm run dev -- --port 8082
Option 2: Quality Mode (Better Translations)
# Terminal 1: Backend
cd backend
./start_quality.sh
# Terminal 2: Frontend
cd Global_Capstone_Frontend
npm run dev -- --port 8082
Document Ingestion Pipeline
| Format | Extension | Parser | Features |
| --- | --- | --- | --- |
| PDF | .pdf | PyMuPDF (fitz) | Text, tables, images, OCR |
| Word | .docx | python-docx | Text, tables, formatting |
| Excel | .xlsx | openpyxl | Sheets, formulas, data |
| CSV | .csv | pandas | Structured data |
| Text | .txt | native | Plain text |
| Markdown | .md | native | Formatted text |
| JSON | .json | native | Structured data |
File Validation - Check file extension and size
Content Extraction - Parse text from document
Text Preprocessing - Normalize and clean text
Chunking - Split into 512-token segments
Embedding Generation - Create 1024-dim vectors
Vector Storage - Store in PostgreSQL with pgvector
Metadata Registration - Track document info
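The first step, file validation, could look something like the sketch below. The extension set comes from the supported-formats table in this README. The 50 MB limit and the `validate_upload` name are assumptions made for illustration, since no size limit is stated here.

```python
from pathlib import Path

# Extensions taken from the supported-formats table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".csv", ".txt", ".md", ".json"}

# ASSUMPTION: the README does not state a size limit; 50 MB is a placeholder.
MAX_BYTES = 50 * 1024 * 1024


def validate_upload(filename: str, size_bytes: int, max_bytes: int = MAX_BYTES) -> None:
    """Raise ValueError if the file type is unsupported or the size is out of range."""
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {extension or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("File is empty")
    if size_bytes > max_bytes:
        raise ValueError(f"File exceeds the {max_bytes} byte limit")
```

Rejecting a file before parsing keeps the more expensive extraction and embedding steps from running on input that cannot be ingested anyway.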
| Parameter | Value | Rationale |
| --- | --- | --- |
| Chunk Size | 512 tokens | Optimal for E5 model context |
| Chunk Overlap | 50 tokens | Preserves context at boundaries |
| Min Chunk Size | 100 tokens | Avoids fragmentary chunks |
| Separator | Sentence boundaries | Semantic coherence |
| Metric | Target | Measurement |
| --- | --- | --- |
| Avg Chunk Size | 450-512 tokens | Mean token count |
| Size Variance | < 20% | Standard deviation |
| Semantic Coherence | > 0.7 | Sentence boundary alignment |
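A sentence-aware chunker with these parameters might be sketched as below. It is an illustration under stated assumptions rather than the project's implementation: whitespace-separated words stand in for model tokens, sentences are split with a simple punctuation regex, and a trailing fragment shorter than the minimum size is folded into the previous chunk (which can push that chunk slightly past the size limit). The `chunk_text` name is hypothetical.

```python
import re


def chunk_text(text, chunk_size=512, overlap=50, min_size=100):
    """Pack whole sentences into overlapping chunks of at most chunk_size tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []    # finished chunks, each a list of tokens
    current = []   # tokens of the chunk being built
    carried = 0    # leading tokens of `current` repeated from the previous chunk

    def flush(size):
        """Emit the first `size` tokens and keep the overlap tail for the next chunk."""
        nonlocal current, carried
        chunks.append(current[:size])
        carried = min(overlap, size)
        current = current[size - carried:]

    for sentence in sentences:
        tokens = sentence.split()  # ASSUMPTION: whitespace tokens stand in for model tokens
        # Close the chunk at a sentence boundary if this sentence would overflow it.
        if len(current) > carried and len(current) + len(tokens) > chunk_size:
            flush(len(current))
        current.extend(tokens)
        # A sentence that still does not fit is split mid-sentence.
        while len(current) > chunk_size:
            flush(chunk_size)

    fresh = current[carried:]  # tokens not yet present in any finished chunk
    if fresh:
        if chunks and len(current) < min_size:
            chunks[-1].extend(fresh)  # fold a short tail into the previous chunk
        else:
            chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]
```

In a real pipeline the token counts would come from the embedding model's tokenizer, because the 512-token limit applies to model tokens rather than words.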
Model: intfloat/e5-large-v2
| Attribute | Value |
| --- | --- |
| Dimensions | 1024 |
| Max Sequence | 512 tokens |
| Parameters | 335M |
| License | MIT |
| Benchmark (MTEB) | 63.3% avg |
# For documents/passages
prefixed_text = f"passage: {text}"
# For queries
prefixed_query = f"query: {text}"
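To show where the prefixes fit, here is a small sketch that wraps them in helpers and passes the result to sentence-transformers. The helper names (`as_passage`, `as_query`, `embed`) are hypothetical. `SentenceTransformer(...).encode(..., normalize_embeddings=True)` is the library's standard call, but whether the project normalizes its embeddings is an assumption.

```python
def as_passage(text: str) -> str:
    """Prefix a document chunk the way E5 expects for indexed content."""
    return f"passage: {text.strip()}"


def as_query(text: str) -> str:
    """Prefix a user question the way E5 expects for search queries."""
    return f"query: {text.strip()}"


def embed(texts: list[str], model_name: str = "intfloat/e5-large-v2"):
    """Encode already-prefixed texts into 1024-dimensional vectors."""
    # Imported lazily so the prefix helpers work without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # ASSUMPTION: unit-length vectors, which suits cosine distance search.
    return model.encode(texts, normalize_embeddings=True)
```

Typical use would be `embed([as_passage(chunk) for chunk in chunks])` at ingestion time and `embed([as_query(question)])` at query time. Using the wrong prefix on either side tends to degrade retrieval quality.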
Vector Storage & Retrieval
-- Documents table with vector column
CREATE TABLE public.documents (
    id SERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    embedding vector(1024),
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);
-- IVFFlat index for fast similarity search
CREATE INDEX ON public.documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
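A similarity query against this table could be assembled as in the sketch below. It uses pgvector's cosine distance operator `<=>`, which matches the `vector_cosine_ops` index. The helper names, the named-parameter style (as accepted by psycopg), and the defaults of 5 results and a 0.35 distance cutoff are assumptions for illustration.

```python
def to_vector_literal(embedding) -> str:
    """Format a sequence of numbers as a pgvector text literal such as '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# `<=>` is pgvector's cosine distance operator (0 means identical direction).
SEARCH_SQL = """
SELECT id, text, metadata, embedding <=> %(q)s::vector AS distance
FROM public.documents
WHERE embedding <=> %(q)s::vector <= %(max_distance)s
ORDER BY embedding <=> %(q)s::vector
LIMIT %(top_k)s
"""


def search_params(query_embedding, top_k: int = 5, max_distance: float = 0.35) -> dict:
    """Build the parameter dict for SEARCH_SQL (ASSUMED defaults, not project settings)."""
    return {
        "q": to_vector_literal(query_embedding),
        "top_k": top_k,
        "max_distance": max_distance,
    }

# With a psycopg cursor this would run as:
#     cursor.execute(SEARCH_SQL, search_params(query_vector))
```

Ordering directly by the distance expression and adding a `LIMIT` is the query shape that lets PostgreSQL use the IVFFlat index.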
Retrieval Quality Thresholds
| Quality Tier | Distance Range | Classification |
| --- | --- | --- |
| Excellent | < 0.15 | Highly relevant |
| Good | 0.15 - 0.25 | Relevant |
| Acceptable | 0.25 - 0.35 | Marginally relevant |
| Filtered | > 0.35 | Excluded |
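The thresholds translate into a simple classifier, sketched below. The table does not say which tier a boundary value belongs to, so the choice here (0.15 counts as Good, 0.25 as Acceptable, 0.35 as Acceptable) is an assumption, as are the function names.

```python
def quality_tier(distance: float) -> str:
    """Map a cosine distance to the quality tier from the thresholds table."""
    if distance < 0.15:
        return "Excellent"
    if distance < 0.25:
        return "Good"
    if distance <= 0.35:  # ASSUMPTION: exactly 0.35 is still kept
        return "Acceptable"
    return "Filtered"


def keep_relevant(distances: list[float]) -> list[float]:
    """Drop every distance that falls into the Filtered tier."""
    return [d for d in distances if quality_tier(d) != "Filtered"]
```

For example, `keep_relevant([0.1, 0.3, 0.4])` keeps the first two distances and discards the last.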
LLM Reasoning & Response Generation
| Mode | Reasoning | Translation | Avg Latency |
| --- | --- | --- | --- |
| Fast | LLaMA 3 8B | LLaMA 3 8B | 30-60s |
| Quality | LLaMA 3 8B | Mistral 7B | 40-90s |
Security Analysis - Check query and document sensitivity
Vector Retrieval - Find relevant chunks
Chunk Filtering - Apply security and quality filters
Context Assembly - Build prompt with sources
LLM Reasoning - Generate answer
Translation - Convert to target language (if needed)
Metrics Calculation - Compute quality scores
| Level | Value | Description | Example Content |
| --- | --- | --- | --- |
| PUBLIC | 1 | Open access | General documentation |
| INTERNAL | 2 | Organization only | Internal processes |
| CONFIDENTIAL | 3 | Restricted | Financial data |
| RESTRICTED | 4 | Highly restricted | Personal data (SSN) |
| TOP_SECRET | 5 | Maximum security | Critical infrastructure |
Query Analysis → Document Analysis → Effective Level = MAX(query, document)
If user clearance < effective level → Access Denied
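The access rule above maps naturally onto an ordered enum. The sketch below is one way to express it. The `Clearance` class and `check_access` function are hypothetical names, and the returned dictionary only mirrors the shape of the `security` object shown in the API response example.

```python
from enum import IntEnum


class Clearance(IntEnum):
    """Clearance levels with the numeric values from the table above."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4
    TOP_SECRET = 5


def check_access(user: Clearance, query_level: Clearance, document_level: Clearance) -> dict:
    """Apply Effective Level = MAX(query, document) and compare it with the user's clearance."""
    effective = max(query_level, document_level)
    return {"level": effective.name, "access_allowed": user >= effective}
```

A CONFIDENTIAL user asking a PUBLIC question about a CONFIDENTIAL document is allowed, while the same user is denied if any retrieved content is RESTRICTED.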
| Language | Code | Translation | Response |
| --- | --- | --- | --- |
| English | en | Not needed | Native |
| Korean | ko | Query → EN, Response → KO | Full support |
| Vietnamese | vi | Query → EN, Response → VI | Full support |
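Combining this table with the Fast and Quality modes, the translation routing could be sketched as follows. The function name and the returned keys are hypothetical. The model tags `llama3:8b` and `mistral:latest` are the ones pulled in the installation steps, and their assignment to modes follows the mode table in this README.

```python
SUPPORTED_LANGUAGES = {"en", "ko", "vi"}


def translation_plan(language: str, fast_mode: bool) -> dict:
    """Decide which translation steps run and which model performs them."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {language}")
    needs_translation = language != "en"
    model = None
    if needs_translation:
        # Fast mode reuses LLaMA 3 for translation; Quality mode hands it to Mistral.
        model = "llama3:8b" if fast_mode else "mistral:latest"
    return {
        "translate_query_to_en": needs_translation,
        "translate_response_to": language if needs_translation else None,
        "translation_model": model,
    }
```

English queries skip both translation steps, which is one reason they sit at the low end of the latency ranges.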
| Metric | Formula | Target | Description |
| --- | --- | --- | --- |
| Accuracy | 100 - (avg_distance × 40) | > 90% | How close chunks are to query |
| Precision | 85 + weighted_quality | > 90% | Quality tier distribution |
| Efficiency | 100 - (time/3.0 × 10) | > 90% | Retrieval speed |
| Throughput | 90 + (chunks/sec × 2) | > 90% | Processing rate |
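Three of these formulas can be written out directly, as in the sketch below. Two things are assumptions: that `time` is the retrieval time in seconds, and that scores are clamped to the 0-100 range (the Throughput formula otherwise exceeds 100 easily). Precision is left out because `weighted_quality` is not defined in this README.

```python
def clamp(value: float, low: float = 0.0, high: float = 100.0) -> float:
    """Keep a score inside [low, high]. ASSUMPTION: the README does not mention clamping."""
    return max(low, min(high, value))


def accuracy(avg_distance: float) -> float:
    """Accuracy = 100 - (avg_distance x 40)."""
    return clamp(100 - avg_distance * 40)


def efficiency(retrieval_seconds: float) -> float:
    """Efficiency = 100 - (time / 3.0 x 10), with time ASSUMED to be in seconds."""
    return clamp(100 - (retrieval_seconds / 3.0) * 10)


def throughput(chunks: int, seconds: float) -> float:
    """Throughput = 90 + (chunks per second x 2)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return clamp(90 + (chunks / seconds) * 2)
```

As a consistency check, an average distance of 0.1875 gives an accuracy of 92.5, the value shown in the sample API response.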
| Stage | Target |
| --- | --- |
| Query Embedding | < 100ms |
| Vector Search | < 500ms |
| Security Check | < 50ms |
| LLM Reasoning | < 60s |
| Translation | < 30s |
| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /status | System health status |
| GET | /config | System configuration |
| POST | /query | Submit RAG query |
| POST | /ingest | Upload document |
| GET | /documents | List all documents |
| DELETE | /documents/{id} | Delete document |
| GET | /query/history | Get query history |
| POST | /security/auto-detect | Detect document security |
| GET | /stats | Data statistics |
Request:
POST /query
{
  "query": "What is the power plant capacity?",
  "language": "en",
  "security_clearance": "CONFIDENTIAL",
  "document_ids": ["doc_123"],
  "fast_mode": true
}
Response:
{
  "answer": "The power plant has a capacity of 500 MW...",
  "sources": ["power_plant_data.pdf"],
  "retrieval_time_ms": 245,
  "generation_time_ms": 32000,
  "fast_mode": true,
  "model_used": "llama3:8b",
  "security": {
    "level": "CONFIDENTIAL",
    "access_allowed": true
  },
  "chunks_used": 5,
  "metrics": {
    "accuracy": 92.5,
    "precision": 95.0
  }
}
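For reference, a client call matching the request above might look like this sketch, which uses only the Python standard library. The helper names, the defaults, and the decision to omit `document_ids` when no filter is given are assumptions. The base URL reflects the backend port used elsewhere in this README, and the generous timeout reflects the 30-90s query latencies.

```python
import json
import urllib.request


def build_query_payload(query, language="en", security_clearance="PUBLIC",
                        document_ids=None, fast_mode=True) -> dict:
    """Assemble the JSON body for POST /query."""
    payload = {
        "query": query,
        "language": language,
        "security_clearance": security_clearance,
        "fast_mode": fast_mode,
    }
    if document_ids:
        # ASSUMPTION: the field may be left out when no document filter is wanted.
        payload["document_ids"] = list(document_ids)
    return payload


def post_query(payload: dict, base_url: str = "http://localhost:8000", timeout: float = 120.0) -> dict:
    """Send the payload to the backend and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, `post_query(build_query_payload("What is the power plant capacity?", security_clearance="CONFIDENTIAL"))` would return a dictionary shaped like the response above.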
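Models are pre-loaded at startup so the first query is faster:
import requests

def warmup_models():
    """Pre-load the model at startup so the first query avoids the load delay."""
    # A one-token generation is enough to make Ollama load the model into memory.
    requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3:8b",
            "prompt": "Hello",
            "options": {"num_predict": 1},
        },
        timeout=300,  # loading a multi-gigabyte model on CPU can take a while
    )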
CREATE INDEX idx_documents_doc_id ON public.documents ((metadata->>'doc_id'));
CREATE INDEX idx_documents_source ON public.documents ((metadata->>'source'));
capestone/
├── backend/
│   ├── main.py                  # FastAPI application
│   ├── mistral_rag.py           # RAG orchestrator
│   ├── document_ingest.py       # Document processing
│   ├── embed_e5.py              # Embedding generation
│   ├── retrieve_pgvector.py     # Vector retrieval
│   ├── security_mapping.py      # Security framework
│   ├── requirements.txt         # Python dependencies
│   ├── start_fast.sh            # Fast mode startup
│   ├── start_quality.sh         # Quality mode startup
│   └── data/
│       ├── documents_registry.json
│       ├── query_history.json
│       └── uploads/
│
├── Global_Capstone_Frontend/
│   ├── src/
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx
│   │   │   ├── QueryResponse.tsx
│   │   │   └── DocumentIngestion.tsx
│   │   ├── components/
│   │   ├── services/
│   │   │   └── api.ts
│   │   └── lib/
│   ├── package.json
│   └── vite.config.ts
│
├── README.md
├── FEATURES.md
└── COMPLIANCE.md
| Issue | Solution |
| --- | --- |
| "No relevant information found" | Raise the max_distance threshold so more chunks pass the filter; confirm the document was ingested |
| Slow response times | Use Fast mode, reduce top_k, check CPU load |
| Security access denied | Increase user clearance, check the document's security level |
| Model not responding | Restart Ollama, check that the model is pulled |
| Database connection error | Verify PostgreSQL is running |
| Frontend not loading | Check that the backend is running on port 8000 |
# Check Ollama models
ollama list
# Check PostgreSQL connection
psql -h localhost -U postgres -d energy_ai -c "SELECT COUNT(*) FROM documents;"
# Restart backend (start it even if no uvicorn process was running)
cd backend && pkill -f "uvicorn main:app"; ./start_fast.sh
# Clear query history
curl -X DELETE http://localhost:8000/query/history
Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open a Pull Request
This project is built entirely on open-source technologies. See COMPLIANCE.md for full license details.
Abylay Turganbekov (Co-Leader)
Harishik Dev Singh (Team Leader)
Aikanym Baisalova
Zhangali Otegaliev
Alvin.K
μ€λ―Όν
Document Version: 1.0.0
Last Updated: January 2026
Note: This system is designed for CPU inference. For faster performance, consider using a GPU with a CUDA-enabled Ollama installation.