Intellecta RAG System

A production-grade, offline-capable Retrieval-Augmented Generation (RAG) system designed for secure, on-premise document analysis and intelligent question-answering. It is built entirely on open-source technologies, so organizations can deploy AI-powered document intelligence without cloud dependencies.

📋 Table of Contents

Executive Summary
System Architecture
Technology Stack
Features
Installation
Running the Application
Document Ingestion Pipeline
Chunking Strategy
Embedding Generation
Vector Storage & Retrieval
LLM Reasoning & Response Generation
Security Framework
Multi-Language Support
Evaluation Metrics
API Documentation
Performance Optimization
Project Structure
Troubleshooting

Executive Summary

Intellecta is a production-grade, offline-capable Retrieval-Augmented Generation (RAG) system designed for secure, on-premise document analysis and intelligent question-answering. Built entirely on open-source technologies, it enables organizations to deploy AI-powered document intelligence without cloud dependencies, ensuring data sovereignty and compliance with air-gapped security requirements.

Key Capabilities

Capability	Description
Document Intelligence	Process PDF, DOCX, Excel, CSV, TXT, Markdown, and JSON files
Semantic Search	Find relevant information using AI embeddings
AI-Powered Q&A	Get intelligent answers grounded in your documents
Security Controls	5-level security clearance system
Multi-Language	English, Korean, Vietnamese support
Offline Operation	No cloud dependencies, air-gapped ready

System Architecture

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              USER INTERFACE                                 │
│                         (React + TypeScript + Vite)                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │
│  │  Dashboard  │  │Query/Response│ │Doc Ingestion│  │  History & Logs     │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              REST API LAYER                                 │
│                            (FastAPI + Python)                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │
│  │ /query      │  │ /ingest     │  │ /documents  │  │/security/auto-detect│ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                    ┌─────────────────┼─────────────────┐
                    ▼                 ▼                 ▼
┌───────────────────────┐ ┌───────────────────┐ ┌───────────────────────────┐
│   INGESTION PIPELINE  │ │  RAG ORCHESTRATOR │ │   SECURITY FRAMEWORK      │
│  ┌─────────────────┐  │ │ ┌───────────────┐ │ │  ┌─────────────────────┐  │
│  │ Document Parser │  │ │ │Query Embedding│ │ │  │ Pattern Detection   │  │
│  │ (PDF,CSV,DOCX)  │  │ │ └───────────────┘ │ │  │ (SSN, Salary, etc.) │  │
│  └─────────────────┘  │ │        │          │ │  └─────────────────────┘  │
│          │            │ │        ▼          │ │            │              │
│          ▼            │ │ ┌───────────────┐ │ │            ▼              │
│  ┌─────────────────┐  │ │ │Vector Search  │ │ │  ┌─────────────────────┐  │
│  │ Text Chunking   │  │ │ │  (pgvector)   │ │ │  │ Clearance Levels    │  │
│  │ (512 tokens)    │  │ │ └───────────────┘ │ │  │ (PUBLIC→TOP_SECRET) │  │
│  └─────────────────┘  │ │        │          │ │  └─────────────────────┘  │
│          │            │ │        ▼          │ │                           │
│          ▼            │ │ ┌───────────────┐ │ └───────────────────────────┘
│  ┌─────────────────┐  │ │ │Context Build  │ │
│  │ E5 Embedding    │  │ │ └───────────────┘ │
│  │ (1024-dim)      │  │ │        │          │
│  └─────────────────┘  │ │        ▼          │
│          │            │ │ ┌───────────────┐ │
│          ▼            │ │ │ LLM Reasoning │ │
│  ┌─────────────────┐  │ │ │ (LLaMA 3 8B)  │ │
│  │ Vector Storage  │  │ │ └───────────────┘ │
│  │   (pgvector)    │  │ │        │          │
│  └─────────────────┘  │ │        ▼          │
└───────────────────────┘ │ ┌───────────────┐ │
                          │ │ Translation   │ │
                          │ │ (Mistral 7B)  │ │
                          │ └───────────────┘ │
                          └───────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              DATA LAYER                                     │
│  ┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐  │
│  │  PostgreSQL +       │  │  Document Registry  │  │   Query History     │  │
│  │  pgvector           │  │  (JSON)             │  │   (JSON)            │  │
│  └─────────────────────┘  └─────────────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           LLM INFERENCE LAYER                               │
│                              (Ollama Runtime)                               │
│  ┌─────────────────────────────┐  ┌─────────────────────────────────────┐   │
│  │  LLaMA 3 8B (4.6 GB)        │  │  Mistral 7B (4.1 GB)                │   │
│  │  - Reasoning                │  │  - Translation (Quality Mode)       │   │
│  │  - Answer Generation        │  │  - Refinement                       │   │
│  └─────────────────────────────┘  └─────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

Data Flow

User Query → Embedding → Vector Search → Context Assembly → LLM Reasoning → Response
     │            │            │                │                │            │
     └── Security Check ───────┴──── Chunk Filtering ────────────┴── Translation

Technology Stack

Backend Technologies

Component	Technology	Version	Purpose
Framework	FastAPI	0.104+	REST API, async support
Language	Python	3.11+	Core programming
Database	PostgreSQL	15+	Relational storage
Vector DB	pgvector	0.5+	Similarity search
LLM Runtime	Ollama	0.1+	Local model inference
Embeddings	sentence-transformers	2.2+	Text embeddings

Frontend Technologies

Component	Technology	Version	Purpose
Framework	React	18+	UI components
Build Tool	Vite	5+	Fast development
Language	TypeScript	5+	Type safety
Styling	Tailwind CSS	3+	Utility-first CSS
Components	shadcn/ui	latest	UI component library
Charts	Recharts	2+	Data visualization

AI/ML Models

Model	Parameters	Size	License	Purpose
LLaMA 3 8B	8 Billion	4.6 GB	Meta Llama 3 Community License	Reasoning, Generation
Mistral 7B	7 Billion	4.1 GB	Apache 2.0	Translation, Refinement
E5-large-v2	335 Million	1.3 GB	MIT	Text Embeddings

Features

🔄 Dual LLM Mode Switcher

Toggle between Fast and Quality modes directly from the UI:

⚡ Fast Mode: Uses LLaMA 3 8B for all tasks (~30-60s per query)
🔬 Quality Mode: Uses LLaMA 3 8B + Mistral 7B for better translations (~40-90s per query)

🔐 Dual Security Checking

Security is enforced at two levels:

Query Analysis: Scans query text for sensitive keywords
Document Analysis: Scans retrieved content for sensitive patterns
Effective Level: Uses the HIGHER of query or document security

🌐 Multi-Language Support

English 🇺🇸 - Native support
Korean 🇰🇷 - Full translation pipeline
Vietnamese 🇻🇳 - Full translation pipeline

📊 Real-Time Metrics

Accuracy, Precision, Efficiency, Throughput scores
High-quality chunk ratio
Retrieval and generation timing

📄 Document Selection

Filter queries to specific documents
Multi-select document picker
Auto-detect security level from content

📜 Query History

Persistent history with timestamps
Replay previous queries
Delete individual entries

📈 Dashboard Analytics

System status monitoring
Performance charts
Document statistics
Downloadable reports (Markdown format)

Installation

Prerequisites

Python 3.11+
Node.js 18+
PostgreSQL 15+ with pgvector extension
Ollama for local LLM inference

1. Clone Repository

git clone https://github.com/Mansoryq/Capestone.git
cd Capestone

2. Install Ollama and Models

# Install Ollama (macOS)
brew install ollama

# Start Ollama service
ollama serve

# Pull required models (in another terminal)
ollama pull llama3:8b
ollama pull mistral:latest

3. Setup PostgreSQL with pgvector

# Using Docker (recommended)
docker run -d --name pgvector \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=energy_ai \
  -p 5432:5432 \
  ankane/pgvector

# Create extension
psql -h localhost -U postgres -d energy_ai -c "CREATE EXTENSION IF NOT EXISTS vector;"

4. Setup Backend

cd backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

5. Setup Frontend

cd Global_Capstone_Frontend

# Install dependencies
npm install
# or
bun install

Running the Application

Option 1: Fast Mode (Recommended for Development)

# Terminal 1: Backend
cd backend
./start_fast.sh
# or manually:
# FAST_MODE=true python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Terminal 2: Frontend
cd Global_Capstone_Frontend
npm run dev -- --port 8082

Option 2: Quality Mode (Better Translations)

# Terminal 1: Backend
cd backend
./start_quality.sh

# Terminal 2: Frontend
cd Global_Capstone_Frontend
npm run dev -- --port 8082

Access Points

Frontend: http://localhost:8082
Backend API: http://localhost:8000
API Docs: http://localhost:8000/docs

Document Ingestion Pipeline

Supported File Formats

Format	Extension	Parser	Features
PDF	.pdf	PyMuPDF (fitz)	Text, tables, images, OCR
Word	.docx	python-docx	Text, tables, formatting
Excel	.xlsx	openpyxl	Sheets, formulas, data
CSV	.csv	pandas	Structured data
Text	.txt	native	Plain text
Markdown	.md	native	Formatted text
JSON	.json	native	Structured data

Ingestion Process

1. File Validation - Check file extension and size
2. Content Extraction - Parse text from document
3. Text Preprocessing - Normalize and clean text
4. Chunking - Split into 512-token segments
5. Embedding Generation - Create 1024-dim vectors
6. Vector Storage - Store in PostgreSQL with pgvector
7. Metadata Registration - Track document info
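
As an illustration of the file validation step, here is a minimal sketch. The extension set comes from the Supported File Formats table; the 50 MB limit and the name `validate_upload` are assumptions made for this example, not values taken from the project.

```python
from pathlib import Path

# Extensions taken from the "Supported File Formats" table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".csv", ".txt", ".md", ".json"}

# Hypothetical limit; the project's real limit is not documented here.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def validate_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (ok, reason) for an incoming file, checking extension, then size."""
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return False, f"unsupported extension: {extension or '(none)'}"
    if size_bytes <= 0:
        return False, "file is empty"
    if size_bytes > MAX_FILE_SIZE_BYTES:
        return False, "file exceeds size limit"
    return True, "ok"
```

Rejecting a file before parsing keeps unsupported or oversized uploads away from the heavier extraction and embedding steps.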

Chunking Strategy

Configuration

Parameter	Value	Rationale
Chunk Size	512 tokens	Matches the E5 maximum sequence length (512 tokens)
Chunk Overlap	50 tokens	Preserves context at boundaries
Min Chunk Size	100 tokens	Avoids fragmentary chunks
Separator	Sentence boundaries	Semantic coherence
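
The configuration above could be applied roughly as in the sketch below. It is an illustration under stated assumptions, not the project's chunker: tokens are approximated by whitespace splitting instead of the E5 tokenizer, sentences are split with a simple regular expression, and the names `count_tokens` and `chunk_text` are hypothetical.

```python
import re


def count_tokens(text: str) -> int:
    """Approximate the token count by whitespace splitting (assumption)."""
    return len(text.split())


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50,
               min_size: int = 100) -> list[str]:
    """Pack whole sentences into chunks of at most `chunk_size` tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []  # sentences in the chunk being built
    carried = 0              # leading sentences of `current` that are overlap
    for sentence in sentences:
        if current and count_tokens(" ".join(current)) + count_tokens(sentence) > chunk_size:
            chunks.append(" ".join(current))
            # Seed the next chunk with trailing sentences, up to `overlap` tokens.
            tail: list[str] = []
            for previous in reversed(current):
                if count_tokens(" ".join(tail)) + count_tokens(previous) > overlap:
                    break
                tail.insert(0, previous)
            current, carried = tail, len(tail)
        current.append(sentence)
    fresh = current[carried:]  # sentences not yet emitted in any chunk
    if fresh:
        if chunks and count_tokens(" ".join(current)) < min_size:
            # Final chunk is too small: merge its new sentences into the previous one.
            chunks[-1] += " " + " ".join(fresh)
        else:
            chunks.append(" ".join(current))
    return chunks
```

Two edge cases are left unhandled in this sketch: a single sentence longer than `chunk_size` becomes an oversized chunk of its own, and merging a short final chunk can push the previous chunk past `chunk_size`.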

Quality Metrics

Metric	Target	Measurement
Avg Chunk Size	450-512 tokens	Mean token count
Size Variance	< 20%	Standard deviation
Semantic Coherence	> 0.7	Sentence boundary alignment

Embedding Generation

Model: intfloat/e5-large-v2

Attribute	Value
Dimensions	1024
Max Sequence	512 tokens
Parameters	335M
License	MIT
Benchmark (MTEB)	63.3% avg

E5 Prefix Convention

# For documents/passages
prefixed_text = f"passage: {text}"

# For queries
prefixed_query = f"query: {text}"
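
A slightly fuller sketch of the same convention follows. The helper names (`as_passage`, `as_query`, `embed`) are hypothetical, and stripping whitespace is a choice made for this example. The `embed` function assumes the sentence-transformers package is installed and imports it lazily, so the prefix helpers work without it.

```python
from functools import lru_cache


def as_passage(text: str) -> str:
    """Prefix a document chunk the way E5 expects for indexed passages."""
    return f"passage: {text.strip()}"


def as_query(text: str) -> str:
    """Prefix a user question the way E5 expects for search queries."""
    return f"query: {text.strip()}"


@lru_cache(maxsize=1)
def _load_model():
    # Lazy import: only needed when embeddings are actually computed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("intfloat/e5-large-v2")


def embed(prefixed_texts: list[str]):
    """Encode already-prefixed texts into unit-length 1024-dim vectors."""
    return _load_model().encode(prefixed_texts, normalize_embeddings=True)
```

Mixing up the two prefixes (or omitting them) tends to degrade retrieval quality, so it can help to keep them behind two small functions like these.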

Vector Storage & Retrieval

PostgreSQL + pgvector

-- Documents table with vector column
CREATE TABLE public.documents (
    id SERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    embedding vector(1024),
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

-- IVFFlat index for fast similarity search
CREATE INDEX ON public.documents 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
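
A similarity query against this table might look like the sketch below. It assumes a DB-API connection such as one from psycopg2; the names `to_vector_literal`, `SEARCH_SQL`, and `search` are hypothetical. The `<=>` operator is pgvector's cosine distance, which matches the `vector_cosine_ops` index above.

```python
def to_vector_literal(embedding) -> str:
    """Format floats in pgvector's text form, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Ordering directly by the distance expression lets the IVFFlat index be used.
SEARCH_SQL = """
SELECT id, text, metadata, embedding <=> %s::vector AS distance
FROM public.documents
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def search(conn, query_embedding, top_k: int = 5):
    """Return the top_k nearest rows as (id, text, metadata, distance) tuples."""
    literal = to_vector_literal(query_embedding)
    with conn.cursor() as cur:
        cur.execute(SEARCH_SQL, (literal, literal, top_k))
        return cur.fetchall()
```

IVFFlat is an approximate index, so recall depends on how many lists are probed per query; pgvector exposes this through the `ivfflat.probes` setting.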

Retrieval Quality Thresholds

Quality Tier	Distance Range	Classification
Excellent	< 0.15	Highly relevant
Good	0.15 - 0.25	Relevant
Acceptable	0.25 - 0.35	Marginally relevant
Filtered	> 0.35	Excluded
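
The tiers above can be expressed as a small classifier. This is a sketch: the table does not say which tier a boundary value belongs to, so the choices here (0.15 counts as Good, 0.35 as Acceptable) are assumptions, as are the function names and the `distance` key on each chunk.

```python
def classify_distance(distance: float) -> str:
    """Map a cosine distance to the quality tier from the table above."""
    if distance < 0.15:
        return "excellent"
    if distance < 0.25:
        return "good"
    if distance <= 0.35:
        return "acceptable"
    return "filtered"


def filter_chunks(chunks: list[dict], max_distance: float = 0.35) -> list[dict]:
    """Drop chunks whose distance exceeds the cutoff."""
    return [chunk for chunk in chunks if chunk["distance"] <= max_distance]
```

For example, `filter_chunks([{"distance": 0.2}, {"distance": 0.4}])` keeps only the first chunk.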

LLM Reasoning & Response Generation

Dual-Mode Architecture

Mode	Reasoning	Translation	Avg Latency
⚡ Fast	LLaMA 3 8B	LLaMA 3 8B	30-60s
🔬 Quality	LLaMA 3 8B	Mistral 7B	40-90s
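
The mode table reduces to a simple model lookup, sketched below. The model tags match the `ollama pull` commands in the installation steps and `FAST_MODE` is the environment variable used by the manual start command; the function names and the accepted truthy strings are assumptions.

```python
import os

REASONING_MODEL = "llama3:8b"
QUALITY_TRANSLATION_MODEL = "mistral:latest"


def select_models(fast_mode: bool) -> dict[str, str]:
    """Pick the reasoning and translation models for the chosen mode."""
    translation = REASONING_MODEL if fast_mode else QUALITY_TRANSLATION_MODEL
    return {"reasoning": REASONING_MODEL, "translation": translation}


def fast_mode_from_env(default: bool = True) -> bool:
    """Read FAST_MODE from the environment; the accepted values are an assumption."""
    value = os.environ.get("FAST_MODE")
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes"}
```

Fast mode keeps a single model loaded, which is presumably where the latency saving comes from; Quality mode adds a second model for translation.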

RAG Pipeline Steps

1. Security Analysis - Check query and document sensitivity
2. Vector Retrieval - Find relevant chunks
3. Chunk Filtering - Apply security and quality filters
4. Context Assembly - Build prompt with sources
5. LLM Reasoning - Generate answer
6. Translation - Convert to target language (if needed)
7. Metrics Calculation - Compute quality scores
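
To make the context assembly step concrete, here is one possible sketch. The prompt wording, the chunk shape (`source` and `text` keys), and the name `build_prompt` are all assumptions for illustration; the project's actual prompt is not shown in this document.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Build a grounded prompt that numbers each retrieved chunk as a source."""
    context_lines = [
        f"[{index}] ({chunk['source']}) {chunk['text']}"
        for index, chunk in enumerate(chunks, start=1)
    ]
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        "Context:\n" + "\n".join(context_lines) + "\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the sources in the prompt makes it straightforward to map citations in the answer back to document names such as those returned in the `sources` field.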

Security Framework

Security Levels

Level	Value	Description	Example Content
PUBLIC	1	Open access	General documentation
INTERNAL	2	Organization only	Internal processes
CONFIDENTIAL	3	Restricted	Financial data
RESTRICTED	4	Highly restricted	Personal data (SSN)
TOP_SECRET	5	Maximum security	Critical infrastructure

Dual Security Checking

Query Analysis → Document Analysis → Effective Level = MAX(query, document)

If user clearance < effective level → Access Denied
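
The rule above can be sketched with an integer enum whose values follow the Security Levels table. The function names are hypothetical, and how the query and document levels are detected is outside the scope of this example.

```python
from enum import IntEnum


class SecurityLevel(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4
    TOP_SECRET = 5


def effective_level(query_level: SecurityLevel,
                    document_levels: list[SecurityLevel]) -> SecurityLevel:
    """Return the highest level among the query and all retrieved documents."""
    return max([query_level, *document_levels])


def is_access_allowed(user_clearance: SecurityLevel,
                      query_level: SecurityLevel,
                      document_levels: list[SecurityLevel]) -> bool:
    """Deny access when the user's clearance is below the effective level."""
    return user_clearance >= effective_level(query_level, document_levels)
```

Taking the maximum means that a harmless-looking query cannot be used to read a sensitive document, and a sensitive query is gated even when the retrieved documents are public.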

Multi-Language Support

Supported Languages

Language	Code	Translation	Response
English	en	Not needed	Native
Korean	ko	Query → EN, Response → KO	Full support
Vietnamese	vi	Query → EN, Response → VI	Full support
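
The translation flow in the table can be sketched as a thin wrapper around an English-only RAG call. Both callables are injected so the example stays self-contained; their signatures and the name `answer_in_language` are assumptions, not the project's API.

```python
from typing import Callable

SUPPORTED_LANGUAGES = {"en", "ko", "vi"}


def answer_in_language(
    query: str,
    language: str,
    translate: Callable[[str, str, str], str],  # (text, source, target) -> text
    answer_english: Callable[[str], str],       # English query -> English answer
) -> str:
    """Translate the query to English, answer it, then translate the answer back."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    if language == "en":
        return answer_english(query)
    english_query = translate(query, language, "en")
    english_answer = answer_english(english_query)
    return translate(english_answer, "en", language)
```

Keeping retrieval and reasoning in English means only the two translation calls depend on which model the selected mode uses for translation.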

Evaluation Metrics

Retrieval Metrics

Metric	Formula	Target	Description
Accuracy	`100 - (avg_distance × 40)`	> 90%	How close chunks are to query
Precision	`85 + weighted_quality`	> 90%	Quality tier distribution
Efficiency	`100 - (time/3.0 × 10)`	> 90%	Retrieval speed
Throughput	`90 + (chunks/sec × 2)`	> 90%	Processing rate
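
Three of the formulas above translate directly into code, as sketched below. Precision is left out because `weighted_quality` is not defined in this document. The sketch assumes that time is measured in seconds and applies no capping to the 0-100 range, since the table does not say whether scores are clamped.

```python
def accuracy_score(avg_distance: float) -> float:
    """100 - (avg_distance x 40): a smaller average distance scores higher."""
    return 100 - avg_distance * 40


def efficiency_score(retrieval_seconds: float) -> float:
    """100 - (time / 3.0 x 10), with time assumed to be in seconds."""
    return 100 - (retrieval_seconds / 3.0) * 10


def throughput_score(chunks: int, seconds: float) -> float:
    """90 + (chunks per second x 2)."""
    return 90 + (chunks / seconds) * 2
```

As a worked example, an average distance of 0.1875 gives an accuracy of 100 - 7.5 = 92.5, and a 1.5 second retrieval gives an efficiency of 100 - 5 = 95.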

Latency Breakdown

Stage	Target
Query Embedding	< 100ms
Vector Search	< 500ms
Security Check	< 50ms
LLM Reasoning	< 60s
Translation	< 30s

API Documentation

Endpoints Overview

Method	Endpoint	Description
`GET`	`/status`	System health status
`GET`	`/config`	System configuration
`POST`	`/query`	Submit RAG query
`POST`	`/ingest`	Upload document
`GET`	`/documents`	List all documents
`DELETE`	`/documents/{id}`	Delete document
`GET`	`/query/history`	Get query history
`POST`	`/security/auto-detect`	Detect document security
`GET`	`/stats`	Data statistics

Query Endpoint

Request:

POST /query
{
  "query": "What is the power plant capacity?",
  "language": "en",
  "security_clearance": "CONFIDENTIAL",
  "document_ids": ["doc_123"],
  "fast_mode": true
}

Response:

{
  "answer": "The power plant has a capacity of 500 MW...",
  "sources": ["power_plant_data.pdf"],
  "retrieval_time_ms": 245,
  "generation_time_ms": 32000,
  "fast_mode": true,
  "model_used": "llama3:8b",
  "security": {
    "level": "CONFIDENTIAL",
    "access_allowed": true
  },
  "chunks_used": 5,
  "metrics": {
    "accuracy": 92.5,
    "precision": 95.0
  }
}

Performance Optimization

Model Warmup

Models are pre-loaded at startup for faster first query:

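import requests

def warmup_models():
    """Pre-load models at startup"""
    # A one-token, non-streaming request makes Ollama load the model into memory.
    requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3:8b",
        "prompt": "Hello",
        "stream": False,
        "options": {"num_predict": 1}
    }, timeout=300)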

Database Indexes

CREATE INDEX idx_documents_doc_id ON public.documents ((metadata->>'doc_id'));
CREATE INDEX idx_documents_source ON public.documents ((metadata->>'source'));

Project Structure

capestone/
├── backend/
│   ├── main.py                 # FastAPI application
│   ├── mistral_rag.py          # RAG orchestrator
│   ├── document_ingest.py      # Document processing
│   ├── embed_e5.py             # Embedding generation
│   ├── retrieve_pgvector.py    # Vector retrieval
│   ├── security_mapping.py     # Security framework
│   ├── requirements.txt        # Python dependencies
│   ├── start_fast.sh           # Fast mode startup
│   ├── start_quality.sh        # Quality mode startup
│   └── data/
│       ├── documents_registry.json
│       ├── query_history.json
│       └── uploads/
│
├── Global_Capstone_Frontend/
│   ├── src/
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx
│   │   │   ├── QueryResponse.tsx
│   │   │   └── DocumentIngestion.tsx
│   │   ├── components/
│   │   ├── services/
│   │   │   └── api.ts
│   │   └── lib/
│   ├── package.json
│   └── vite.config.ts
│
├── README.md
├── FEATURES.md
└── COMPLIANCE.md

Troubleshooting

Issue	Solution
"No relevant information found"	Raise the `max_distance` threshold (chunks above it are filtered out), check document ingestion
Slow response times	Use Fast mode, reduce `top_k`, check CPU load
Security access denied	Increase user clearance, check document security
Model not responding	Restart Ollama, check model is pulled
Database connection error	Verify PostgreSQL is running
Frontend not loading	Check if backend is running on port 8000

Common Commands

# Check Ollama models
ollama list

# Check PostgreSQL connection
psql -h localhost -U postgres -d energy_ai -c "SELECT COUNT(*) FROM documents;"

# Restart backend
pkill -f "uvicorn main:app"; cd backend && ./start_fast.sh

# Clear query history
curl -X DELETE http://localhost:8000/query/history

🤝 Contributing

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit changes (git commit -m 'Add amazing feature')
4. Push to branch (git push origin feature/amazing-feature)
5. Open a Pull Request

📄 License

This project is built entirely on open-source technologies. See COMPLIANCE.md for full license details.

👥 Authors

Abylay Turganbekov (Co-Leader)
Harishik Dev Singh (Team Leader)
Aikanym Baisalova
Zhangali Otegaliev
Alvin.K
오민혁

Document Version: 1.0.0
Last Updated: January 2026

Note: This system is designed for CPU inference. For faster performance, consider using a GPU with CUDA-enabled Ollama installation.
