A university project by Team DEEPCROP that builds a Legal LLM assistant for Sri Lanka. It answers questions about the Companies Act, the Inland Revenue Act, and labor laws, providing accurate citations and context.
We designed a Retrieval-Augmented Generation (RAG) system that extracts, indexes, and retrieves Sri Lankan legal knowledge from official documents. Our system goes beyond a basic RAG pipeline by adopting foundation model best practices for reliability, accuracy, and user experience.
- 📑 **State-of-the-art extraction with Docling:** extracts structured knowledge (titles, sections, content) from large PDFs.
- 🧠 **Vector DB with FAISS:** stores embeddings of Sri Lankan legal documents for fast, semantic retrieval.
- 🎯 **Query optimization:** before retrieval, user queries are rewritten into context-rich, expressive forms, improving accuracy.
- 💬 **Chat memory (LangChain):** keeps conversation history for natural, context-aware dialogues.
- 🤖 **Well-designed prompts:** role assignment, system prompts, and few-shot examples ensure consistent legal answers.
- 🔒 **Domain-bound RAG:** answers strictly within the legal context; if a query is out of scope, the assistant explains why.
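The domain-bound behavior can be approximated with a retrieval-score guard: if no indexed document is sufficiently similar to the query, the assistant refuses instead of answering. The sketch below is a simplified stand-in (keyword-overlap similarity in place of FAISS embedding search; the threshold and refusal message are illustrative assumptions, not the project's actual values):

```python
def overlap_score(query: str, doc: str) -> float:
    """Jaccard word overlap -- a stand-in for embedding similarity."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def answer_or_refuse(query: str, corpus: list[str], threshold: float = 0.1) -> str:
    """Refuse queries whose best retrieval score falls below the threshold."""
    best = max((overlap_score(query, doc) for doc in corpus), default=0.0)
    if best < threshold:
        return ("I can only answer questions about Sri Lankan legal documents "
                "(Companies Act, Inland Revenue Act, labor laws).")
    return "ANSWER_FROM_CONTEXT"  # placeholder for the real RAG generation step

corpus = ["duties of company directors under the companies act",
          "penalties for late filing under the inland revenue act"]
print(answer_or_refuse("What are the duties of company directors?", corpus))
print(answer_or_refuse("Best pasta recipe?", corpus))
```

In the real pipeline the score would come from FAISS distances over embeddings, but the refuse-below-threshold shape is the same.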
```
.
├── extractor/               # PDF → structured knowledge (Docling)
│   ├── companies_act.pdf
│   ├── inland_rev.pdf
│   ├── labor_laws.pdf
│   ├── extract_from_docs.ipynb / .py
│   ├── requirements.txt
│   └── scratch/
│
├── server/                  # FastAPI backend with LangChain + FAISS
│   ├── app/
│   │   ├── api.py           # API endpoints (chat, ingest)
│   │   ├── pipeline.py      # Query rewrite, RAG pipeline
│   │   ├── prompts.py       # System & query prompts
│   │   ├── vectorstore.py   # FAISS index builder/loader
│   │   └── schemas.py
│   ├── docs/                # Extracted legal documents (Markdown)
│   ├── data/faiss_index/    # Vector DB index files
│   ├── main.py              # FastAPI entrypoint
│   └── requirements.txt
│
├── frontend/                # React-based chat interface
│   ├── src/
│   │   ├── App.jsx          # Main frontend logic
│   │   ├── components/      # Chat UI (ChatInput, ChatMessage, etc.)
│   │   └── index.css
│   └── vite.config.js
│
└── README.md
```
- Backend: FastAPI, LangChain, FAISS, Google Generative AI
- Frontend: React + Vite
- Extraction: Docling (PDF → structured text)
- Vector DB: FAISS (semantic search)
- LLM: Google Gemini (via LangChain integration)
1. **Document Ingestion**
- Docling extracts structured knowledge (Acts, Sections, Subsections) from legal PDFs.
- Extracted text is chunked and stored in FAISS Vector DB with embeddings.
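The chunking step can be sketched as an overlapping sliding window over the extracted text; the chunk size and overlap below are illustrative assumptions, not the project's actual settings:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split extracted text into overlapping character windows.

    Overlap keeps context intact across chunk boundaries, which helps
    retrieval when a legal clause spans two chunks.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "Section 187: A director of a company must act in good faith. " * 40
chunks = chunk_text(doc, size=200, overlap=50)
print(len(chunks), len(chunks[0]))
```

Each chunk would then be embedded and written to the FAISS index alongside its source metadata (act, section) so citations can be reconstructed at answer time.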
2. **User Query → Optimized Query**
- User inputs a question (e.g., “What are the penalties for late tax filing?”).
- A query rewriting chain expands and optimizes it into more expressive legal queries.
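A rewrite step of this kind is typically a prompt template whose filled-in result is sent to the LLM. The wording below is a hypothetical sketch, not the project's actual prompt:

```python
REWRITE_TEMPLATE = """You are a Sri Lankan legal research assistant.
Rewrite the user's question into a precise, standalone legal query.
- Expand abbreviations and name the relevant Act where obvious.
- Preserve the user's intent; do not answer the question.

User question: {question}
Rewritten query:"""

def build_rewrite_prompt(question: str) -> str:
    """Fill the rewrite template; the result is sent to the LLM."""
    return REWRITE_TEMPLATE.format(question=question)

prompt = build_rewrite_prompt("What are the penalties for late tax filing?")
print(prompt)
```

In a LangChain setup this template would feed a chain whose output replaces the raw query before retrieval.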
3. **Retrieval + Generation**
- Optimized query retrieves relevant chunks from FAISS.
- LLM generates an answer strictly from context, with inline citations.
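Assembling the retrieved chunks into a citation-ready generation prompt could look like the sketch below; the chunk metadata fields (`act`, `section`) and the instruction wording are assumptions for illustration:

```python
def build_rag_prompt(query: str, chunks: list[dict]) -> str:
    """Format retrieved chunks as numbered sources and instruct the model
    to answer only from them, citing sources inline as [n]."""
    context = "\n".join(
        f"[{i}] ({c['act']}, {c['section']}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer strictly from the sources below. Cite each claim inline "
        "as [n]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    {"act": "Inland Revenue Act", "section": "s. 178",
     "text": "A person who fails to file a return on time is liable to a penalty."},
]
prompt = build_rag_prompt("Penalties for late filing?", chunks)
print(prompt)
```

Numbering the sources lets the model's inline `[n]` citations be mapped back to act and section in the UI.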
4. **Chat Memory**
- Session memory allows follow-up questions without losing context.
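Session memory can be as simple as a trimmed turn buffer. LangChain ships ready-made memory classes, so the pure-Python sketch below only illustrates the idea (the max-turn limit is an assumption):

```python
class SessionMemory:
    """Keep the last N user/assistant turns per session for follow-up questions."""

    def __init__(self, max_turns: int = 5):
        self.max_turns = max_turns
        self.turns: list[tuple[str, str]] = []

    def add(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))
        self.turns = self.turns[-self.max_turns:]  # drop oldest turns

    def render(self) -> str:
        """Format the history for inclusion in the next prompt."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)

mem = SessionMemory(max_turns=2)
mem.add("What is the Companies Act?", "It governs companies in Sri Lanka.")
mem.add("Who enforces it?", "The Registrar of Companies.")
mem.add("What about penalties?", "Penalties are set per section.")
print(mem.render())
```

Trimming keeps the prompt within the model's context window while preserving enough history to resolve pronouns like "it" in follow-up questions.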
```
cd server
python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows
pip install -r requirements.txt
uvicorn main:app --reload
```

Server runs on: http://127.0.0.1:8000
```
cd frontend
npm install
npm run dev
```

Frontend runs on: http://127.0.0.1:5173
| Task | Members (Index Numbers) |
|---|---|
| Document Pipeline | 21ug1040, 21ug1287, 21ug1021, 21ug1036, 21ug1066, 21ug1135 |
| Vector Store & Retrieval | 21ug1073, 21ug1287, 21ug1313 |
| LLM Orchestration | 21ug1073, 21ug1287, 21ug0926, 21ug1135 |
| Backend API | 21ug1073, 21ug1287, 21ug0956, 21ug1066 |
| Frontend UX | 21ug1021, 21ug1036, 21ug1073, 21ug1287 |
- “What are the duties of company directors under the Companies Act?”
- “Explain penalties for late filing under the Inland Revenue Act.”
- “What are the minimum wage rules in Sri Lanka?”
- Strictly focused on Sri Lankan business and corporate law.
- Out-of-domain queries are handled gracefully: the assistant explains its scope and suggests related legal questions.
- This is an academic project, not a substitute for professional legal advice.