james-helou/RAG

PDF RAG Pipeline

A local RAG (Retrieval-Augmented Generation) pipeline that indexes PDFs and answers questions about them.

Working Demo

Run the TF-IDF based demo (no external downloads needed):

python minimal_rag_demo.py

This runs a complete RAG pipeline end to end on the sample PDF.
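To make the TF-IDF approach concrete, here is a minimal standard-library sketch of TF-IDF retrieval with cosine similarity. It is not the code in minimal_rag_demo.py: the function names (tokenize, build_tfidf, cosine, retrieve), the smoothed IDF formula, and the sample chunks are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_tfidf(chunks):
    """Return (vectors, idf): one sparse TF-IDF dict per chunk plus the IDF table."""
    tokenized = [tokenize(c) for c in chunks]
    n = len(chunks)
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant) keeps every weight positive.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, topk=4):
    """Return the topk (score, chunk) pairs most similar to the question."""
    vectors, idf = build_tfidf(chunks)
    # Terms that never occur in the indexed chunks cannot contribute to a match.
    q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(question)).items() if t in idf}
    scored = [(cosine(q_vec, v), c) for v, c in zip(vectors, chunks)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:topk]


# Hypothetical stand-ins for chunks extracted from a PDF.
chunks = [
    "The extractor reads each PDF page and normalizes whitespace.",
    "Cosine similarity ranks the stored chunks against the question.",
    "A local language model can write an answer from the retrieved passages.",
]
best_score, best_chunk = retrieve("How does cosine similarity rank chunks?", chunks, topk=1)[0]
print(f"{best_score:.3f}  {best_chunk}")
```

Because TF-IDF needs only term counts, nothing has to be downloaded. The trade-off is that it matches on shared words only, so "rank" in the question does not match "ranks" in the chunk.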

Features

  • PDF Text Extraction: Extracts and normalizes text from PDFs
  • Smart Chunking: Splits text into overlapping chunks for better context
  • Vector Embeddings: Multiple options (TF-IDF, SentenceTransformers)
  • Similarity Search: Finds relevant content using cosine similarity
  • Multiple Implementations: From simple to advanced

Quick Start

1. Basic Usage (Retrieval Only)

python pdf_qa.py --pdf your_document.pdf

This will:

  • Extract and index the PDF
  • Start an interactive Q&A session
  • Show top matching passages for each question

2. With Local LLM (Answer Generation)

python pdf_qa.py --pdf your_document.pdf --use-llm

Uses a local language model to generate concise answers from retrieved passages.

Command Line Options

Option        Default                            Description
--pdf         (required)                         Path to your PDF file
--db          ./pdf_index                        ChromaDB persistence directory
--collection  pdf_qa                             Collection name in ChromaDB
--reindex     False                              Force re-indexing the PDF
--chunk       900                                Chunk size in characters
--overlap     150                                Overlap between chunks
--topk        4                                  Number of passages to retrieve
--use-llm     False                              Use local LLM for answers
--model-id    microsoft/Phi-3-mini-4k-instruct   HuggingFace model ID
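The options above map naturally onto an argparse parser. The sketch below shows one way to declare them, with flags, defaults, and help text taken from the table. The function name build_parser, the description string, and the int types for the numeric options are assumptions, and the actual parser in pdf_qa.py may differ.

```python
import argparse


def build_parser():
    """Build a parser whose flags and defaults mirror the options table."""
    parser = argparse.ArgumentParser(description="Ask questions about a PDF")
    parser.add_argument("--pdf", required=True, help="Path to your PDF file")
    parser.add_argument("--db", default="./pdf_index", help="ChromaDB persistence directory")
    parser.add_argument("--collection", default="pdf_qa", help="Collection name in ChromaDB")
    # Boolean flags default to False and become True when present.
    parser.add_argument("--reindex", action="store_true", help="Force re-indexing the PDF")
    parser.add_argument("--chunk", type=int, default=900, help="Chunk size in characters")
    parser.add_argument("--overlap", type=int, default=150, help="Overlap between chunks")
    parser.add_argument("--topk", type=int, default=4, help="Number of passages to retrieve")
    parser.add_argument("--use-llm", action="store_true", help="Use local LLM for answers")
    parser.add_argument(
        "--model-id",
        default="microsoft/Phi-3-mini-4k-instruct",
        help="HuggingFace model ID",
    )
    return parser


# Equivalent to: python pdf_qa.py --pdf manual.pdf --topk 8
args = build_parser().parse_args(["--pdf", "manual.pdf", "--topk", "8"])
print(args.pdf, args.topk, args.chunk, args.use_llm)  # manual.pdf 8 900 False
```

argparse converts dashes in flag names to underscores, so --use-llm is read as args.use_llm and --model-id as args.model_id.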

Examples

Index and query a PDF

python pdf_qa.py --pdf research_paper.pdf

Query with more context

python pdf_qa.py --pdf manual.pdf --topk 8

Force re-indexing

python pdf_qa.py --pdf document.pdf --reindex

Use with custom chunking

python pdf_qa.py --pdf large_doc.pdf --chunk 1500 --overlap 200
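To see how --chunk and --overlap interact, here is a minimal sketch of character-based chunking with overlap. The function name chunk_text and the exact windowing rule are assumptions, not the implementation in pdf_qa.py. The key relationship is that each window starts chunk - overlap characters after the previous one.

```python
def chunk_text(text, chunk=900, overlap=150):
    """Split text into windows of `chunk` characters that share `overlap` characters."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be non-negative and smaller than chunk")
    step = chunk - overlap  # distance between the starts of consecutive windows
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk])
        if start + chunk >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "x" * 3000  # stand-in for 3000 characters of extracted PDF text
default = chunk_text(text)                          # --chunk 900 --overlap 150
custom = chunk_text(text, chunk=1500, overlap=200)  # --chunk 1500 --overlap 200
print(len(default), len(custom))  # 4 3
```

Larger chunks give each retrieved passage more surrounding context but make matches less specific. A larger overlap reduces the chance that a sentence is split across two chunks, at the cost of indexing more duplicated text.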

How It Works

  1. Text Extraction: Reads PDF pages using pypdf
  2. Chunking: Splits text into overlapping chunks for better context
  3. Embedding: Converts chunks to vectors using all-MiniLM-L6-v2
  4. Storage: Saves embeddings in ChromaDB (persistent)
  5. Retrieval: Finds most relevant chunks using cosine similarity
  6. Generation (optional): Uses local LLM to synthesize answers
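Step 6 depends on how the retrieved passages are handed to the model. The sketch below shows one plausible way to assemble a prompt from the top-k passages. The function name build_prompt, the instruction wording, and the max_chars budget are all assumptions for illustration. The prompt format actually used by pdf_qa.py is not documented here.

```python
def build_prompt(question, passages, max_chars=3000):
    """Assemble a grounded prompt from retrieved passages, most relevant first."""
    context_parts = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        entry = f"[{i}] {passage.strip()}"
        if used + len(entry) > max_chars:
            break  # stop adding passages once the assumed character budget is spent
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved passages, ordered by similarity score.
passages = [
    "Chunk size defaults to 900 characters.",
    "Overlap defaults to 150 characters.",
]
print(build_prompt("What is the default chunk size?", passages))
```

Numbering the passages lets the generated answer be traced back to its source passage, and the character budget is a simple guard against exceeding a small context window.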

Simple Example (mini_rag.py)

A minimal example showing the core concepts:

python mini_rag.py

This demonstrates RAG basics with tiny in-memory documents.

Tips

  • The first run downloads the embedding model (~80MB) and caches it for later runs
  • The first run with --use-llm downloads a much larger model (~7GB for Phi-3)
  • Indexed PDFs persist in the --db directory for reuse
  • Adjust --chunk and --overlap for different document types
  • Lower --topk for focused answers, higher for broader context

Requirements

The code depends on the following packages:

  • pypdf
  • sentence-transformers
  • chromadb
  • transformers
  • torch
