PDF RAG Pipeline

A local RAG (Retrieval-Augmented Generation) pipeline that indexes PDFs and answers questions about them.

Working Demo

Run the TF-IDF-based demo (no model downloads needed):

python minimal_rag_demo.py

This runs a complete RAG pipeline end to end on the sample PDF.
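To make the TF-IDF idea concrete, here is a minimal, standard-library-only sketch of TF-IDF retrieval with cosine similarity. It is not the contents of minimal_rag_demo.py; the function names (`tokenize`, `build_tfidf`, `vectorize`, `cosine`, `search`), the smoothed IDF formula, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?()\"'") for word in text.lower().split())
    return [t for t in tokens if t]


def vectorize(tokens, idf):
    """Weight raw term counts by IDF; terms unseen at index time are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_tfidf(docs):
    """Return (vectors, idf); each vector is a sparse {term: weight} dict."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant) keeps every weight positive.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [vectorize(tokens, idf) for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, vectors, idf, top_k=2):
    """Return the top_k (document, score) pairs, best match first."""
    query_vec = vectorize(tokenize(query), idf)
    scored = sorted(((cosine(query_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(docs[i], score) for score, i in scored[:top_k]]


# Hypothetical sample documents standing in for PDF chunks.
docs = [
    "Chunks overlap so that sentences are not cut off from their context.",
    "ChromaDB stores embeddings on disk.",
    "TF-IDF weights rare terms more heavily than common ones.",
]
vectors, idf = build_tfidf(docs)
for doc, score in search("where are embeddings stored on disk", docs, vectors, idf):
    print(f"{score:.3f}  {doc}")
```

Because the vectors are plain dictionaries, nothing has to be downloaded, which is the main appeal of TF-IDF as a baseline. The trade-off is that it only matches exact tokens ("stored" does not match "stores"), which is where learned embeddings help.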

Features

PDF Text Extraction: Extracts and normalizes text from PDFs
Smart Chunking: Splits text into overlapping chunks for better context
Vector Embeddings: Multiple options (TF-IDF, SentenceTransformers)
Similarity Search: Finds relevant content using cosine similarity
Multiple Implementations: From a minimal in-memory example (mini_rag.py) to the full command-line tool (pdf_qa.py)

Quick Start

1. Basic Usage (Retrieval Only)

python pdf_qa.py --pdf your_document.pdf

This will:

Extract and index the PDF
Start an interactive Q&A session
Show top matching passages for each question

2. With Local LLM (Answer Generation)

python pdf_qa.py --pdf your_document.pdf --use-llm

Uses a local language model to generate concise answers from retrieved passages.
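One common way to have a model answer from retrieved passages is to paste them into the prompt ahead of the question. The sketch below shows that idea only; the function name `build_prompt`, the `max_chars` budget, and the instruction wording are hypothetical and not taken from pdf_qa.py.

```python
def build_prompt(question, passages, max_chars=3000):
    """Assemble a grounded prompt from retrieved passages (hypothetical format)."""
    context_parts = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        entry = f"[{i}] {passage.strip()}"
        # Stop adding passages once the (assumed) character budget is exceeded.
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the passages below. "
        "If the answer is not in the passages, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What chunk size is used by default?",
    ["The default chunk size is 900 characters.", "Chunks overlap by 150 characters."],
)
print(prompt)
```

Numbering the passages makes it easier to trace an answer back to its source, and the character budget is a rough guard against exceeding a small model's context window.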

Command Line Options

Option	Default	Description
`--pdf`	(required)	Path to your PDF file
`--db`	`./pdf_index`	ChromaDB persistence directory
`--collection`	`pdf_qa`	Collection name in ChromaDB
`--reindex`	False	Force re-indexing the PDF
`--chunk`	900	Chunk size in characters
`--overlap`	150	Overlap between consecutive chunks, in characters
`--topk`	4	Number of passages to retrieve
`--use-llm`	False	Use local LLM for answers
`--model-id`	`microsoft/Phi-3-mini-4k-instruct`	HuggingFace model ID
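For readers who want to see how these options map onto code, the table can be expressed as an `argparse` parser. This is a reconstruction from the table above, not the actual source of pdf_qa.py; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options table (reconstructed, not the original)."""
    parser = argparse.ArgumentParser(description="Ask questions about a PDF")
    parser.add_argument("--pdf", required=True, help="Path to your PDF file")
    parser.add_argument("--db", default="./pdf_index", help="ChromaDB persistence directory")
    parser.add_argument("--collection", default="pdf_qa", help="Collection name in ChromaDB")
    parser.add_argument("--reindex", action="store_true", help="Force re-indexing the PDF")
    parser.add_argument("--chunk", type=int, default=900, help="Chunk size in characters")
    parser.add_argument("--overlap", type=int, default=150, help="Overlap between chunks")
    parser.add_argument("--topk", type=int, default=4, help="Number of passages to retrieve")
    parser.add_argument("--use-llm", action="store_true", help="Use local LLM for answers")
    parser.add_argument(
        "--model-id",
        default="microsoft/Phi-3-mini-4k-instruct",
        help="HuggingFace model ID",
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--pdf", "manual.pdf", "--topk", "8"])
print(args.pdf, args.topk, args.use_llm, args.model_id)
```

Note that `argparse` turns dashes into underscores, so `--use-llm` and `--model-id` are read as `args.use_llm` and `args.model_id`.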

Examples

Index and query a PDF

python pdf_qa.py --pdf research_paper.pdf

Query with more context

python pdf_qa.py --pdf manual.pdf --topk 8

Force re-indexing

python pdf_qa.py --pdf document.pdf --reindex

Use with custom chunking

python pdf_qa.py --pdf large_doc.pdf --chunk 1500 --overlap 200
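To see how `--chunk` and `--overlap` interact, here is a minimal character-based sliding-window chunker. It is a sketch under the assumption that chunks are fixed-size character windows; the function name `chunk_text` is hypothetical and the real pdf_qa.py may split differently (for example, on sentence boundaries).

```python
def chunk_text(text, chunk_size=900, overlap=150):
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 4000-character text, using the values from the command above.
text = "".join(chr(97 + i % 26) for i in range(4000))
chunks = chunk_text(text, chunk_size=1500, overlap=200)
print(len(chunks), [len(c) for c in chunks])
```

Each window advances by `chunk_size - overlap` characters, so a larger overlap produces more chunks for the same document and a larger index.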

How It Works

Text Extraction: Reads PDF pages using pypdf
Chunking: Splits text into overlapping chunks for better context
Embedding: Converts chunks to vectors using all-MiniLM-L6-v2
Storage: Saves embeddings in ChromaDB (persistent)
Retrieval: Finds most relevant chunks using cosine similarity
Generation (optional): Uses local LLM to synthesize answers
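The retrieval step above can be sketched as a cosine-similarity ranking over dense vectors. In the real pipeline the vectors would come from all-MiniLM-L6-v2 (384 dimensions) and the search would be handled by ChromaDB; the toy 3-dimensional vectors and the function names `cosine_similarity` and `top_k` below are assumptions used only to show the ranking logic.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=4):
    """Return (chunk_index, score) pairs for the k most similar chunks, best first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy stand-ins for chunk embeddings and a query embedding.
chunk_vecs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.1, 0.0],
    [0.5, 0.5, 0.0],
]
query_vec = [1.0, 0.0, 0.0]
for index, score in top_k(query_vec, chunk_vecs, k=2):
    print(index, round(score, 3))
```

A brute-force scan like this is fine for a single PDF; a vector store earns its keep by persisting the index between runs and scaling to larger collections.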

Simple Example (mini_rag.py)

A minimal example showing the core concepts:

python mini_rag.py

This demonstrates RAG basics with tiny in-memory documents.

Tips

The first run downloads the embedding model (~80 MB), which is then cached for later runs
Using --use-llm downloads a much larger model (~7 GB for Phi-3)
Indexed PDFs persist in the --db directory for reuse
Adjust --chunk and --overlap for different document types
Lower --topk for focused answers, higher for broader context

Requirements

The code depends on the following packages:

pypdf
sentence-transformers
chromadb
transformers
torch
