A local RAG (Retrieval-Augmented Generation) pipeline that indexes PDFs and answers questions about them.

Run the TF-IDF based demo (no external downloads needed):

`python minimal_rag_demo.py`

This demonstrates a complete RAG pipeline with your sample PDF.

- PDF Text Extraction: Extracts and normalizes text from PDFs
- Smart Chunking: Splits text into overlapping chunks for better context
- Vector Embeddings: Multiple options (TF-IDF, SentenceTransformers)
- Similarity Search: Finds relevant content using cosine similarity
- Multiple Implementations: From simple to advanced
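
The TF-IDF option and the cosine-similarity search listed above can be illustrated with a small standard-library sketch. This is not the code from `minimal_rag_demo.py`: the function names (`tokenize`, `build_idf`, `vectorize`, `cosine`, `search`) and the smoothed IDF formula are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(chunks):
    """Compute a smoothed inverse document frequency for every term (assumed formula)."""
    n = len(chunks)
    doc_freq = Counter()
    for chunk in chunks:
        doc_freq.update(set(tokenize(chunk)))
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def vectorize(text, idf):
    """Map text to a sparse {term: tf * idf} vector; terms unknown to the index are dropped."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, top_k=4):
    """Return the top_k (chunk, score) pairs, most similar first."""
    idf = build_idf(chunks)
    vectors = [vectorize(c, idf) for c in chunks]
    q = vectorize(query, idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(chunks[i], score) for score, i in scored[:top_k]]


chunks = [
    "The cat sat on the mat.",
    "Neural networks learn embeddings.",
    "Cosine similarity compares vectors.",
]
best_chunk, best_score = search("how do networks learn embeddings", chunks, top_k=1)[0]
print(best_chunk)
```

Sparse dictionaries keep the sketch dependency-free. A real implementation would more likely use a vectorizer from a library, or the SentenceTransformers embeddings mentioned above.
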

`python pdf_qa.py --pdf your_document.pdf`

This will:
- Extract and index the PDF
- Start an interactive Q&A session
- Show top matching passages for each question

`python pdf_qa.py --pdf your_document.pdf --use-llm`

Uses a local language model to generate concise answers from retrieved passages.

| Option | Default | Description |
|---|---|---|
| `--pdf` | (required) | Path to your PDF file |
| `--db` | `./pdf_index` | ChromaDB persistence directory |
| `--collection` | `pdf_qa` | Collection name in ChromaDB |
| `--reindex` | `False` | Force re-indexing the PDF |
| `--chunk` | `900` | Chunk size in characters |
| `--overlap` | `150` | Overlap between chunks |
| `--topk` | `4` | Number of passages to retrieve |
| `--use-llm` | `False` | Use local LLM for answers |
| `--model-id` | `microsoft/Phi-3-mini-4k-instruct` | HuggingFace model ID |
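
As a rough illustration of how the options in the table map to a command-line parser, here is a minimal `argparse` sketch. The `build_parser` function is hypothetical and the actual parsing code in `pdf_qa.py` may differ; only the flag names and defaults are taken from the table.

```python
import argparse


def build_parser():
    """Declare the documented options with their documented defaults (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Ask questions about a PDF.")
    parser.add_argument("--pdf", required=True, help="Path to your PDF file")
    parser.add_argument("--db", default="./pdf_index", help="ChromaDB persistence directory")
    parser.add_argument("--collection", default="pdf_qa", help="Collection name in ChromaDB")
    parser.add_argument("--reindex", action="store_true", help="Force re-indexing the PDF")
    parser.add_argument("--chunk", type=int, default=900, help="Chunk size in characters")
    parser.add_argument("--overlap", type=int, default=150, help="Overlap between chunks")
    parser.add_argument("--topk", type=int, default=4, help="Number of passages to retrieve")
    parser.add_argument("--use-llm", action="store_true", help="Use local LLM for answers")
    parser.add_argument(
        "--model-id",
        default="microsoft/Phi-3-mini-4k-instruct",
        help="HuggingFace model ID",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--pdf", "manual.pdf", "--topk", "8"])
print(args.pdf, args.topk, args.chunk, args.use_llm)
```

Boolean flags such as `--reindex` and `--use-llm` are modelled with `store_true`, which matches the `False` defaults in the table.
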

`python pdf_qa.py --pdf research_paper.pdf`

`python pdf_qa.py --pdf manual.pdf --topk 8`

`python pdf_qa.py --pdf document.pdf --reindex`

`python pdf_qa.py --pdf large_doc.pdf --chunk 1500 --overlap 200`

- Text Extraction: Reads PDF pages using pypdf
- Chunking: Splits text into overlapping chunks for better context
- Embedding: Converts chunks to vectors using `all-MiniLM-L6-v2`
- Storage: Saves embeddings in ChromaDB (persistent)
- Retrieval: Finds most relevant chunks using cosine similarity
- Generation (optional): Uses local LLM to synthesize answers
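
The chunking step in the list above can be sketched as a sliding character window. This is a minimal illustration that assumes fixed-size character windows. The `chunk_text` function is hypothetical, and the real splitter in `pdf_qa.py` may handle sentence or page boundaries differently.

```python
def chunk_text(text, chunk_size=900, overlap=150):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Synthetic 2000-character text so the overlap is easy to inspect.
text = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(text)
print(len(chunks), [len(c) for c in chunks])
```

With the defaults, each chunk starts 750 characters after the previous one, so the last 150 characters of one chunk reappear at the start of the next.
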

A minimal example showing the core concepts:

`python mini_rag.py`

This demonstrates RAG basics with tiny in-memory documents.

- First run downloads the embedding model (~80MB) - this is cached
- Using `--use-llm` downloads a larger model (~7GB for Phi-3)
- Indexed PDFs persist in the `--db` directory for reuse
- Adjust `--chunk` and `--overlap` for different document types
- Lower `--topk` for focused answers, higher for broader context
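
To get a feel for how `--chunk` and `--overlap` affect index size, the chunk count can be estimated with simple arithmetic. The sketch below assumes fixed-size character windows that advance by `chunk - overlap` characters. The `estimate_chunk_count` helper is hypothetical and the 100,000-character document length is an arbitrary example, so the real counts from `pdf_qa.py` may differ.

```python
import math


def estimate_chunk_count(text_length, chunk_size=900, overlap=150):
    """Estimate how many sliding-window chunks a text of the given length yields."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - overlap  # characters gained by each additional chunk
    return math.ceil((text_length - chunk_size) / step) + 1


# Compare the default settings with the larger-chunk example from this README.
for chunk_size, overlap in [(900, 150), (1500, 200)]:
    count = estimate_chunk_count(100_000, chunk_size, overlap)
    print(f"--chunk {chunk_size} --overlap {overlap}: about {count} chunks")
```

Under these assumptions the defaults yield about 134 chunks and the larger setting about 77, so bigger chunks mean fewer vectors to store and search, at the cost of less precise passages.
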
See requirements in the code:
- pypdf
- sentence-transformers
- chromadb
- transformers
- torch