This project implements a Multimodal Retrieval-Augmented Generation (RAG) system that extracts and processes text, tables, and images (including graphs and charts) from PDF documents. It generates captions for images using OpenAI’s GPT-4o, builds a vector index over all extracted content, and answers user queries by retrieving relevant multimodal information.
- Extracts text content from PDFs using LangChain’s `PyPDFLoader`.
- Extracts tables from PDFs using `pdfplumber` and converts them into textual documents.
- Extracts images from PDFs using PyMuPDF (`fitz`), saves them locally, and generates captions describing charts, plots, or images with OpenAI GPT-4o (see the extraction sketch after this list).
- Combines text, tables, and image captions into a unified document corpus.
- Creates a FAISS vector index with OpenAI embeddings for efficient retrieval.
- Uses a retrieval-augmented generation chain with GPT-4o to answer queries based on multimodal PDF content.
- Specifically designed to understand and explain graphical data and visual elements within PDFs.
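The extraction stage might look roughly like the sketch below. It is illustrative only: the import paths assume a recent LangChain (`langchain_community`/`langchain_core`) and the OpenAI Python SDK v1, and the `caption_image` and `extract_documents` names are hypothetical, not taken from the project's `main.py`.

```python
import base64

import fitz  # PyMuPDF
import pdfplumber
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def caption_image(image_bytes: bytes) -> str:
    """Ask GPT-4o to describe a chart, plot, or other image extracted from a PDF."""
    b64 = base64.b64encode(image_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart, plot, or image in detail."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def extract_documents(pdf_file: str) -> list[Document]:
    """Collect text, table, and image-caption Documents from a single PDF."""
    docs: list[Document] = []

    # 1. Page text via LangChain's PyPDFLoader (one Document per page).
    docs.extend(PyPDFLoader(pdf_file).load())

    # 2. Tables via pdfplumber, flattened into pipe-separated rows of text.
    with pdfplumber.open(pdf_file) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                rows = "\n".join(" | ".join(cell or "" for cell in row) for row in table)
                docs.append(Document(page_content=rows,
                                     metadata={"type": "table", "page": page_no}))

    # 3. Embedded images via PyMuPDF, each captioned with GPT-4o.
    #    (The actual script also saves each extracted image to disk.)
    pdf_doc = fitz.open(pdf_file)
    for page_no in range(len(pdf_doc)):
        for img in pdf_doc[page_no].get_images(full=True):
            image_bytes = pdf_doc.extract_image(img[0])["image"]
            docs.append(Document(page_content=caption_image(image_bytes),
                                 metadata={"type": "image", "page": page_no + 1}))

    return docs
```

Captioning each image with GPT-4o turns purely visual content such as graphs and charts into text that can be embedded and retrieved alongside the rest of the corpus.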
Install the dependencies:

```bash
pip install langchain openai pdfplumber pymupdf pillow faiss-cpu python-dotenv
```

Create a `.env` file in the project root with your OpenAI API key:

```
OPENAI_API_KEY=your_openai_api_key_here
```
- Put your PDF file in the project directory and update the `pdf_file` variable in the script.
- Run the script:

  ```bash
  python main.py
  ```

- The script will:
  - Extract text, tables, and images with captions from the PDF.
  - Index all extracted documents.
  - Answer queries such as "What do the graphs show in this PDF?" (see the sketch after this list).
  - Print the generated answer.
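For orientation, the indexing and querying stage might look roughly like this. It builds on the hypothetical `extract_documents` helper from the extraction sketch above; the exact LangChain import paths (`langchain_openai`, `langchain_community`) and the use of `RetrievalQA` are assumptions and may differ from the actual `main.py`.

```python
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

load_dotenv()  # load OPENAI_API_KEY from the .env file

pdf_file = "example.pdf"            # replace with your PDF
docs = extract_documents(pdf_file)  # helper from the extraction sketch above

# Build a FAISS vector index over the text, table, and image-caption documents.
vector_store = FAISS.from_documents(docs, OpenAIEmbeddings())

# Retrieval-augmented generation chain backed by GPT-4o.
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
)

result = qa_chain.invoke({"query": "What do the graphs show in this PDF?"})
print(result["result"])
```

Because image captions are indexed as ordinary documents, a question about the graphs retrieves those captions along with nearby text and tables, and GPT-4o answers from that multimodal context.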
For any questions or clarifications, please contact Raza Mehar at [[email protected]].