This project is a Python application that extracts and processes data from PDF files. It uses image processing techniques to correct the perspective of the images and OCR (Optical Character Recognition) to extract the data.
- Convert PDF files to images
- Correct the perspective of the images
- Extract data from the images using OCR
- Save the extracted data to an Excel file
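
The four steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function names (`order_corners`, `correct_perspective`, `pdf_to_rows`, `save_rows`), the default page size, and the assumption that the four page corners are already known are all hypothetical. Third-party imports are deferred into the functions that need them so the pure corner-ordering helper can be used on its own.

```python
def order_corners(points):
    """Order four (x, y) points as top-left, top-right, bottom-right, bottom-left."""
    by_sum = sorted(points, key=lambda p: p[0] + p[1])
    by_diff = sorted(points, key=lambda p: p[0] - p[1])
    top_left, bottom_right = by_sum[0], by_sum[-1]    # smallest / largest x + y
    bottom_left, top_right = by_diff[0], by_diff[-1]  # smallest / largest x - y
    return [top_left, top_right, bottom_right, bottom_left]


def correct_perspective(image, corners, width, height):
    """Warp the quadrilateral given by `corners` onto a width x height rectangle."""
    import cv2
    import numpy as np

    src = np.array(order_corners(corners), dtype="float32")
    dst = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
        dtype="float32",
    )
    matrix = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, matrix, (width, height))


def pdf_to_rows(pdf_path, corners, size=(1240, 1754)):
    """Convert each PDF page to an image, straighten it, and OCR it into text lines."""
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    rows = []
    for page in convert_from_path(pdf_path):  # one PIL image per page
        flat = correct_perspective(np.array(page), corners, *size)
        text = pytesseract.image_to_string(flat)
        rows.append([line for line in text.splitlines() if line.strip()])
    return rows


def save_rows(rows, xlsx_path):
    """Write one spreadsheet row per page, one cell per recognized text line."""
    from openpyxl import Workbook

    workbook = Workbook()
    sheet = workbook.active
    for row in rows:
        sheet.append(row)
    workbook.save(xlsx_path)
```

With the dependencies installed, a call such as `save_rows(pdf_to_rows("input_pdf/sample.pdf", corners), "output_excel/sample.xlsx")` would chain the steps; the file names here are placeholders.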
- Python 3.8 or higher
- OpenCV
- PyTesseract
- Openpyxl
- pdf2image
- poppler-utils
- Clone the repository:
git clone https://github.com/sacaaa/pdf-to-excel
- Install the requirements:
pip install opencv-python
pip install pytesseract
pip install openpyxl
pip install pdf2image
- Download the Poppler and Tesseract OCR libraries and place them in the assets folder.
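
One way to point the Python wrappers at binaries stored in the assets folder is sketched below. The sub-folder layout (`assets/poppler/Library/bin`, `assets/Tesseract-OCR/tesseract.exe`) and the helper names are assumptions for illustration only; adjust them to wherever the downloads were actually unpacked. The sketch relies on `pytesseract.pytesseract.tesseract_cmd` and the `poppler_path` argument of `pdf2image.convert_from_path`.

```python
from pathlib import Path

ASSETS = Path("assets")
# Hypothetical locations inside the assets folder; the real layout may differ.
POPPLER_BIN = ASSETS / "poppler" / "Library" / "bin"
TESSERACT_EXE = ASSETS / "Tesseract-OCR" / "tesseract.exe"


def missing_assets(paths=(POPPLER_BIN, TESSERACT_EXE)):
    """Return the paths (as strings) that do not exist on disk."""
    return [str(path) for path in paths if not path.exists()]


def configure_ocr_tools():
    """Point pytesseract at the bundled binary and return kwargs for pdf2image."""
    import pytesseract  # deferred so missing_assets() works without it installed

    pytesseract.pytesseract.tesseract_cmd = str(TESSERACT_EXE)
    return {"poppler_path": str(POPPLER_BIN)}
```

A script could call `missing_assets()` at startup to fail early with a clear message, then pass the result of `configure_ocr_tools()` as keyword arguments to `convert_from_path`.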
- Place your PDF files in the input_pdf directory and update the COORDINATES dictionary to match the layout of your PDF.
- Run the script:
python main.py
- The processed data will be saved in the output_excel directory.
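
The actual shape of the COORDINATES dictionary is defined in the project source and is not documented here. As a purely hypothetical illustration, the sketch below assumes it maps a field name to a `(left, top, right, bottom)` pixel box on the perspective-corrected page; the field names, the box values, and the helpers `check_boxes` and `extract_fields` are invented for the example.

```python
# Hypothetical example: field name -> (left, top, right, bottom) in pixels.
COORDINATES = {
    "invoice_number": (820, 120, 1180, 170),
    "date": (820, 180, 1180, 230),
}


def check_boxes(coordinates, page_width, page_height):
    """Return the names of fields whose box is empty or falls outside the page."""
    bad = []
    for name, (left, top, right, bottom) in coordinates.items():
        inside = 0 <= left < right <= page_width and 0 <= top < bottom <= page_height
        if not inside:
            bad.append(name)
    return bad


def extract_fields(page, coordinates):
    """OCR each boxed region of a PIL page image and return {field: text}."""
    import pytesseract  # deferred so check_boxes() works without it installed

    return {
        name: pytesseract.image_to_string(page.crop(box)).strip()
        for name, box in coordinates.items()
    }
```

Running `check_boxes` against the page size before OCR is a cheap way to catch coordinates that were copied from a PDF with a different layout or resolution.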