sacaaa/pdf-to-excel

PDF Data Extraction and Processing

This project is a Python application that extracts and processes data from PDF files. It uses image processing techniques to correct the perspective of the images and OCR (Optical Character Recognition) to extract the data.

Features

  • Convert PDF files to images
  • Correct the perspective of the images
  • Extract data from the images using OCR
  • Save the extracted data to an Excel file
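
The README does not say how the perspective correction is implemented. As a minimal sketch, assuming the four corners of the page have already been located by some other step, a standard four-point warp with OpenCV could look like the following. The function names `order_corners` and `correct_perspective` are hypothetical and are not taken from the repository.

```python
def order_corners(points):
    """Order four (x, y) points as top-left, top-right, bottom-right, bottom-left.

    Image coordinates are assumed: x grows to the right, y grows downward.
    """
    pts = [tuple(p) for p in points]
    top_left = min(pts, key=lambda p: p[0] + p[1])      # smallest x + y
    bottom_right = max(pts, key=lambda p: p[0] + p[1])  # largest x + y
    top_right = max(pts, key=lambda p: p[0] - p[1])     # largest x - y
    bottom_left = min(pts, key=lambda p: p[0] - p[1])   # smallest x - y
    return [top_left, top_right, bottom_right, bottom_left]


def correct_perspective(image, corners, width, height):
    """Warp the quadrilateral given by `corners` onto a width x height rectangle.

    `image` is expected to be a NumPy array, e.g. np.array(pil_page) for a
    page produced by pdf2image. Third-party imports are done lazily so the
    helper above stays usable without OpenCV installed.
    """
    import cv2
    import numpy as np

    src = np.array(order_corners(corners), dtype="float32")
    dst = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
        dtype="float32",
    )
    matrix = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, matrix, (width, height))
```

Ordering the corners first keeps the source and destination points in the same sequence, which `cv2.getPerspectiveTransform` requires to produce a sensible mapping.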

Requirements

  • Python 3.8 or higher
  • OpenCV
  • PyTesseract
  • Openpyxl
  • pdf2image
  • poppler-utils
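
To check an environment against this list, a small script along the following lines may help. It is only a sketch: the import names (`cv2` for OpenCV, and so on) are the usual ones for these packages, and the executable names `pdftoppm` and `tesseract` are assumptions about what Poppler and Tesseract provide. If the binaries live in the project's assets folder rather than on `PATH`, the tool check will report them as missing even though the application may still find them.

```python
import importlib.util
import shutil
import sys

# Import names for the Python packages (opencv-python is imported as cv2).
REQUIRED_MODULES = ["cv2", "pytesseract", "openpyxl", "pdf2image"]
# Executables assumed to come with poppler-utils and Tesseract OCR.
REQUIRED_TOOLS = ["pdftoppm", "tesseract"]


def missing_requirements(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or higher is required")
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing Python module: {name}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"executable not on PATH: {tool}")
    return problems
```

Calling `missing_requirements()` and printing each entry gives a quick list of what still needs to be installed.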

Installation

  1. Clone the repository:
git clone https://github.com/sacaaa/pdf-to-excel
  2. Install the requirements:
pip install opencv-python
pip install pytesseract
pip install openpyxl
pip install pdf2image

Poppler and Tesseract OCR are not installed by pip. Download them separately and place them in the assets folder.
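
Because the binaries sit in the assets folder rather than on the system path, the Python libraries need to be told where to find them. The sketch below shows one way to do that. The sub-folder names under `assets` and the Windows-style `tesseract.exe` are assumptions for illustration, and the helper names are hypothetical; adjust them to the actual layout.

```python
from pathlib import Path

# Hypothetical layout inside the assets folder; change to match your download.
ASSETS_DIR = Path("assets")
POPPLER_BIN = ASSETS_DIR / "poppler" / "bin"
TESSERACT_EXE = ASSETS_DIR / "tesseract" / "tesseract.exe"


def pdf_to_images(pdf_path, dpi=300, poppler_bin=POPPLER_BIN):
    """Render each page of a PDF to a PIL image using the bundled Poppler."""
    from pdf2image import convert_from_path  # imported lazily

    return convert_from_path(str(pdf_path), dpi=dpi, poppler_path=str(poppler_bin))


def configure_tesseract(tesseract_exe=TESSERACT_EXE):
    """Point pytesseract at the bundled Tesseract executable."""
    import pytesseract  # imported lazily

    pytesseract.pytesseract.tesseract_cmd = str(tesseract_exe)
    return pytesseract.pytesseract.tesseract_cmd
```

`poppler_path` and `tesseract_cmd` are the documented hooks in pdf2image and pytesseract for locating binaries that are not on `PATH`.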

Usage

  1. Place your PDF files in the input_pdf directory and update the COORDINATES dictionary to match the layout of your PDF.
  2. Run the script:
python main.py
  3. The processed data will be saved in the output_excel directory.
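
The exact shape of the COORDINATES dictionary is not documented here. As an illustration only, the sketch below assumes it maps a field name to a `(left, top, right, bottom)` pixel box on the page image, crops each box, runs OCR on it, and writes the results to an Excel file. The field names, box values, and the helpers `crop_box`, `extract_fields`, and `save_rows` are all hypothetical.

```python
# Assumed shape: field name -> (left, top, right, bottom) in pixels.
COORDINATES = {
    "name": (100, 200, 600, 260),
    "date": (650, 200, 900, 260),
}


def crop_box(box, image_width, image_height):
    """Clamp a (left, top, right, bottom) box to the image bounds and validate it."""
    left, top, right, bottom = box
    left, right = max(0, left), min(image_width, right)
    top, bottom = max(0, top), min(image_height, bottom)
    if left >= right or top >= bottom:
        raise ValueError(f"empty region: {box}")
    return left, top, right, bottom


def extract_fields(page, coordinates=COORDINATES):
    """Run OCR on each configured region of a PIL page image."""
    import pytesseract  # imported lazily

    fields = {}
    for name, box in coordinates.items():
        region = page.crop(crop_box(box, page.width, page.height))
        fields[name] = pytesseract.image_to_string(region).strip()
    return fields


def save_rows(rows, path, columns):
    """Write a list of field dictionaries to an Excel sheet, one row per page."""
    from openpyxl import Workbook  # imported lazily

    workbook = Workbook()
    sheet = workbook.active
    sheet.append(list(columns))  # header row
    for row in rows:
        sheet.append([row.get(column, "") for column in columns])
    workbook.save(path)
```

Clamping the boxes before cropping keeps a slightly oversized region from silently producing padding at the page edge, and a box that falls entirely outside the page raises an error instead of yielding an empty OCR result.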
