PDF Data Extraction and Processing

This project is a Python application that extracts data from PDF files and saves it to an Excel file. It converts each PDF page to an image, corrects the perspective of the image with image processing techniques, and then reads the data with OCR (Optical Character Recognition).

Features

Convert PDF files to images
Correct the perspective of the images
Extract data from the images using OCR
Save the extracted data to an Excel file
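
The perspective-correction step can be sketched roughly as follows. This is a minimal illustration and not necessarily how main.py does it: it assumes the four page corners have already been located (for example with contour detection), and the function names order_corners and correct_perspective are hypothetical. OpenCV and NumPy are imported inside the function so the corner-ordering helper can be used on its own.

```python
import math


def order_corners(points):
    """Order four (x, y) points as top-left, top-right, bottom-right, bottom-left.

    Uses the common heuristic: top-left has the smallest x + y,
    bottom-right the largest x + y, top-right the smallest y - x,
    and bottom-left the largest y - x.
    """
    top_left = min(points, key=lambda p: p[0] + p[1])
    bottom_right = max(points, key=lambda p: p[0] + p[1])
    top_right = min(points, key=lambda p: p[1] - p[0])
    bottom_left = max(points, key=lambda p: p[1] - p[0])
    return [top_left, top_right, bottom_right, bottom_left]


def correct_perspective(image, corners):
    """Warp the quadrilateral given by `corners` to an upright rectangle.

    `image` is an OpenCV image (NumPy array); `corners` are four (x, y)
    points in any order. How the corners are found is an assumption
    left outside this sketch.
    """
    import cv2
    import numpy as np

    tl, tr, br, bl = order_corners(corners)
    # Output size: the longer of each pair of opposite edges.
    width = int(max(math.dist(tl, tr), math.dist(bl, br)))
    height = int(max(math.dist(tl, bl), math.dist(tr, br)))

    src = np.array([tl, tr, br, bl], dtype=np.float32)
    dst = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
        dtype=np.float32,
    )
    matrix = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, matrix, (width, height))
```

Ordering the corners first matters because the transform maps source points to destination points by position, so a mismatched order would mirror or rotate the page.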

Requirements

Python 3.8 or higher
OpenCV
PyTesseract
openpyxl
pdf2image
poppler-utils

Installation

Clone the repository:

git clone https://github.com/sacaaa/pdf-to-excel

Install the requirements:

pip install opencv-python
pip install pytesseract
pip install openpyxl
pip install pdf2image

Download the Poppler and Tesseract OCR binaries and copy them into the assets folder.
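
As a rough sketch of how bundled binaries are typically wired up, the snippet below points pdf2image and pytesseract at the assets folder. The sub-paths under assets (poppler/bin and tesseract/tesseract.exe) and the helper names are assumptions for illustration; adjust them to wherever the files actually end up and to what main.py expects.

```python
from pathlib import Path

ASSETS = Path("assets")
# Hypothetical locations inside the assets folder; adjust to your layout.
POPPLER_BIN = ASSETS / "poppler" / "bin"
TESSERACT_EXE = ASSETS / "tesseract" / "tesseract.exe"


def missing_assets(paths):
    """Return the given paths (as strings) that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


def configure_tesseract():
    """Tell pytesseract which Tesseract executable to run."""
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = str(TESSERACT_EXE)


def pdf_to_images(pdf_path, dpi=300):
    """Render each PDF page to a PIL image using the bundled Poppler."""
    from pdf2image import convert_from_path

    return convert_from_path(str(pdf_path), dpi=dpi, poppler_path=str(POPPLER_BIN))


if __name__ == "__main__":
    missing = missing_assets([POPPLER_BIN, TESSERACT_EXE])
    if missing:
        print("Missing assets:", ", ".join(missing))
    else:
        configure_tesseract()
```

Checking for the binaries up front gives a clearer message than the errors pdf2image or pytesseract raise when they cannot find their external tools.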

Usage

Place your PDF files in the input_pdf directory and update the COORDINATES dictionary to match the layout of your PDF.
Run the script:

python main.py

The processed data will be saved in the output_excel directory.
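
For orientation, here is one way a COORDINATES dictionary could be structured and used. The actual format in main.py may differ: the field names, the (left, top, right, bottom) pixel convention, and the helper functions below are all assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

# Hypothetical fields; replace with the regions of your own PDF layout.
COORDINATES: Dict[str, Box] = {
    "invoice_number": (100, 50, 400, 90),
    "date": (450, 50, 700, 90),
}


def invalid_boxes(coords: Dict[str, Box], page_width: int, page_height: int) -> List[str]:
    """Return names of boxes that are empty or fall outside the page."""
    bad = []
    for name, (left, top, right, bottom) in coords.items():
        inside = 0 <= left < right <= page_width and 0 <= top < bottom <= page_height
        if not inside:
            bad.append(name)
    return bad


def extract_fields(page_image, coords: Dict[str, Box]) -> Dict[str, str]:
    """OCR each region of a PIL image (such as a page from pdf2image)."""
    import pytesseract

    return {
        name: pytesseract.image_to_string(page_image.crop(box)).strip()
        for name, box in coords.items()
    }


def save_to_excel(rows: List[Dict[str, str]], out_path: str) -> None:
    """Write one row per page, with one column per field."""
    from openpyxl import Workbook

    workbook = Workbook()
    sheet = workbook.active
    sheet.append(list(COORDINATES))  # header row with the field names
    for row in rows:
        sheet.append([row.get(name, "") for name in COORDINATES])
    workbook.save(out_path)
```

Validating the boxes against the rendered page size before running OCR catches coordinates copied from a PDF rendered at a different DPI, which would otherwise silently crop the wrong region.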
