Training data generator for Text Detection and Text Recognition for docTR, EasyOCR, MMOCR, PaddleOCR and other OCR tools. Offers support for data augmentation and label drawing.

OCR dataset generator

This project is a tool for downloading and managing OCR datasets, combining online and local sources. It supports the creation of training data for text detection and text recognition for various OCR tools, and offers two modes of operation: data generation and label drawing.

The goal of this project is to speed up the creation of training data by converting all data into a single format; this tool will then take care of transforming it into the format required by the various OCR systems.

To add more datasets or generators, read the instructions below:

Supported OCR tools

Available datasets

Setup

# clone the repository
git clone https://github.com/xReniar/OCR-Dataset-Generator.git

# install the requirements:
cd OCR-Dataset-Generator
pip3 install -r requirements.txt

How to use

Below are the instructions on how to properly use this project. Each process requires correct configuration of the pipeline.yaml file, where you can specify key parameters such as the datasets to use, the character dictionary (dict), the tasks to perform, and more. Interactive examples (dropdown menus) are provided to guide you through the configuration, along with the commands needed to launch each process.

Note

If the process does not start, check the Error checking section.

Bounding Box Drawing Process

This process draws bounding boxes on images using annotation files. The annotations are loaded from the labels folder of each dataset in ./data. Before executing the drawing process, verify the following parameters in ./pipeline.yaml:

  • datasets: Selected dataset directories (relative to ./data) to process (set a dataset to y to select it). Datasets not present in the ./data folder will be downloaded first.
  • draw-process:
    • color: Bounding box color in BGR format (e.g., [0, 0, 255] for red).
    • thickness: Line width (in pixels).
  • dict: Path to a .txt file containing allowed characters. It acts as a filter: if the text does not contain any of the characters listed in the .txt file, the associated bounding box is not drawn. If this field is left empty, all bounding boxes are drawn.
  • workers: Number of parallel threads for processing (4 is recommended; the best value depends on the number of cores)
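The dict filter described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names and the one-character-per-line dictionary format are assumptions.

```python
def load_dict(path):
    # Assumed format: each non-empty line of the dict file holds one allowed character
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.rstrip("\n")}

def should_draw(text, allowed):
    # With no dict configured, every bounding box is drawn
    if not allowed:
        return True
    # Draw only if the text contains at least one allowed character
    return any(ch in allowed for ch in text)
```

A box labeled "xyz" would be skipped under an English dict that lacks those characters, while any box whose text shares at least one character with the dict is drawn.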

Tip

Example

Drawing the labels of the CORD dataset with black bounding boxes of thickness 2, using en_dict.txt:

draw-process:
    color: [0, 0, 0]
    thickness: 2

dict: ./dict/en_dict.txt
workers: 8

# it's possible to draw bounding boxes for multiple datasets
# just set to 'y' the dataset needed
datasets:
    cord: y
    ....

To start the drawing process run this command:

python main.py --draw

If the process terminates correctly, a cord folder (the name depends on the selected dataset) will appear inside ./draw.

Training data Generation Process

This process generates training data for the specified OCR tool. The annotations are loaded from the labels folder of each selected dataset. Before generating the training data, verify the following parameters in ./pipeline.yaml:

  • test-name: Name identifier for this training data generation process
  • ocr-system: The OCR system to generate training data for; the possible choices are listed here
  • augmentation: Whether data augmentation is applied (True or False), check data augmentation
  • tasks: Task for the training data, set to y the necessary tasks
  • dict: Path to the dictionary file used for the training data. It acts as a filter depending on the task (if left empty, all bounding boxes and text are included). When the text does not pass the filter:
    • for the detection task, the bounding box is not included in the generated data
    • for the recognition task, the text is not included in the generated data
  • workers: Number of parallel threads for processing (depends on the number of cores)
  • datasets: Selected dataset directories (relative to ./data) to use for training-data generation (set a dataset to y to select it). Datasets not present in the ./data folder will be downloaded first.

Tip

Example

Generating training data from the CORD and SROIE datasets for paddleocr. The data is for text detection and text recognition using en_dict.txt (no augmentation applied). The training data name is example-test:

test-name: example-test
ocr-system: paddleocr
augmentation: false

tasks:
    det: y
    rec: y

dict: ./dict/en_dict.txt
workers: 8

datasets:
    cord: y
    sroie: y

To start the generation process run this command:

python main.py --generate

If the process terminates correctly, an output folder will appear (read here for instructions on how to use the training data):

.
└── output
    └── example-test-paddleocr
        ├── Detection
        │   └── ....
        └── Recognition
            └── ....

Error checking

Before generating the training data or drawing the labels there is an error-checking step, which checks for missing labels, missing images, and wrong bounding box coordinates. If errors are found, a ./errors.json file is created with this structure:

{
    "dataset-name": {
        "missing_images": [],
        "missing_labels": [],
        "label_checking": {
            "path/to/label.txt": {
                "line": 34, 
                "text": "text",
                "bbox": []
            }
        }
    }
}
  • missing_images: contains the names of label files that do not have a corresponding image file in the images folder.
  • missing_labels: contains the names of images that do not have a corresponding label file in the labels folder.
  • label_checking: set of objects where the key is the path to the .txt file:
    • line: line of the .txt where the bounding box is wrong
    • text: text associated to the wrong bounding box
    • bbox: values of the bounding box
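The coordinate check reported under label_checking can be imagined as a simple per-entry validation along these lines. This is a hedged sketch, not the project's code: the [x1, y1, x2, y2] box format and the exact rules are assumptions.

```python
def check_bbox(bbox, img_w, img_h):
    # Assumed format: [x1, y1, x2, y2]. A box is valid when its coordinates
    # lie inside the image and it has positive width and height.
    if len(bbox) != 4:
        return False
    x1, y1, x2, y2 = bbox
    if x1 < 0 or y1 < 0 or x2 > img_w or y2 > img_h:
        return False
    return x1 < x2 and y1 < y2
```

An entry that fails this kind of check would end up in errors.json with its line number, text, and bbox values, as shown above.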

Data Augmentation

The data augmentation relies on Albumentations; check ./src/augmenter.py to add more augmentations. By default, only the blur operation is applied when augmentation is set to True:

self.transforms = {
    "blur": A.Blur(
        blur_limit=7,
        p=1.0 # notice the probability "p" set to 1.0
    )
}

This means that for each image in the generated data there is an additional image with the blur operation applied (img_1.png, img_1_blur.png, img_2.png, img_2_blur.png, etc.). To add more operations, for example a skew, do the following:

self.transforms = {
    "blur": A.Blur(
        blur_limit=7,
        p=1.0
    ),
    "skew": A.Affine(
        shear={"x": (-15, 15), "y": (-10, 10)},
        rotate=(-5, 5),
        scale=(0.9, 1.1),
        keep_ratio=True,
        p=0.7
    )
}

This means that for each image there will be an img_blur.png and an img_skew.png in the training data.

Warning

Some operations may produce empty images; when that happens, a cv2 error message will appear.

Future developments

  • Modify dataset to manage rotated text (?)
