This project is a tool for downloading and managing OCR datasets, combining online and local sources. It supports the creation of training data for text detection and text recognition for various OCR tools. The project offers two modes of operation: data generation and label drawing.
The goal of this project is to speed up the creation of training data by converting all data into a single format; this tool will then take care of transforming it into the format required by the various OCR systems.
To add more datasets or generators, read the instructions below:
- doctr: https://github.com/mindee/doctr
- easyocr: https://github.com/JaidedAI/EasyOCR
- mmocr: https://github.com/open-mmlab/mmocr
- paddleocr: https://github.com/PaddlePaddle/PaddleOCR
- yolo+trocr: YOLO, TrOCR
- CORD: https://paperswithcode.com/dataset/cord
- FUNSD: https://guillaumejaume.github.io/FUNSD/
- GNHK: https://www.goodnotes.com/gnhk
- IAM: https://fki.tic.heia-fr.ch/databases/iam-handwriting-database (Coming Soon)
- SROIE: https://paperswithcode.com/paper/icdar2019-competition-on-scanned-receipt-ocr
- WILDRECEIPT: https://paperswithcode.com/dataset/wildreceipt
- XFUND: https://github.com/doc-analysis/XFUND (de, es, fr, it, ja, pt, zh)
```bash
# clone the repository
git clone https://github.com/xReniar/OCR-Dataset-Generator.git

# install the requirements
cd OCR-Dataset-Generator
pip3 install -r requirements.txt
```

Below are the instructions on how to properly use this project. Each process requires correct configuration of the `pipeline.yaml` file,
where you can specify key parameters such as the datasets to use, the character dictionary (dict), the tasks to perform, and more.
Interactive examples (dropdown menus) are provided to guide you through the configuration, along with the commands needed to launch each process.
> **Note**
>
> If the process does not start, check the error-checking section.
This process draws bounding boxes on images using the annotation files. The annotations are loaded from the `labels` folder of each dataset in `./data`. Before executing the drawing process, verify the following parameters in `./pipeline.yaml`:
- `datasets`: selected dataset directories (relative to `./data`) to process (to select a dataset, set it to `y`). Datasets not present in the `./data` folder will be downloaded first.
- `draw-process`:
  - `color`: bounding box color in BGR format (e.g., `[255, 0, 0]` for blue).
  - `thickness`: line width (in pixels).
- `dict`: path to a `.txt` file containing the allowed characters. It acts as a filter: if the text does not contain any of the characters specified in the `.txt` file, the associated bounding box will not be drawn. If this field is left empty, all bounding boxes are drawn.
- `workers`: number of parallel threads for processing (4 is recommended; depends on the number of cores).
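The `dict` filter can be pictured as a simple character-membership check. The following is a minimal sketch, not the project's actual implementation (the function name and exact semantics are assumptions; see `./src` for the real code):

```python
def passes_dict_filter(text: str, dict_path: str) -> bool:
    """Hypothetical sketch of the dict filter: keep a bounding box
    only if its text shares at least one character with the
    allowed-character dictionary file (one character per line)."""
    if not dict_path:  # empty dict field: everything is kept
        return True
    with open(dict_path, encoding="utf-8") as f:
        allowed = {line.rstrip("\n") for line in f if line.rstrip("\n")}
    return any(ch in allowed for ch in text)
```

With an English dictionary file, a box whose text is entirely outside the dictionary (e.g., only unlisted symbols) would be skipped, while mixed text would still be drawn.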
> **Tip**
>
> **Example**: drawing the labels of the CORD dataset with black bounding boxes of thickness 2, using `en_dict.txt`:
```yaml
draw-process:
  color: [0, 0, 0]
  thickness: 2
dict: ./dict/en_dict.txt
workers: 8
# it's possible to draw bounding boxes for multiple datasets
# just set the needed datasets to 'y'
datasets:
  cord: y
  ....
```

To start the drawing process, run this command:

```bash
python main.py --draw
```

If the process terminates correctly, a `cord` folder (its name depends on the selected dataset) will appear inside `./draw`.
This process generates training data for the specified OCR system. The annotations are loaded from the `labels` folder of each selected dataset. Before generating the training data, verify the following parameters in `./pipeline.yaml`:
- `test-name`: name identifier for this training-data generation process.
- `ocr-system`: the OCR system the training data is generated for; the possible choices are listed here.
- `augmentation`: whether data augmentation is applied (`True` or `False`); check data augmentation.
- `tasks`: tasks for the training data; set the necessary tasks to `y`.
- `dict`: path to the dictionary file used for the training data. It acts as a filter depending on the task (if left empty, all bounding boxes and text will be included):
  - for the `detection` task, the bounding box will not be included in the generated data
  - for the `recognition` task, the text will not be included in the generated data
- `workers`: number of parallel threads for processing (depends on the number of cores).
- `datasets`: selected dataset directories (relative to `./data`) to use for training-data generation (to select a dataset, set it to `y`). Datasets not present in the `./data` folder will be downloaded first.
> **Tip**
>
> **Example**: generating training data from the CORD and SROIE datasets for paddleocr, for both text detection and text recognition, using `en_dict.txt` (no augmentation applied). The training-data name is `example-test`:
```yaml
test-name: example-test
ocr-system: paddleocr
augmentation: false
tasks:
  det: y
  rec: y
dict: ./dict/en_dict.txt
workers: 8
datasets:
  cord: y
  sroie: y
```

To start the generation process, run this command:

```bash
python main.py --generate
```

If the process terminates correctly, an `output` folder will appear (read here for instructions on how to use the training data):
```
.
└── output
    └── example-test-paddleocr
        ├── Detection
        │   └── ....
        └── Recognition
            └── ....
```
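As a point of reference for the `Detection` and `Recognition` folders, PaddleOCR's documented label formats pair each image path with either a JSON list of boxes (detection) or a plain transcription (recognition), separated by a tab. A hedged sketch of how one line of each could be produced (file names and the box are made up for illustration):

```python
import json

# Detection label line: <image path> \t <JSON list of boxes>.
# Each box has a "transcription" and four corner "points".
boxes = [{"transcription": "TOTAL",
          "points": [[10, 10], [80, 10], [80, 40], [10, 40]]}]
det_line = "img_1.png\t" + json.dumps(boxes)

# Recognition label line: <cropped image path> \t <text>.
rec_line = "crop_1.png\tTOTAL"
```

Each supported OCR system has its own expected layout; this tool's whole point is generating the right one from the unified internal format.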
Before generating the training data or drawing the labels, an error-checking step runs; it checks for missing labels, missing images, and invalid bounding-box coordinates. If any errors are found, a `./errors.json` file is created with this structure:
```json
{
    "dataset-name": {
        "missing_images": [],
        "missing_labels": [],
        "label_checking": {
            "path/to/label.txt": {
                "line": 34,
                "text": "text",
                "bbox": []
            }
        }
    }
}
```

- `missing_images`: the names of label files that do not have a corresponding image file in the `images` folder.
- `missing_labels`: the names of images that do not have a corresponding label file in the `labels` folder.
- `label_checking`: a set of objects where the key is the path to the `.txt` file:
  - `line`: line of the `.txt` file where the bounding box is wrong
  - `text`: text associated with the wrong bounding box
  - `bbox`: values of the bounding box
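The missing-file checks amount to comparing file stems between the two folders. A minimal sketch, assuming the `images`/`labels` layout described above (the function name is hypothetical; the real check lives in the project's source):

```python
from pathlib import Path

def find_missing(dataset_dir: str) -> dict:
    """Sketch of the missing-file part of the error check:
    compare file stems between the images/ and labels/ folders."""
    images = {p.stem for p in Path(dataset_dir, "images").glob("*") if p.is_file()}
    labels = {p.stem for p in Path(dataset_dir, "labels").glob("*") if p.is_file()}
    return {
        "missing_images": sorted(labels - images),  # label with no image
        "missing_labels": sorted(images - labels),  # image with no label
    }
```

The bounding-box validation (`label_checking`) would additionally parse each label line and flag coordinates that are negative or fall outside the image.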
Data augmentation relies on Albumentations; check `./src/augmenter.py` to add more augmentations. By default, only the blur operation is applied when `augmentation` is set to `True`:
```python
self.transforms = {
    "blur": A.Blur(
        blur_limit=7,
        p=1.0  # notice the probability "p" set to 1.0
    )
}
```

This means that for each image in the generated data there is another image with the blur operation applied (`img_1.png`, `img_1_blur.png`, `img_2.png`, `img_2_blur.png`, etc.). To add more operations, for example a skew, do this:
```python
self.transforms = {
    "blur": A.Blur(
        blur_limit=7,
        p=1.0
    ),
    "skew": A.Affine(
        shear={"x": (-15, 15), "y": (-10, 10)},
        rotate=(-5, 5),
        scale=(0.9, 1.1),
        keep_ratio=True,
        p=0.7
    )
}
```

This means that for each image there will be an `img_blur.png` and an `img_skew.png` in the training data.
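Assuming the suffixing behaviour described above, each key of `self.transforms` becomes a filename suffix. A small sketch of how the output names could be derived (the helper is hypothetical, not part of the project's API):

```python
from pathlib import Path

def augmented_names(image_path: str, transforms: dict) -> list:
    """Sketch: one extra output file per transform, with the
    transform's dictionary key appended to the image stem."""
    p = Path(image_path)
    return [f"{p.stem}_{name}{p.suffix}" for name in transforms]

# augmented_names("img_1.png", {"blur": ..., "skew": ...})
# yields ["img_1_blur.png", "img_1_skew.png"]
```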
> **Warning**
>
> Some operations can produce empty images; when that happens, a cv2 error message will appear.
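One way to guard against the empty-image case before saving is a simple validity check on the augmented array. A hedged sketch (the function is illustrative; the project itself only surfaces the cv2 error):

```python
import numpy as np

def is_valid_image(img) -> bool:
    """Sketch of a guard against empty augmentation output:
    reject None and arrays with zero pixels in any dimension."""
    return (img is not None
            and getattr(img, "size", 0) > 0
            and img.shape[0] > 0
            and img.shape[1] > 0)
```

Running such a check after each transform and skipping invalid results would avoid the cv2 error mentioned above.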
- Modify dataset to manage rotated text (?)
