Releases: itext/itext-pdfocr-java

pdfOCR 4.1.0

02 Sep 12:22

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

This release of pdfOCR brings a major change: a new built-in OCR engine. It adds the pdfocr-onnxtr module, which performs OCR using the OnnxTR library and has specific requirements for model predictors and resource management. The new engine significantly improves recognition accuracy for English text and other Latin-based languages.

The Open Neural Network Exchange (ONNX) is an open standard format for machine learning models, enabling interoperability across various frameworks and tools. OnnxTR is a Python OCR library that wraps the popular OCR tool docTR and adds support for ONNX models.

OnnxTR makes OCR processing faster and more accessible by using optimized ONNX models without requiring heavy frameworks. This allows OCR to be integrated into applications with minimal resource consumption, support for multiple platforms, a modular design, and lightweight dependencies. Using the existing pdfOCR API, we’ve simply added another OCR engine alongside the existing pdfocr-tesseract4 module.

Not only that, but pdfOCR now directly accepts PDF files as input. This can be a big benefit for OCR workflows, as it removes the need to first process PDFs with iText Core to extract the images from scanned documents.

You can find full details linked from the release notes on the iText Knowledge Base.

pdfOCR 4.0.2

15 May 11:08

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

There are no feature changes for this release. The only changes are to maintain compatibility with the iText Core 9.2.0 dependencies.

pdfOCR 4.0.1

14 Feb 13:09

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

This version improves memory usage when using the Tesseract 4 engine for OCR text extraction.

Bug fixes

  • Improved memory usage

pdfOCR 4.0.0

18 Nov 09:58

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

For this release, the version number has been bumped for compatibility with iText Core 9.0 and License Key Library 4.2.0.

In addition, it includes a fix for CVE-2024-47554, which resulted from the use of the Apache Commons IO library. This was resolved by updating the library from version 2.11.0 to 2.14.0.
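
If your build pins or overrides Commons IO, it may be worth confirming that the version it resolves is not older than the 2.14.0 named above. The following is only a minimal sketch of such a check: the class and method names are hypothetical, it is not part of pdfOCR or Commons IO, and it assumes plain dotted numeric versions without qualifiers such as -SNAPSHOT.

```java
public class CommonsIoVersionCheck {
    // Version named in the release note as the one pdfOCR 4.0.0 updated to.
    public static final String FIXED_VERSION = "2.14.0";

    /** Compares two dotted numeric versions, e.g. "2.11.0" and "2.14.0". */
    public static int compareVersions(String a, String b) {
        String[] left = a.split("\\.");
        String[] right = b.split("\\.");
        int length = Math.max(left.length, right.length);
        for (int i = 0; i < length; i++) {
            // Missing components count as 0, so "2.14" equals "2.14.0".
            int l = i < left.length ? Integer.parseInt(left[i]) : 0;
            int r = i < right.length ? Integer.parseInt(right[i]) : 0;
            if (l != r) {
                return Integer.compare(l, r);
            }
        }
        return 0;
    }

    /** Returns true when the given version is 2.14.0 or newer. */
    public static boolean isAtLeastFixedVersion(String resolvedVersion) {
        return compareVersions(resolvedVersion, FIXED_VERSION) >= 0;
    }

    public static void main(String[] args) {
        // Hypothetical sample versions; pass the version your build resolves.
        for (String version : new String[] {"2.11.0", "2.14.0", "2.9.0"}) {
            System.out.println(version + " -> " + isAtLeastFixedVersion(version));
        }
    }
}
```

The components are compared numerically rather than as strings, because a plain string comparison would wrongly rank 2.9.0 above 2.14.0.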

Bug fixes

pdfOCR 3.0.2

07 Feb 14:31

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

In this release, we’ve added support for intelligently recognizing table data and converting it into the correct tag structure in the resulting PDF documents.

A bug that caused an incorrect font size to be selected for particularly small text was also fixed.

New features

  • Table recognition support

Bug fixes

  • Incorrect font size for small text in the PDFs generated with pdfOCR

pdfOCR 3.0.1

25 Oct 15:08

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

For this release, the artifact names have been changed to reflect the new naming structure. In addition, since Bouncy Castle is a dependency for tests, the .NET version has been updated to use the latest version, 2.2.1.

Improvements

  • Updated .NET Bouncy Castle dependency to 2.2.1

pdfOCR 3.0.0

10 May 12:43

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

This release resolves an incompatibility between JDK19 and the Leptonica library which could result in a MethodTooLargeException. Otherwise, this release is for compatibility with the iText Core version 8.x.x release.

Bug fixes

  • Resolved incompatibility issue with JDK19 and Leptonica library

pdfOCR 2.0.2

25 Oct 10:01

For this release of our OCR add-on for iText 7, we have upgraded the underlying tess4j library to version 4.6.0, which uses version 4.1.3 of the Tesseract OCR engine and version 1.82.0 of the Leptonica image processing and analysis library.

A small note for users encountering a MethodTooLargeException with JDK19 and pdfOCR: there is currently an issue with the Leptonica library and JDK19. See this issue for more information and a possible solution.

Improvements

  • Updated tess4j:tess4j from 4.5.5 to 4.6.0, which brings in Tesseract 4.1.3 (f38e7a7) and Leptonica 1.82.0 (lept4j-1.16.1)

pdfOCR 2.0.1

11 Jan 13:29

This maintenance release updates tess4j, the underlying glue between pdfOCR and Tesseract, to version 4.5.5. There is not much to write home about, but we want to keep track of these underlying version updates so we are ready for when bigger changes come about.

Improvements

  • Upgraded tess4j to 4.5.5

pdfOCR 2.0.0

25 Oct 14:06

The pdfOCR 2.0.0 release brings support for the new Unified License Mechanism, along with the other products in the iText 7 Suite, and removes some deprecated API methods.

As the icing on the cake, though, it benefits from all the improvements featured in iText 7 Core 7.2.0.

Breaking Changes

  • Removed deprecated methods from API

New Features

  • Unified License Mechanism