Address problems with colour conversion in ocrmypdf for certain PDFs #679

@philmcmahon

We've seen 13 instances of PDFs with a colour space that causes the error below. To handle this, I think we would have to run ocrmypdf, detect this type of failure, and if it occurs, retry with --color-conversion-strategy set to RGB (or similar).

However, 13 is not many, so this may not be a priority.

Error in 'OcrMyPdfExtractor processing ArbVREi_dU0nQSH-GoRwgq1dRqSEXvtVh_QiycqaCKmADPSJnbgjf3bCWI6l3KZdx1Mas2U_jVveS5j1nofZTw': java.lang.IllegalStateException: OcrMyPdfExtractor error openjpeg warning: unspec CS. 1 component so assuming gray.
openjpeg warning: unspec CS. 1 component so assuming gray.
openjpeg warning: unspec CS. 1 component so assuming gray.
Start processing 2 pages concurrently
    1 redoing OCR
    2 redoing OCR
    3 redoing OCR
    4 redoing OCR
Postprocessing...
ColorConversionNeededError: The input PDF has an unusual color space. Use
--color-conversion-strategy to convert to a common color space
such as RGB, or use --output-type pdf to skip PDF/A conversion
and retain the original color space.

	at extraction.ocr.BaseOcrExtractor.extract(BaseOcrExtractor.scala:36)
	at extraction.FileExtractor.extract(FileExtractor.scala:22)
	at extraction.Worker.safeInvokeExtractor(Worker.scala:150)
	at extraction.Worker.$anonfun$executeBatch$3(Worker.scala:94)
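
For illustration, the detect-and-retry idea from the description could look roughly like the sketch below. It is written in Python for brevity and is not the extractor's actual code: the function names, the `runner` parameter, and the assumption that ocrmypdf exits non-zero and prints `ColorConversionNeededError` to stderr are all hypothetical choices made for this example. The only details taken from the output above are the error name and the `--color-conversion-strategy` flag.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# Marker copied from the ocrmypdf output quoted in this issue.
COLOR_ERROR_MARKER = "ColorConversionNeededError"

# A runner takes a command line and returns (exit code, stderr text).
# Injecting it keeps the retry logic testable without ocrmypdf installed.
Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run_command(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run a command and capture its stderr as text."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stderr


def ocr_with_color_fallback(
    input_pdf: str,
    output_pdf: str,
    extra_args: Sequence[str] = (),
    runner: Runner = run_command,
) -> bool:
    """Run ocrmypdf; on a colour space failure, retry once converting to RGB.

    Returns True if the RGB retry was needed, False if the first run succeeded.
    Raises RuntimeError for any other failure.
    """
    base = ["ocrmypdf", *extra_args]

    # First attempt: unchanged options.
    code, stderr = runner([*base, input_pdf, output_pdf])
    if code == 0:
        return False

    # Assumption: the colour space failure is recognisable from stderr.
    if COLOR_ERROR_MARKER not in stderr:
        raise RuntimeError(f"ocrmypdf failed: {stderr}")

    # Second attempt: ask ocrmypdf to convert to RGB, as the error suggests.
    retry_cmd = [*base, "--color-conversion-strategy", "RGB", input_pdf, output_pdf]
    code, stderr = runner(retry_cmd)
    if code != 0:
        raise RuntimeError(f"ocrmypdf failed after RGB retry: {stderr}")
    return True
```

The same shape should carry over to the Scala extractor: only retry when the failure is recognisably the colour space one, and surface every other error as before. The error message also offers `--output-type pdf` as an alternative if skipping PDF/A conversion is acceptable.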
