I had been using pdfsandwich to
create searchable PDFs from non-searchable PDFs. However, it's a pain to collect
all the dependencies if e.g. you don't have root access. So I thought to package
them up with Julia's BinaryBuilder to make installation simple. However, I
wasn't able to cross-compile pdfsandwich itself. But since tesseract is doing
the hard work anyway, I thought I would just write the glue script myself. It
turns out there are several of these already.
I have likely diverged from the pdfsandwich implementation, since I haven't
used ImageMagick's convert, which is one of pdfsandwich's dependencies. The job
can be done very simply, e.g.
- convert each page of the PDF to an image
- possibly clean it up with unpaper
- use tesseract to create a single-page searchable PDF
- combine the PDFs,
so I decided not to look at the source of pdfsandwich when creating my
implementation. That way I can stick to an MIT license, which is the usual one
in the Julia community.
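The steps listed above can be sketched as a short shell script. This is only an illustration of the idea, not how SearchablePDFs.jl is implemented: it assumes pdftoppm and pdfunite (from poppler-utils), unpaper, and tesseract are all on your PATH, and the function name make_searchable and its third argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Assumed tools: pdftoppm and pdfunite (poppler-utils), unpaper,
# and tesseract. The package itself may use different tools internally.
set -euo pipefail

# make_searchable INPUT.pdf OUTPUT.pdf [yes|no]   (hypothetical helper)
make_searchable() {
    local input="$1" output="$2" clean="${3:-no}"
    local tmp page base
    local -a pdfs=()
    tmp="$(mktemp -d)"

    # 1. Convert each page of the PDF to an image (PPM files at 300 dpi).
    pdftoppm -r 300 "$input" "$tmp/page"

    for page in "$tmp"/page-*.ppm; do
        base="${page%.ppm}"
        # 2. Optionally clean up the scanned page with unpaper.
        if [ "$clean" = "yes" ]; then
            unpaper "$page" "$base-clean.ppm"
            page="$base-clean.ppm"
        fi
        # 3. OCR the page; tesseract writes a one-page PDF to "$base.pdf".
        tesseract "$page" "$base" pdf
        pdfs+=("$base.pdf")
    done

    # 4. Combine the single-page PDFs, in page order, into the output file.
    pdfunite "${pdfs[@]}" "$output"
    rm -rf "$tmp"
}

# Usage: make_searchable scanned.pdf searchable.pdf yes
```

Each page is processed on its own, so the loop body is the natural place to parallelize or to swap in a different cleanup tool.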
It more-or-less works on macOS (both Intel and Apple Silicon) and Linux.
Next steps:
- Allow choice of training data used for tesseract
- Robustify and test on more files
- Add better tests?
    using SearchablePDFs
    file = ocr("test/test_rasterized.pdf")

The package supports @main and, on Julia v1.12, provides an app called searchable-pdf.
If you use juliaup, you can install 1.12 with juliaup add nightly, then run

    JULIA_LOAD_PATH="@:@stdlib" julia +nightly --startup-file=no -e 'using Pkg; Pkg.activate(temp=true); Pkg.Apps.add(url="https://github.com/ericphanson/SearchablePDFs.jl")'

to install a CLI executable, searchable-pdf, to the bin directory in your Julia depot (~/.julia by default). You will likely need to add your bin directory to your PATH, e.g.

    export PATH="/Users/eph/.julia/bin:$PATH"

which can go in a shell startup script (e.g. ~/.bashrc or ~/.zshrc).
You can re-run the install command to update the app.
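If you would rather not hard-code a home directory in the PATH line, something like the following could go in the startup script instead. It is a sketch under a few assumptions: the function name add_julia_bin_to_path is made up for this example, only the first entry of JULIA_DEPOT_PATH is considered, and ~/.julia is used when that variable is unset or starts with an empty entry.

```shell
# Hypothetical helper: put the Julia depot's bin directory on PATH exactly once.
add_julia_bin_to_path() {
    local depot="${JULIA_DEPOT_PATH:-}"
    depot="${depot%%:*}"                 # keep only the first depot entry
    [ -n "$depot" ] || depot="$HOME/.julia"
    local bin="$depot/bin"
    case ":$PATH:" in
        *":$bin:"*) ;;                   # already on PATH, nothing to do
        *) PATH="$bin:$PATH" ;;
    esac
    export PATH
}
add_julia_bin_to_path
```

Because the function checks before prepending, sourcing the startup script repeatedly does not keep growing PATH.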