This is the Tika backend for mu-search. It is meant to be used together with that microservice; see the mu-search README for usage instructions.
This image is based on Tika and includes the components which are commonly expected to be available for mu-search.
Add the Tika service to docker-compose.yml
services:
  tika:
    image: semtech/mu-search-tika-backend:1.0.0
Restart the stack using docker compose up -d. The tika service will be created.
Generate a default Tika config
mu script tika generate-config
Mount the generated config folder in /config
services:
  tika:
    image: semtech/mu-search-tika-backend:1.0.0
    volumes:
      - ./config/tika:/config
(Re)create the service using docker compose up -d tika.
To increase the Java heap space, mount a custom config tika-config.xml and configure the server's forkedJvmArgs:
<properties>
  <server>
    <params>
      <forkedJvmArgs>
        <arg>-Xms4g</arg>
        <arg>-Xmx4g</arg>
      </forkedJvmArgs>
    </params>
  </server>
</properties>
To disable the OCR parser, mount a custom config tika-config.xml containing the following content:
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
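
The heap and OCR settings shown above each assume a tika-config.xml of their own. Since only one tika-config.xml is mounted in /config, a setup that needs both would presumably merge them under a single properties root. The following is a minimal sketch of such a merged file, assuming the two sections do not interfere with each other; the 4g values are simply the ones from the heap example and should be adapted to the host.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: combines the two snippets from this README in one file. -->
<properties>
  <!-- Parser section: use the default parsers, minus Tesseract OCR -->
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
  <!-- Server section: JVM arguments for the forked Tika process -->
  <server>
    <params>
      <forkedJvmArgs>
        <arg>-Xms4g</arg>
        <arg>-Xmx4g</arg>
      </forkedJvmArgs>
    </params>
  </server>
</properties>
```

After changing the file, the tika service has to be recreated for the new config to be picked up.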
Override the default Tika config by mounting a folder containing tika-config.xml in /config, as explained in 'How to customize the Tika config'.
All config options are documented in the official Tika documentation.
As of v2, Tika has out-of-the-box support for performing automatic OCR on PDF documents. The official Tika Docker image provides support for:
- English
- French
- German
- Italian
- Spanish
- Japanese
This image adds support for:
- Dutch
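
Having a language pack installed does not by itself mean OCR runs in that language. As a hedged illustration, the sketch below sets the OCR language in tika-config.xml, assuming the Tesseract parser's `language` parameter as described in the Tika 2.x OCR documentation and the Tesseract language code `nld` for Dutch. Neither is stated in this README, so verify both against the Tika version in use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the "language" param and the "nld" code are assumptions. -->
<properties>
  <parsers>
    <!-- Exclude Tesseract from the defaults so it can be configured separately -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- Re-add Tesseract with Dutch as the OCR language -->
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="language" type="string">nld</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Tesseract conventionally accepts several languages joined with a plus sign (for example `eng+nld`), which may be useful for mixed-language document sets.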