mu-search-tika-backend

This is the Tika backend for mu-search. It is meant to be used together with that microservice; see the mu-search README for how to set up and use this component.

This image is based on Tika and includes the components which are commonly expected to be available for mu-search.

Tutorials

Add Tika to your stack

Add the Tika service to docker-compose.yml

services:
  tika:
    image: semtech/mu-search-tika-backend:1.0.0

Restart the stack using docker compose up -d. The tika service will be created.

How-to guides

How to customize the Tika config

Generate a default Tika config

mu script tika generate-config

Mount the generated config folder in /config

services:
  tika:
    image: semtech/mu-search-tika-backend:1.0.0
    volumes:
    - ./config/tika:/config

(Re)create the service using docker compose up -d tika.

How to increase the Java heap space

To increase the Java heap space, mount a custom tika-config.xml and configure the server's forkedJvmArgs:

<properties>
  <server>
    <params>
      <forkedJvmArgs>
        <arg>-Xms4g</arg>
        <arg>-Xmx4g</arg>
      </forkedJvmArgs>
    </params>
  </server>
</properties>

How to disable the OCR parser

To disable the OCR parser, mount a custom tika-config.xml with the following content:

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
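
Only one tika-config.xml is mounted in /config, so the heap settings and the OCR exclusion from the two guides above have to live in the same file if both are wanted. A minimal sketch that simply merges the two snippets shown above; the 4g heap values are the same illustrative values used earlier, and the order of the `parsers` and `server` sections is assumed not to matter:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- Keep the default parser set, minus the Tesseract OCR parser -->
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
  <!-- JVM arguments for the forked Tika server process -->
  <server>
    <params>
      <forkedJvmArgs>
        <arg>-Xms4g</arg>
        <arg>-Xmx4g</arg>
      </forkedJvmArgs>
    </params>
  </server>
</properties>
```

After editing the file, recreate the service with docker compose up -d tika so the new config is picked up.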

Reference

Tika configuration

Overwrite the default Tika config by mounting a folder containing tika-config.xml in /config as explained in 'How to customize the Tika config'.

All config options are documented in the official Tika documentation.

OCR support

As of v2, Tika has out-of-the-box support for automatic OCR on PDF documents. The official Tika Docker images provide support for:

English
French
German
Italian
Spanish
Japanese

This image additionally adds support for:

Dutch
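
Which of the installed languages OCR actually uses can usually be set in tika-config.xml. The sketch below is not taken from this repository: it assumes the Tika 2.x parameter syntax for TesseractOCRParser and the Tesseract language codes `eng` and `nld` (Dutch), joined with `+` to allow both. Check the official Tika documentation before relying on it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Exclude the OCR parser from the defaults so it can be configured separately -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- Assumed parameter name and values: recognize English and Dutch text -->
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="language" type="string">eng+nld</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Mount this file in /config as described in 'How to customize the Tika config'.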
