Skip to content

fix(extractor): make TikaExtractor System.out muting thread-safe and avoid double temp file#163

Merged
marevol merged 3 commits intomasterfrom
fix/extractor-tika-improvements
May 5, 2026
Merged

fix(extractor): make TikaExtractor System.out muting thread-safe and avoid double temp file#163
marevol merged 3 commits intomasterfrom
fix/extractor-tika-improvements

Conversation

@marevol
Copy link
Copy Markdown
Contributor

@marevol marevol commented May 4, 2026

Summary

  • Replace per-call System.setOut/System.setErr swap with a ref-counted mute/unmute under a class lock so concurrent extractions cannot race and leave the JVM streams permanently redirected.
  • When the deferred output spills to disk, read it back via TikaInputStream.get(Path) instead of letting Tika spool the bytes a second time inside TikaDetectParser.parse.
  • Add setMuteSystemStreams(boolean) (default true) for users who want the original output preserved (e.g., debugging).
  • Tighten error messages to key=value format and stop double-wrapping the bomb-detection ExtractException.

Why

Some Tika-bundled parsers print to System.out/System.err during parsing. The original mute logic is intentional, but the un-synchronized swap was unsafe under concurrent crawls — once two threads raced through the swap, the original streams could be lost. Also, on large inputs the DeferredFileOutputStream spilled to disk and Tika immediately re-spooled into a second temp file (apache-tika-*) on top of the existing tikaExtractor-* staging file.

The new implementation only synchronizes the swap itself, not the extraction work, so concurrent extractions are not serialized. Wrapping the staging file with TikaInputStream.get(Path) lets Tika's internal TikaInputStream.get(stream, tmp, metadata) short-circuit (since the input is already a TikaInputStream), reusing the existing file path.

Tests

  • Concurrent extraction (16 threads x 8 iterations) does not corrupt System.out/System.err.
  • Streams restored on exception (both null-input fast path and broken-stream during-mute path).
  • DFOS-on-disk path no longer creates a second apache-tika-* temp file.
  • DFOS-in-memory path creates no temp file at all.
  • setMuteSystemStreams(false) leaves the streams alone throughout extraction.

Test plan

  • CI green
  • Manual review of mute/unmute ref-count
  • Run test suite under high concurrency locally

marevol added 3 commits May 5, 2026 07:47
…avoid double temp file

- Replace per-call `System.setOut`/`System.setErr` swap with a ref-counted
  mute/unmute under a class-level lock. The previous implementation could
  race under concurrent crawls and leave the JVM streams permanently
  redirected; the new implementation only synchronizes the swap itself
  (not the extraction work), so concurrent extractions are not serialized.
- When the on-disk staging file exists, open it via
  `TikaInputStream.get(Path)` so that Tika's internal `TikaInputStream.get`
  in `TikaDetectParser.parse` reuses the existing file rather than
  spooling the bytes into a second temp file.
- Add `setMuteSystemStreams(boolean)` (default `true`) so callers can opt
  out of muting when debugging.
- Tighten error messages to `key=value` format and avoid double-wrapping
  the bomb-detection ExtractException.

Tests:
- concurrent extractions do not corrupt System.out/System.err
- streams are restored on exception (both pre-mute and during-mute paths)
- on-disk staging path no longer creates a second `apache-tika-*` temp file
- in-memory path creates no temp file
- `setMuteSystemStreams(false)` leaves the streams alone
The PR #163 capture/replay path uses Charset.defaultCharset() because
PrintStream(out, true) wraps the JVM default charset; the configurable
outputEncoding intentionally does not apply there to avoid lossy
substitution of non-ASCII Tika/PDFBox/POI diagnostics. Document this
on the field, the muteSystemStreams/replayCapturedBytes pair, and the
public setter, and lock the contract down with a round-trip test that
sets outputEncoding to ISO-8859-1 and verifies non-ASCII bytes still
replay correctly.
@marevol marevol merged commit 9374c03 into master May 5, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant