…avoid double temp file

- Replace per-call `System.setOut`/`System.setErr` swap with a ref-counted mute/unmute under a class-level lock. The previous implementation could race under concurrent crawls and leave the JVM streams permanently redirected; the new implementation only synchronizes the swap itself (not the extraction work), so concurrent extractions are not serialized.
- When the on-disk staging file exists, open it via `TikaInputStream.get(Path)` so that Tika's internal `TikaInputStream.get` in `TikaDetectParser.parse` reuses the existing file rather than spooling the bytes into a second temp file.
- Add `setMuteSystemStreams(boolean)` (default `true`) so callers can opt out of muting when debugging.
- Tighten error messages to `key=value` format and avoid double-wrapping the bomb-detection ExtractException.

Tests:
- concurrent extractions do not corrupt System.out/System.err
- streams are restored on exception (both pre-mute and during-mute paths)
- on-disk staging path no longer creates a second `apache-tika-*` temp file
- in-memory path creates no temp file
- `setMuteSystemStreams(false)` leaves the streams alone
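The ref-counted mute/unmute described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the class name `StreamMuteSketch`, the method names `mute`, `unmute`, and `withMutedStreams`, and the use of a null sink are all assumptions made for the example. The point it shows is that only the stream swap happens under the class-level lock, while the extraction work runs outside it.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.function.Supplier;

/** Hypothetical sketch of a ref-counted System.out/System.err mute. */
final class StreamMuteSketch {
    private static final Object LOCK = new Object();
    private static int muteCount = 0;     // extractions currently holding the mute
    private static PrintStream savedOut;  // originals, saved by the first holder
    private static PrintStream savedErr;

    /** First caller swaps the streams; later callers only bump the count. */
    static void mute() {
        synchronized (LOCK) {
            if (muteCount++ == 0) {
                savedOut = System.out;
                savedErr = System.err;
                PrintStream sink = new PrintStream(OutputStream.nullOutputStream());
                System.setOut(sink);
                System.setErr(sink);
            }
        }
    }

    /** Last caller to leave restores the original streams. */
    static void unmute() {
        synchronized (LOCK) {
            if (muteCount > 0 && --muteCount == 0) {
                System.setOut(savedOut);
                System.setErr(savedErr);
                savedOut = null;
                savedErr = null;
            }
        }
    }

    /** Runs the work outside the lock; the finally block restores on exception. */
    static <T> T withMutedStreams(Supplier<T> work) {
        mute();
        try {
            return work.get();
        } finally {
            unmute();
        }
    }
}
```

Because the count only reaches zero when the last concurrent holder leaves, no thread can save an already-redirected stream as the "original", which is the failure mode the commit message attributes to the unsynchronized swap.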
The PR #163 capture/replay path uses Charset.defaultCharset() because PrintStream(out, true) wraps the JVM default charset; the configurable outputEncoding intentionally does not apply there to avoid lossy substitution of non-ASCII Tika/PDFBox/POI diagnostics. Document this on the field, the muteSystemStreams/replayCapturedBytes pair, and the public setter, and lock the contract down with a round-trip test that sets outputEncoding to ISO-8859-1 and verifies non-ASCII bytes still replay correctly.
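The charset reasoning above can be illustrated with a small round trip. This is a hedged sketch using only the JDK: the class `CaptureReplaySketch` and its `capture` and `replay` methods are hypothetical names for this example and do not reflect the PR's `replayCapturedBytes` implementation. It relies on the documented behavior that `PrintStream(OutputStream, boolean)` encodes characters with the JVM default charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;

/** Hypothetical sketch of capturing diagnostics and decoding them for replay. */
final class CaptureReplaySketch {
    /** Captures text through a PrintStream built like PrintStream(out, true). */
    static byte[] capture(String message) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // This constructor encodes characters with Charset.defaultCharset().
        PrintStream capture = new PrintStream(buffer, true);
        capture.print(message);
        capture.flush();
        return buffer.toByteArray();
    }

    /** Decodes captured bytes with the given charset. */
    static String replay(byte[] captured, Charset charset) {
        return new String(captured, charset);
    }

    public static void main(String[] args) {
        byte[] bytes = capture("caf\u00e9");
        // Decoding with the same charset that encoded the bytes preserves the text,
        // provided the default charset can represent the characters at all.
        System.out.println(replay(bytes, Charset.defaultCharset()));
    }
}
```

Decoding the captured bytes with any charset other than the one that encoded them (for example ISO-8859-1 when the JVM default is UTF-8) would garble non-ASCII characters, which is why a configurable output encoding should not be applied on this path.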
Summary
- Replace the per-call `System.setOut`/`System.setErr` swap with a ref-counted mute/unmute under a class lock so concurrent extractions cannot race and leave the JVM streams permanently redirected.
- When the on-disk staging file exists, open it via `TikaInputStream.get(Path)` instead of letting Tika spool the bytes a second time inside `TikaDetectParser.parse`.
- Add `setMuteSystemStreams(boolean)` (default `true`) for users who want the original output preserved (e.g., debugging).
- Tighten error messages to `key=value` format and stop double-wrapping the bomb-detection `ExtractException`.

Why
Some Tika-bundled parsers print to `System.out`/`System.err` during parsing. The original mute logic is intentional, but the unsynchronized swap was unsafe under concurrent crawls: once two threads raced through the swap, the original streams could be lost. Also, on large inputs the `DeferredFileOutputStream` spilled to disk and Tika immediately re-spooled into a second temp file (`apache-tika-*`) on top of the existing `tikaExtractor-*` staging file.

The new implementation only synchronizes the swap itself, not the extraction work, so concurrent extractions are not serialized. Wrapping the staging file with `TikaInputStream.get(Path)` lets Tika's internal `TikaInputStream.get(stream, tmp, metadata)` short-circuit (since the input is already a `TikaInputStream`), reusing the existing file path.

Tests
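- Concurrent extractions do not corrupt `System.out`/`System.err`.
- Streams are restored on exception (both pre-mute and during-mute paths).
- The on-disk staging path no longer creates a second `apache-tika-*` temp file.
- The in-memory path creates no temp file.
- `setMuteSystemStreams(false)` leaves the streams alone throughout extraction.

Test plan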