fix(extractor): rewrite PDF extraction cancellation with ExecutorService#158
Merged
fix(extractor): rewrite PDF extraction cancellation with ExecutorService#158
Conversation
…tion cancellation The previous PdfExtractor.getText() relied on a hand-rolled worker thread plus a 100 ms-busy-loop interrupt pattern guarded by AtomicBoolean. That approach has several problems: * Busy-looping wastes CPU until the timeout elapses. * PDFBox does not always honour Thread.interrupt(), so the worker can keep running after the timeout and touch the PDDocument that the caller's try-with-resources is about to close, surfacing as a secondary "COSStream is closed" failure. * The worker thread is not guaranteed to be a daemon, so a runaway worker could prevent JVM shutdown. * Exceptions from the worker were funneled through a HashSet<Exception> holder, an error-prone pattern. This change rewrites the cancellation path on top of an ExecutorService: * Each call submits the extraction work to a per-call single-thread executor built from a shared daemon ThreadFactory. * Future.get(timeout, ms) replaces the busy-loop. TimeoutException -> cancel + ExtractException; ExecutionException is unwrapped; InterruptedException re-interrupts the calling thread before throwing ExtractException. * On the way out, the executor is shutdownNow()-ed and we awaitTermination for a configurable cancelGracePeriodMs (default 2000 ms) BEFORE the try-with-resources closes the PDDocument. This eliminates the COSStream-is-closed race when PDFBox does not honour the interrupt promptly. * createStripper() is now a protected factory method so tests can inject a custom PDFTextStripper that simulates slow extraction. * setDaemonThread is kept as a deprecated no-op for source compatibility; worker threads are always daemons now. New tests cover timeout-throws-ExtractException, post-cancellation reuse of the same extractor instance, and propagation when the calling thread is interrupted while waiting on Future.get.
…mpat and tone down race claim
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PdfExtractor's hand-rolled worker thread plusThreadUtil.sleep(100)busy-loop with anExecutorService+Future.get(timeout)based cancellation path.COSStream is closedrace that surfaced when PDFBox did not honourThread.interrupt(): the executor isshutdownNow-ed andawaitTermination-ed for a configurable grace period (default 2000 ms) before the try-with-resources closes thePDDocument.What changed
PdfExtractor.getText()now submits the extraction work to a per-call single-thread executor built from a shared daemonThreadFactory.TimeoutException->future.cancel(true)+ExtractException("PDFBox process cannot finish in N ms.")ExecutionExceptionis unwrapped and rethrown asExtractException(preserves any nestedExtractException).InterruptedExceptionre-interrupts the calling thread before throwingExtractException.createStripper()so tests can inject a slowPDFTextStripper(and so subclasses can customise stripping behaviour).cancelGracePeriodMs(default2000L) with getter/setter to tune how long we wait for the worker to stop aftershutdownNow()before closing thePDDocument.setDaemonThread(boolean)is kept as a deprecated no-op for source compatibility; worker threads are always daemons now.HashSet,Set,AtomicBoolean, andThreadUtilimports.Tests
Added three tests in
PdfExtractorTest:test_extractionTimeout_throwsExtractException- injects a stripper that sleeps 60s, sets timeout to 100 ms, assertsExtractExceptionwith the expected message and that the worker observed the interrupt.test_extractionCancellation_releasesThread- times out once, then verifies a subsequent extraction on the same instance succeeds (proving the executor was shut down cleanly and no resources leaked).test_extractionInterrupt_propagates- runs the extractor on a separate thread, interrupts the calling thread mid-Future.get, assertsExtractExceptionis thrown and the calling thread's interrupt status is preserved.Test plan
mvn -pl fess-crawler test -Dtest='PdfExtractorTest'(9/9 tests pass)mvn -pl fess-crawler test(only pre-existing failures remain:JodExtractorTestrequires LibreOffice;Hc4HttpClientTest#test_doHead_accessTimeoutTargetwas already failing onmaster; Docker-dependentSmbClientTest/StorageClientTest/GcsClientTest/S3ClientTesterrors)mvn formatter:format && mvn license:format(no diff)