
fix(extractor): rewrite PDF extraction cancellation with ExecutorService #158

Merged
marevol merged 2 commits into master from fix/extractor-pdf-cancellation on May 5, 2026

marevol commented May 4, 2026

Summary

  • Replaces PdfExtractor's hand-rolled worker thread plus ThreadUtil.sleep(100) busy-loop with a cancellation path based on ExecutorService and Future.get(timeout).
  • Eliminates the "COSStream is closed" race that surfaced when PDFBox did not honour Thread.interrupt(): the executor is stopped with shutdownNow() and then given a configurable grace period (default 2000 ms) via awaitTermination() before the try-with-resources closes the PDDocument.
  • Worker threads are now always daemons, so a runaway PDFBox call cannot hold up JVM shutdown.
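The ordering described above (stop the worker first, close the document second) can be illustrated with a minimal sketch. It does not use PDFBox: FakeDocument is a hypothetical stand-in for PDDocument, and CloseOrderingSketch, run, and the event strings are names invented for this illustration, not code from the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class CloseOrderingSketch {
    // Hypothetical stand-in for PDDocument; it only records when it is closed.
    static class FakeDocument implements AutoCloseable {
        private final List<String> events;

        FakeDocument(List<String> events) {
            this.events = events;
        }

        @Override
        public void close() {
            events.add("document closed");
        }
    }

    static List<String> run(long cancelGracePeriodMs) {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        try (FakeDocument doc = new FakeDocument(events)) {
            ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "close-ordering-sketch");
                t.setDaemon(true); // daemon worker cannot block JVM exit
                return t;
            });
            CountDownLatch started = new CountDownLatch(1);
            try {
                executor.submit(() -> {
                    started.countDown();
                    try {
                        Thread.sleep(60_000L); // simulates slow extraction
                    } catch (InterruptedException e) {
                        events.add("worker stopped");
                    }
                });
                // Make sure the worker is running before we cancel it.
                started.await(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                // Interrupt the worker and wait for it BEFORE leaving the
                // try-with-resources body, i.e. before close() runs.
                executor.shutdownNow();
                try {
                    executor.awaitTermination(cancelGracePeriodMs, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        return events;
    }
}
```

Because the inner finally block runs before the try-with-resources statement closes its resource, the worker has a chance to exit before the document is closed. If the worker ignores the interrupt for longer than the grace period, the document is still closed afterwards.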

What changed

  • PdfExtractor.getText() now submits the extraction work to a per-call single-thread executor built from a shared daemon ThreadFactory.
    • TimeoutException -> future.cancel(true) + ExtractException("PDFBox process cannot finish in N ms.")
    • ExecutionException is unwrapped and rethrown as ExtractException (preserves any nested ExtractException).
    • InterruptedException re-interrupts the calling thread before throwing ExtractException.
  • New protected factory method createStripper() so tests can inject a slow PDFTextStripper (and so subclasses can customise stripping behaviour).
  • New field cancelGracePeriodMs (default 2000L) with getter/setter to tune how long we wait for the worker to stop after shutdownNow() before closing the PDDocument.
  • setDaemonThread(boolean) is kept as a deprecated no-op for source compatibility; worker threads are always daemons now.
  • Removed unused HashSet, Set, AtomicBoolean, and ThreadUtil imports.
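A rough sketch of the exception mapping listed above, assuming a generic Callable in place of the real stripping work. SketchExtractException and TimedExtractionSketch are hypothetical names used only here, and the two messages other than the timeout message are placeholders; the actual PdfExtractor code may differ.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the crawler's ExtractException.
class SketchExtractException extends RuntimeException {
    SketchExtractException(String message, Throwable cause) {
        super(message, cause);
    }
}

class TimedExtractionSketch {
    // Shared factory: every worker is a daemon thread.
    private static final ThreadFactory DAEMON_FACTORY = r -> {
        Thread t = new Thread(r, "pdf-extract-sketch");
        t.setDaemon(true);
        return t;
    };

    static String run(Callable<String> work, long timeoutMs, long cancelGracePeriodMs) {
        // One single-thread executor per call.
        ExecutorService executor = Executors.newSingleThreadExecutor(DAEMON_FACTORY);
        try {
            Future<String> future = executor.submit(work);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the worker
                throw new SketchExtractException("PDFBox process cannot finish in " + timeoutMs + " ms.", e);
            } catch (ExecutionException e) {
                Throwable cause = e.getCause();
                if (cause instanceof SketchExtractException) {
                    throw (SketchExtractException) cause; // preserve nested exception
                }
                throw new SketchExtractException("Extraction failed.", cause);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the caller's flag
                throw new SketchExtractException("Interrupted while waiting for extraction.", e);
            }
        } finally {
            executor.shutdownNow();
            try {
                executor.awaitTermination(cancelGracePeriodMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

A call such as TimedExtractionSketch.run(() -> "text", 1000L, 2000L) returns the worker's result, while work that outlives the timeout surfaces as SketchExtractException with the timeout message.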

Tests

Added three tests in PdfExtractorTest:

  • test_extractionTimeout_throwsExtractException - injects a stripper that sleeps 60s, sets timeout to 100 ms, asserts ExtractException with the expected message and that the worker observed the interrupt.
  • test_extractionCancellation_releasesThread - times out once, then verifies a subsequent extraction on the same instance succeeds (proving the executor was shut down cleanly and no resources leaked).
  • test_extractionInterrupt_propagates - runs the extractor on a separate thread, interrupts that thread while it is blocked in Future.get, and asserts that ExtractException is thrown and that the thread's interrupt status is preserved.
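The interrupt-propagation scenario can be sketched without PDFBox or a test framework. Everything below is hypothetical: slowGetText stands in for PdfExtractor.getText() with a slow stripper, and IllegalStateException stands in for ExtractException. The actual test in PdfExtractorTest may be structured differently.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class InterruptPropagationSketch {
    // Hypothetical stand-in for getText(): blocks in Future.get on slow work.
    static String slowGetText(CountDownLatch started) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "interrupt-sketch-worker");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<String> future = executor.submit(() -> {
                started.countDown();
                Thread.sleep(60_000L); // simulates a slow stripper
                return "never returned";
            });
            return future.get(30_000L, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the flag for the caller
            throw new IllegalStateException("Interrupted while waiting for extraction.", e);
        } catch (Exception e) {
            throw new IllegalStateException("Extraction failed.", e);
        } finally {
            executor.shutdownNow();
        }
    }

    // Returns { exception was thrown, interrupt flag was still set }.
    static boolean[] runScenario() {
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean threw = new AtomicBoolean(false);
        AtomicBoolean flagPreserved = new AtomicBoolean(false);
        Thread caller = new Thread(() -> {
            try {
                slowGetText(started);
            } catch (IllegalStateException e) {
                threw.set(true);
                flagPreserved.set(Thread.currentThread().isInterrupted());
            }
        }, "interrupt-sketch-caller");
        caller.start();
        try {
            started.await(5, TimeUnit.SECONDS); // wait until the work has begun
            caller.interrupt();                 // interrupt the thread blocked in Future.get
            caller.join(5_000L);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new boolean[] { threw.get(), flagPreserved.get() };
    }
}
```

The latch only guarantees that the work was submitted before the interrupt is sent. If the interrupt lands just before the caller enters Future.get, the pending flag still makes get throw InterruptedException immediately, so the outcome is the same.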

Test plan

  • mvn -pl fess-crawler test -Dtest='PdfExtractorTest' (9/9 tests pass)
  • Full mvn -pl fess-crawler test (only pre-existing failures remain: JodExtractorTest requires LibreOffice; Hc4HttpClientTest#test_doHead_accessTimeoutTarget was already failing on master; Docker-dependent SmbClientTest/StorageClientTest/GcsClientTest/S3ClientTest errors)
  • mvn formatter:format && mvn license:format (no diff)

marevol added 2 commits May 5, 2026 07:22
…tion cancellation

The previous PdfExtractor.getText() relied on a hand-rolled worker thread plus
a 100 ms busy-loop interrupt pattern guarded by AtomicBoolean. That approach
has several problems:

* Busy-looping wastes CPU until the timeout elapses.
* PDFBox does not always honour Thread.interrupt(), so the worker can keep
  running after the timeout and touch the PDDocument that the caller's
  try-with-resources is about to close, surfacing as a secondary
  "COSStream is closed" failure.
* The worker thread is not guaranteed to be a daemon, so a runaway worker
  could prevent JVM shutdown.
* Exceptions from the worker were funneled through a HashSet<Exception>
  holder, an error-prone pattern.

This change rewrites the cancellation path on top of an ExecutorService:

* Each call submits the extraction work to a per-call single-thread executor
  built from a shared daemon ThreadFactory.
* Future.get(timeout, TimeUnit.MILLISECONDS) replaces the busy-loop.
  TimeoutException -> cancel + ExtractException; ExecutionException is
  unwrapped; InterruptedException re-interrupts the calling thread before
  throwing ExtractException.
* On the way out, the executor is stopped with shutdownNow() and we call
  awaitTermination for a configurable cancelGracePeriodMs (default 2000 ms)
  BEFORE the try-with-resources closes the PDDocument. This eliminates the
  COSStream-is-closed race when PDFBox does not honour the interrupt
  promptly.
* createStripper() is now a protected factory method so tests can inject a
  custom PDFTextStripper that simulates slow extraction.
* setDaemonThread is kept as a deprecated no-op for source compatibility;
  worker threads are always daemons now.

New tests cover timeout-throws-ExtractException, post-cancellation reuse of
the same extractor instance, and propagation when the calling thread is
interrupted while waiting on Future.get.
marevol merged commit 6ddc9be into master on May 5, 2026
1 check passed