Skip to content

fix(extractor): close MS Publisher/Visio document resources properly#160

Merged
marevol merged 2 commits intomasterfrom
fix/extractor-office-resource-leak
May 5, 2026
Merged

fix(extractor): close MS Publisher/Visio document resources properly#160
marevol merged 2 commits intomasterfrom
fix/extractor-office-resource-leak

Conversation

@marevol
Copy link
Copy Markdown
Contributor

@marevol marevol commented May 4, 2026

Summary

  • Wrap PublisherTextExtractor / VisioTextExtractor in try-with-resources so the underlying POIFSFileSystem and caller's InputStream are always closed (POI text extractors implement Closeable via POITextExtractor).
  • Remove @SuppressWarnings("resource") annotations.
  • Use shared validateInputStream(in) helper from AbstractExtractor for null/empty checks.
  • Rethrow IOException as ExtractException with key=value context (error=...).

Threat model

Extractor parameters Map is admin-configured/internal. Only the InputStream content is untrusted. The previous code leaked the POIFSFileSystem and the caller's stream when extraction threw, which on a high-volume crawler accumulated open file handles.

Tests

  • Replaced test_getText_null with test_nullInput_throwsCrawlerSystemException.
  • Added test_corruptedInput_throwsExtractException, test_textInput_throwsExtractException, test_emptyInput_throwsExtractException.
  • Added test_corruptedInput_closesUnderlyingStream using a CloseTrackingInputStream to assert the caller's stream is closed even when extraction fails — this is the regression we are fixing.
  • No .pub/.vsd fixtures exist under src/test/resources/, so the "valid input + content assertion" variant is omitted in favor of the close-on-failure verification.

Verification

  • mvn -pl fess-crawler test -Dtest='MsPublisherExtractorTest,MsVisioExtractorTest' → 12/12 pass.
  • Full mvn -pl fess-crawler test → 1698 tests; only pre-existing Docker-environment failures remain (GcsClientTest, S3ClientTest, SmbClientTest, StorageClientTest).
  • mvn formatter:format license:format clean.

Test plan

  • CI green
  • Manual review of try-with-resources placement

marevol added 2 commits May 5, 2026 07:16
`MsPublisherExtractor` and `MsVisioExtractor` previously instantiated
`PublisherTextExtractor` / `VisioTextExtractor` without ever closing
them, hidden by `@SuppressWarnings("resource")`. Both extractor classes
own the underlying `POIFSFileSystem` (and therefore the caller-supplied
`InputStream`), so when extraction failed the file handle and memory
buffers were leaked.

This change wraps the POI extractors in try-with-resources so the
underlying filesystem is always closed, even on exception paths.
`@SuppressWarnings("resource")` is removed and the manual null check is
replaced with the shared `validateInputStream` helper from
`AbstractExtractor`. Caught `IOException`s are rethrown as
`ExtractException` with key=value diagnostic context.

Tests are extended with a `CloseTrackingInputStream` that asserts the
caller's stream is closed even when extraction throws, plus explicit
corrupted/empty/text input cases for both extractors.
The cause exception is already passed to ExtractException, so embedding
e.getMessage() in the message is duplicative. Match the convention used
in MsWordExtractor and MsExcelExtractor ("Failed to extract text from X
document.").
@marevol marevol merged commit 5eac93e into master May 5, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant