fix(extractor): close MS Publisher/Visio document resources properly #160
Merged
`MsPublisherExtractor` and `MsVisioExtractor` previously instantiated
`PublisherTextExtractor` / `VisioTextExtractor` without ever closing
them, hidden by `@SuppressWarnings("resource")`. Both extractor classes
own the underlying `POIFSFileSystem` (and therefore the caller-supplied
`InputStream`), so when extraction failed the file handle and memory
buffers were leaked.
This change wraps the POI extractors in try-with-resources so the
underlying filesystem is always closed, even on exception paths.
`@SuppressWarnings("resource")` is removed and the manual null check is
replaced with the shared `validateInputStream` helper from
`AbstractExtractor`. Caught `IOException`s are rethrown as
`ExtractException` with key=value diagnostic context.
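The pattern described above can be sketched without POI on the classpath. This is a minimal illustration, not the code from this change: `FakeTextExtractor` and the nested `ExtractException` are hypothetical stand-ins for the POI extractor and the project's exception type, and the `last` field exists only so the demo can inspect state afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

public class TryWithResourcesSketch {

    /** Hypothetical stand-in for the project's ExtractException. */
    static class ExtractException extends RuntimeException {
        ExtractException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for a POI text extractor that owns the caller's stream. */
    static class FakeTextExtractor implements Closeable {
        private final InputStream in;
        boolean closed;

        FakeTextExtractor(InputStream in) {
            this.in = in;
        }

        String getText() throws IOException {
            // Treat an empty stream as an unreadable document.
            if (in.read() == -1) {
                throw new IOException("empty document");
            }
            return "text";
        }

        @Override
        public void close() throws IOException {
            closed = true;
            in.close(); // closing the extractor closes the stream it owns
        }
    }

    /** Exposed only so the demo can check that close() ran. */
    static FakeTextExtractor last;

    static String extract(InputStream in) {
        // The resource is closed before the catch block runs,
        // on both the success path and the exception path.
        try (FakeTextExtractor extractor = new FakeTextExtractor(in)) {
            last = extractor;
            return extractor.getText();
        } catch (IOException e) {
            throw new ExtractException("Failed to extract text from the document.", e);
        }
    }

    public static void main(String[] args) {
        try {
            extract(new ByteArrayInputStream(new byte[0]));
        } catch (ExtractException e) {
            System.out.println("closed=" + last.closed + " cause=" + e.getCause().getMessage());
        }
    }
}
```

Because try-with-resources closes the resource before any `catch` clause executes, the rethrow cannot skip the cleanup, which is the property the manual code lacked.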
Tests are extended with a `CloseTrackingInputStream` that asserts the
caller's stream is closed even when extraction throws, plus explicit
corrupted/empty/text input cases for both extractors.
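A close-tracking stream of the kind these tests rely on might look like the sketch below. The actual `CloseTrackingInputStream` in the test sources may differ; `extractOrFail` is a hypothetical stand-in for an extractor that always fails, and `IllegalStateException` stands in for `ExtractException`.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseTrackingSketch {

    /** Records whether close() was called on the wrapped stream. */
    static class CloseTrackingInputStream extends FilterInputStream {
        private boolean closed;

        CloseTrackingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Hypothetical extractor stand-in: always fails, but closes its input first. */
    static void extractOrFail(InputStream in) {
        try (InputStream owned = in) {
            owned.read();
            throw new IOException("not a valid document");
        } catch (IOException e) {
            throw new IllegalStateException("Failed to extract text.", e);
        }
    }

    /** Returns true only if extraction failed and the stream was closed anyway. */
    static boolean closedAfterFailure() {
        CloseTrackingInputStream in =
                new CloseTrackingInputStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        try {
            extractOrFail(in);
            return false; // extraction unexpectedly succeeded
        } catch (IllegalStateException expected) {
            return in.isClosed();
        }
    }

    public static void main(String[] args) {
        System.out.println(closedAfterFailure());
    }
}
```

Wrapping the input in a `FilterInputStream` subclass keeps the check independent of the extractor's internals: the test only observes whether `close()` reached the caller's stream.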
The cause exception is already passed to `ExtractException`, so embedding
`e.getMessage()` in the message is duplicative. Match the convention used
in `MsWordExtractor` and `MsExcelExtractor` ("Failed to extract text from X
document.").
Summary
- Wrap `PublisherTextExtractor` / `VisioTextExtractor` in try-with-resources so the underlying `POIFSFileSystem` and caller's `InputStream` are always closed (POI text extractors implement `Closeable` via `POITextExtractor`).
- Remove the `@SuppressWarnings("resource")` annotations.
- Use the shared `validateInputStream(in)` helper from `AbstractExtractor` for null/empty checks.
- Rethrow `IOException` as `ExtractException` with `key=value` context (`error=...`).
Threat model
The extractor parameters Map is admin-configured/internal. Only the `InputStream` content is untrusted. The previous code leaked the `POIFSFileSystem` and the caller's stream when extraction threw, which on a high-volume crawler accumulated open file handles.
Tests
- Replace `test_getText_null` with `test_nullInput_throwsCrawlerSystemException`.
- Add `test_corruptedInput_throwsExtractException`, `test_textInput_throwsExtractException`, `test_emptyInput_throwsExtractException`.
- Add `test_corruptedInput_closesUnderlyingStream` using a `CloseTrackingInputStream` to assert the caller's stream is closed even when extraction fails; this is the regression we are fixing.
- No `.pub` / `.vsd` fixtures exist under `src/test/resources/`, so the "valid input + content assertion" variant is omitted in favor of the close-on-failure verification.
Verification
- `mvn -pl fess-crawler test -Dtest='MsPublisherExtractorTest,MsVisioExtractorTest'` → 12/12 pass.
- `mvn -pl fess-crawler test` → 1698 tests; only pre-existing Docker-environment failures remain (`GcsClientTest`, `S3ClientTest`, `SmbClientTest`, `StorageClientTest`).
- `mvn formatter:format license:format` clean.
Test plan