fix(extractor): defend against archive bomb, recursion, and Zip Slip#161
Adds defensive limits to `ZipExtractor`, `TarExtractor` and `LhaExtractor` so that a malicious content stream can no longer cause unbounded resource consumption or write outside its conceptual extraction root. Threat model assumption: the `InputStream` is untrusted; the `params` map is admin-configured / trusted.

Protections:

- `AbstractExtractor`: `extractorDepth` param + `checkDepth` / `getCurrentDepth` / `incrementDepth` helpers, plus a configurable `maxArchiveDepth` (default 10). Recursive archive-in-archive expansions now propagate an incremented depth and abort with `MaxLengthExceededException` at the threshold.
- `ZipExtractor`: per-archive caps (`maxBytes` 2 GiB, `maxEntries` 100k, `maxCompressionRatio` 100:1 for entries > 1 MiB), Zip Slip path-traversal rejection via `Path.normalize()`, early ZIP signature validation, and a configurable `filenameEncoding` (UTF-8 / CP932 / MS932) for non-UTF-8 archive names. The compression ratio uses the central-directory size when available and falls back to bytes consumed from a `CountingInputStream`.
- `TarExtractor`: `maxBytes` / `maxEntries` caps, depth check, path-traversal rejection, and explicit skipping of symbolic-link and hard-link entries to avoid leaking content from outside the archive sandbox.
- `LhaExtractor`: `maxBytes` / `maxEntries` caps, depth check, path-traversal rejection.

Tests:

- `AbstractExtractorTest`: new coverage for `getCurrentDepth` / `incrementDepth` (returns a NEW map, leaves the input untouched) / `checkDepth` pass and fail paths.
- `ArchiveExtractorSecurityTest`: 15 synthetic-archive tests covering each threat (zip byte / entry / ratio bombs, Zip Slip, absolute paths, recursion depth, CP932 filenames, tar symlink / hardlink / path traversal, tar byte / entry caps and depth, lha depth).

All previously-passing extractor tests continue to pass; the only failures in the suite are environment-dependent tests (Docker for the SMB / S3 / GCS / Storage clients, LibreOffice for `JodExtractor`), which are unchanged by this commit.
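The Zip Slip rejection via `Path.normalize()` described above can be sketched roughly as follows. This is an illustrative standalone class, not the actual `ZipExtractor` code; the method name `isSafeEntryName` is hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathTraversalCheck {
    // Reject entry names that would escape a conceptual extraction root,
    // mirroring the Path.normalize() approach described in the commit.
    static boolean isSafeEntryName(final String entryName) {
        final Path path = Paths.get(entryName);
        if (path.isAbsolute()) {
            return false; // e.g. /etc/passwd
        }
        final Path normalized = path.normalize();
        // After normalization, any remaining ".." means the entry climbs
        // above the extraction root (e.g. a/../../b.txt -> ../b.txt).
        for (final Path segment : normalized) {
            if ("..".equals(segment.toString())) {
                return false;
            }
        }
        return true;
    }

    public static void main(final String[] args) {
        System.out.println(isSafeEntryName("docs/readme.txt"));  // true
        System.out.println(isSafeEntryName("../../etc/passwd")); // false
        System.out.println(isSafeEntryName("a/../../b.txt"));    // false
    }
}
```

Normalizing before checking matters: a name like `a/../../b.txt` contains no leading `..` but still escapes the root once resolved.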
- `LhaExtractor` now stages the input archive through `copyBounded` with a configurable `maxInputBytes` cap (default 1 GiB) so a hostile producer cannot fill local storage before `LhaFile` is opened.
- The per-entry size cap is enforced against bytes actually decompressed (mirroring Zip/Tar) instead of the attacker-controlled `LhaHeader#getOriginalSize`. Entries are now buffered into a bounded `ByteArrayOutputStream` and forwarded to downstream extractors via `ByteArrayInputStream`, dropping the unbounded `IgnoreCloseInputStream` hand-off.
- Down-scale `test_perEntryCapEnforced` (300 MiB -> 2 MiB payload with a 1 MiB cap) so it stays cheap on parallel / low-memory CI, and add `test_lha_maxInputBytes_capsStaging` covering the new input-staging cap.
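A bounded staging copy of the kind described above might look like this sketch. The name `copyBounded` comes from the commit message, but the signature and exception choice here are assumptions, not the real implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class BoundedCopy {
    // Copy at most maxInputBytes from in to out; fail fast if the
    // producer tries to send more, so a hostile stream cannot fill
    // local storage before the archive is even opened.
    static long copyBounded(final InputStream in, final OutputStream out,
            final long maxInputBytes) throws IOException {
        final byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
            if (total > maxInputBytes) {
                throw new IOException("input exceeds cap of " + maxInputBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return total;
    }

    public static void main(final String[] args) throws IOException {
        // A 1 KiB stream fits under a 2 KiB cap...
        System.out.println(copyBounded(new ByteArrayInputStream(new byte[1024]),
                new ByteArrayOutputStream(), 2048)); // 1024
        // ...but is rejected by a 512-byte cap before being fully written.
        try {
            copyBounded(new ByteArrayInputStream(new byte[1024]),
                    new ByteArrayOutputStream(), 512);
        } catch (final IOException e) {
            System.out.println("capped");
        }
    }
}
```

The key property is that the cap is checked as bytes arrive, not after the copy completes.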
… into read budget

Addresses PR #161 review feedback:

1. Unsupported entries (no registered `Extractor`) are now skipped without buffering, mirroring the pre-defence behaviour. A large irrelevant entry (e.g. a video alongside a small `.txt`) no longer consumes the per-entry / total caps that should be reserved for entries the crawler actually wants to extract.
2. The legacy `maxContentSize` field is folded into the per-entry read budget (alongside `maxBytes` and `maxBytesPerEntry`). A small legacy cap (e.g. 10 MiB) now stops the buffer growing past it, instead of being checked only after up to 256 MiB + 1 had been buffered.

For Zip, the compressed-bytes anchor is also advanced for skipped / rejected entries, so the next supported entry's compression ratio is computed against its own compressed bytes, not also those of the skipped ones.

Tests: updates `test_perEntryCapEnforced` to use a supported (`.txt`) entry and adds three regressions:

- `test_zip_unsupportedEntryDoesNotConsumeCaps`
- `test_tar_unsupportedEntryDoesNotConsumeCaps`
- `test_zip_maxContentSize_capsBufferBeforePerEntryCap`
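Folding the legacy cap into the read budget amounts to bounding the buffer by the minimum of every applicable cap before any bytes are read, instead of checking one cap only after buffering. A minimal sketch, with hypothetical names (the real field handling in the extractors is more involved):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadBudget {
    // Buffer an entry, stopping as soon as the smallest applicable cap is
    // exceeded. A small legacy maxContentSize therefore stops the buffer
    // early instead of being checked after the fact.
    static byte[] bufferEntry(final InputStream entryIn, final long maxBytesRemaining,
            final long maxBytesPerEntry, final long maxContentSize) throws IOException {
        final long budget = Math.min(maxBytesRemaining,
                Math.min(maxBytesPerEntry, maxContentSize));
        final ByteArrayOutputStream buf = new ByteArrayOutputStream();
        final byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = entryIn.read(chunk)) != -1) {
            total += n;
            if (total > budget) {
                throw new IOException("entry exceeds read budget of " + budget + " bytes");
            }
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        // 100 bytes easily fits the tightest cap (1 MiB here).
        System.out.println(bufferEntry(new ByteArrayInputStream(new byte[100]),
                1L << 31, 10L << 20, 1L << 20).length); // 100
    }
}
```

The design choice is that all caps share one `while` loop, so no cap can be bypassed by the order of checks.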
Summary

Hardens `ZipExtractor`, `TarExtractor` and `LhaExtractor` against content-driven attacks. The input stream is treated as untrusted; the extractor `params` map remains admin-configured / trusted.

Protections added

- Recursion depth: an `extractorDepth` param tracked through `AbstractExtractor` (`EXTRACTOR_DEPTH_KEY`, `getCurrentDepth`, `incrementDepth`, `checkDepth`). Each archive extractor calls `checkDepth` on entry and `incrementDepth` when delegating to a nested extractor. Default `maxArchiveDepth = 10`.
- Per-archive byte cap (`maxBytes`, default 2 GiB), measured from the bytes actually read out of each entry stream, not from the declared `entry.getSize()`, which can be `-1` or fabricated.
- Entry-count cap (`maxEntries`, default 100,000) to defend against many-zero-byte-entries DoS.
- Compression-ratio cap (`maxCompressionRatio`, default 100:1 for entries > 1 MiB), using the central-directory compressed size where present and the bytes consumed from a `CountingInputStream` when the local header carries a data descriptor.
- Entry names are normalized with `Path.normalize()` and rejected when they escape the conceptual extraction root, are absolute, or contain `..` segments.
- Tar `LF_SYMLINK` / `LF_LINK` entries are skipped with a debug log so they cannot leak content from outside the archive.
- Configurable filename encoding for Zip (`filenameEncoding`, default `UTF-8`; supports `CP932` / `MS932` for Japanese archives).
- Failures are reported as `ExtractException` instead of returning empty content silently.
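The depth helpers listed above behave roughly like the following sketch. It is illustrative only: the real `AbstractExtractor` aborts with `MaxLengthExceededException`, for which `IllegalStateException` stands in here.

```java
import java.util.HashMap;
import java.util.Map;

public class DepthHelpers {
    static final String EXTRACTOR_DEPTH_KEY = "extractorDepth";
    static final int MAX_ARCHIVE_DEPTH = 10;

    // Parse the current depth from params; null, blank and garbage
    // values all count as depth 0.
    static int getCurrentDepth(final Map<String, String> params) {
        if (params == null) {
            return 0;
        }
        final String value = params.get(EXTRACTOR_DEPTH_KEY);
        if (value == null || value.isBlank()) {
            return 0;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (final NumberFormatException e) {
            return 0;
        }
    }

    // Return a NEW map with the depth incremented; the input map is
    // left untouched (as the tests described in this PR require).
    static Map<String, String> incrementDepth(final Map<String, String> params) {
        final Map<String, String> copy = params == null ? new HashMap<>() : new HashMap<>(params);
        copy.put(EXTRACTOR_DEPTH_KEY, String.valueOf(getCurrentDepth(params) + 1));
        return copy;
    }

    // Abort when the nesting threshold is reached.
    static void checkDepth(final Map<String, String> params) {
        if (getCurrentDepth(params) >= MAX_ARCHIVE_DEPTH) {
            throw new IllegalStateException("max archive depth exceeded");
        }
    }

    public static void main(final String[] args) {
        final Map<String, String> params = new HashMap<>();
        final Map<String, String> next = incrementDepth(params);
        System.out.println(getCurrentDepth(params)); // 0
        System.out.println(getCurrentDepth(next));   // 1
    }
}
```

Returning a fresh map keeps sibling entries at the same depth: each delegation gets its own incremented copy rather than mutating shared state.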
How to override

Field setters via DI (these are admin-configured, not params):

- `setMaxArchiveDepth(int)` (`AbstractExtractor`): default `10`
- `setMaxBytes(long)`: default 2 GiB
- `setMaxEntries(int)`: default `100_000`
- `setMaxCompressionRatio(long)` (Zip): default `100`
- `setFilenameEncoding(String)` (Zip): default `UTF-8`

Param key (caller-controlled, used for recursive calls):

- `extractorDepth`: the current depth value, parsed as an integer.

Threat model

- Untrusted: the content `InputStream`.
- Trusted: the `params` map (admin-configured / internal).
Test plan

- `AbstractExtractorTest`: depth-helper coverage (null / blank / garbage values, `incrementDepth` returns a NEW map, `checkDepth` at/above the limit).
- `ArchiveExtractorSecurityTest` (new): 15 synthetic-archive tests:
  - `test_zipBomb_byteLimit`
  - `test_zipBomb_entryLimit`
  - `test_zipSlip_pathTraversal`, `test_zipSlip_absolutePath`
  - `test_recursionDepth_exceeded`, `test_recursionDepth_belowLimit_succeeds`
  - `test_cp932Filename`
  - `test_tar_symlinkSkipped`, `test_tar_hardlinkSkipped`, `test_tar_pathTraversal`
  - `test_compressionRatioExceeded`
  - `test_tarBomb_byteLimit`, `test_tarBomb_entryLimit`, `test_tar_recursionDepth_exceeded`
  - `test_lha_recursionDepth_exceeded`
- All previously-passing extractor tests continue to pass (`ZipExtractorTest`, `TarExtractorTest`, `LhaExtractorTest`, `ArchiveExtractorErrorHandlingTest`). Only environment-dependent failures remain (Docker for SMB / S3 / GCS / Storage clients, LibreOffice for `JodExtractor`).
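For reference, a synthetic Zip Slip archive of the kind these tests exercise can be built in memory with plain `java.util.zip`; this standalone snippet is not taken from `ArchiveExtractorSecurityTest`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class SyntheticZipSlip {
    // Build an in-memory zip whose single entry tries to escape the
    // extraction root via ".." path segments. ZipOutputStream does not
    // validate entry names, so the hostile name is written verbatim.
    static byte[] buildZipSlipArchive() throws IOException {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("../../evil.txt"));
            zip.write("owned".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        try (ZipInputStream in = new ZipInputStream(
                new ByteArrayInputStream(buildZipSlipArchive()))) {
            // The traversal name survives the round trip, which is exactly
            // why extraction-time path validation is needed.
            System.out.println(in.getNextEntry().getName()); // ../../evil.txt
        }
    }
}
```

A security test can feed such bytes to the extractor and assert that extraction is rejected rather than writing outside the root.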