
fix(extractor): defend against archive bomb, recursion, and Zip Slip #161

Open
marevol wants to merge 4 commits into master from
fix/extractor-archive-bomb-defense

Conversation


marevol (Contributor) commented May 4, 2026

Summary

Hardens ZipExtractor, TarExtractor and LhaExtractor against
content-driven attacks. The input stream is treated as untrusted; the
extractor params map remains admin-configured / trusted.

Protections added

  • Recursion-depth limit — new extractorDepth param tracked through
    AbstractExtractor (EXTRACTOR_DEPTH_KEY, getCurrentDepth,
    incrementDepth, checkDepth). Each archive extractor calls
    checkDepth on entry and incrementDepth when delegating to a nested
    extractor. Default maxArchiveDepth = 10. (These helpers are
    sketched after this list.)
  • Total uncompressed-byte cap (maxBytes, default 2 GiB) measured
    from the bytes actually read out of each entry stream — not from the
    declared entry.getSize() which can be -1 or fabricated.
  • Maximum entry count (maxEntries, default 100,000) to defend
    against many-zero-byte-entries DoS.
  • Compression-ratio threshold (maxCompressionRatio, default 100:1
    for entries > 1 MiB) using the central-directory compressed size where
    present and the bytes consumed from a CountingInputStream when the
    local header carries a data descriptor. (See the ratio-check sketch
    after this list.)
  • Zip Slip path-traversal rejection — entry names are normalised via
    Path.normalize() and rejected when they escape the conceptual
    extraction root, are absolute, or contain .. segments.
  • Tar symlink / hardlink skipping — entries with LF_SYMLINK /
    LF_LINK are skipped with a debug log so they cannot leak content
    from outside the archive.
  • Configurable filename encoding (filenameEncoding, default
    UTF-8; supports CP932 / MS932 for Japanese archives).
  • Early ZIP signature validation so a clearly non-ZIP blob is
    reported as ExtractException instead of returning empty content
    silently.
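
A minimal sketch of how the depth helpers above could fit together. The
signatures are assumptions; only the names, the default of 10, and the
MaxLengthExceededException failure mode come from this PR:

```java
import java.util.HashMap;
import java.util.Map;

// Package assumed; adjust to where MaxLengthExceededException actually lives.
import org.codelibs.fess.crawler.exception.MaxLengthExceededException;

public abstract class AbstractExtractor {

    public static final String EXTRACTOR_DEPTH_KEY = "extractorDepth";

    // Overridable via setMaxArchiveDepth(int); default per this PR.
    protected int maxArchiveDepth = 10;

    // Parses the depth param; null/blank/garbage all count as depth 0.
    protected int getCurrentDepth(final Map<String, String> params) {
        if (params == null) {
            return 0;
        }
        final String value = params.get(EXTRACTOR_DEPTH_KEY);
        if (value == null || value.isBlank()) {
            return 0;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (final NumberFormatException e) {
            return 0;
        }
    }

    // Called on entry to each archive extractor; fails at/above the limit.
    protected void checkDepth(final Map<String, String> params) {
        if (getCurrentDepth(params) >= maxArchiveDepth) {
            throw new MaxLengthExceededException(
                    "Archive nesting depth exceeded: " + maxArchiveDepth);
        }
    }

    // Returns a NEW map with the bumped depth; the caller's map is untouched.
    protected Map<String, String> incrementDepth(final Map<String, String> params) {
        final Map<String, String> copy =
                params == null ? new HashMap<>() : new HashMap<>(params);
        copy.put(EXTRACTOR_DEPTH_KEY, String.valueOf(getCurrentDepth(params) + 1));
        return copy;
    }

    public void setMaxArchiveDepth(final int maxArchiveDepth) {
        this.maxArchiveDepth = maxArchiveDepth;
    }
}
```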
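
And a minimal sketch of the ratio rule. checkCompressionRatio and its
parameter names are illustrative; the 1 MiB floor, the central-directory-first
fallback, and maxCompressionRatio are from the list above:

```java
// Inside ZipExtractor; assumes a maxCompressionRatio field (default 100).
private void checkCompressionRatio(final long uncompressedBytes,
        final long declaredCompressedSize, final long countedCompressedBytes) {
    if (uncompressedBytes <= 1024L * 1024L) {
        return; // ratio is only enforced for entries larger than 1 MiB
    }
    // Prefer the central-directory compressed size; fall back to the bytes
    // consumed from the CountingInputStream when a data descriptor is used.
    final long compressed = declaredCompressedSize > 0
            ? declaredCompressedSize
            : countedCompressedBytes;
    if (compressed > 0 && uncompressedBytes / compressed > maxCompressionRatio) {
        throw new MaxLengthExceededException("Compression ratio "
                + uncompressedBytes / compressed + ":1 exceeds "
                + maxCompressionRatio + ":1");
    }
}
```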

How to override

Field setters via DI (these are admin-configured, not params):

Setter                                       Default
setMaxArchiveDepth(int) (AbstractExtractor)  10
setMaxBytes(long)                            2 GiB
setMaxEntries(int)                           100_000
setMaxCompressionRatio(long) (Zip)           100
setFilenameEncoding(String) (Zip)            UTF-8
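
For example, tightening the defaults programmatically (setter names as in the
table above; the values are illustrative):

```java
final ZipExtractor zipExtractor = new ZipExtractor();
zipExtractor.setMaxArchiveDepth(5);             // default 10
zipExtractor.setMaxBytes(512L * 1024 * 1024);   // default 2 GiB
zipExtractor.setMaxEntries(10_000);             // default 100_000
zipExtractor.setMaxCompressionRatio(50);        // default 100
zipExtractor.setFilenameEncoding("MS932");      // default UTF-8
```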

Param key (caller-controlled, used for recursive calls):

  • extractorDepth — current depth value, parsed as integer.

Threat model

  • Untrusted: archive content stream
  • Trusted: extractor params map (admin-configured / internal)

Test plan

  • AbstractExtractorTest: depth helper coverage (null/blank/garbage,
    incrementDepth returns NEW map, checkDepth at/above limit).
  • ArchiveExtractorSecurityTest (new): 15 synthetic-archive tests:
    • test_zipBomb_byteLimit
    • test_zipBomb_entryLimit
    • test_zipSlip_pathTraversal, test_zipSlip_absolutePath
    • test_recursionDepth_exceeded,
      test_recursionDepth_belowLimit_succeeds
    • test_cp932Filename
    • test_tar_symlinkSkipped, test_tar_hardlinkSkipped,
      test_tar_pathTraversal
    • test_compressionRatioExceeded
    • test_tarBomb_byteLimit, test_tarBomb_entryLimit,
      test_tar_recursionDepth_exceeded
    • test_lha_recursionDepth_exceeded
  • All previously-passing archive tests still pass
    (ZipExtractorTest, TarExtractorTest, LhaExtractorTest,
    ArchiveExtractorErrorHandlingTest).
  • No regressions in the wider test suite — only env-dependent
    failures remain (Docker for SMB / S3 / GCS / Storage clients,
    LibreOffice for JodExtractor).

marevol and others added 4 commits May 5, 2026 07:31
Adds defensive limits to ZipExtractor, TarExtractor and LhaExtractor so
that a malicious content stream can no longer cause unbounded resource
consumption or write outside its conceptual extraction root.

Threat model assumption: the InputStream is untrusted; the params Map is
admin-configured / trusted.

Protections:
- AbstractExtractor: extractorDepth param + checkDepth/getCurrentDepth/
  incrementDepth helpers, plus configurable maxArchiveDepth (default 10).
  Recursive archive-in-archive expansions now propagate an incremented
  depth and abort with MaxLengthExceededException at the threshold.
- ZipExtractor: per-archive caps (maxBytes 2 GiB, maxEntries 100k,
  maxCompressionRatio 100:1 for entries > 1 MiB), Zip Slip path-traversal
  rejection via Path.normalize(), early ZIP signature validation, and
  configurable filenameEncoding (UTF-8 / CP932 / MS932) for non-UTF8
  archive names. Compression ratio uses the central directory size when
  available and falls back to bytes consumed from a CountingInputStream.
  The path check is sketched after this list.
- TarExtractor: maxBytes / maxEntries caps, depth check, path-traversal
  rejection, and explicit skipping of symbolic-link and hard-link entries
  to avoid leaking content from outside the archive sandbox.
- LhaExtractor: maxBytes / maxEntries caps, depth check, path-traversal
  rejection.
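
A sketch of the normalise-and-reject rule described above. isTraversal is a
hypothetical helper name, not the method used in ZipExtractor:

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

static boolean isTraversal(final String entryName) {
    if (entryName == null || entryName.isEmpty()) {
        return true;
    }
    final Path path;
    try {
        path = Paths.get(entryName);
    } catch (final InvalidPathException e) {
        return true; // unparseable names are rejected outright
    }
    if (path.isAbsolute()) {
        return true; // e.g. "/etc/cron.d/evil"
    }
    // After normalize(), any surviving ".." segment escapes the conceptual
    // extraction root (e.g. "../../outside.txt" normalises to itself).
    for (final Path segment : path.normalize()) {
        if ("..".equals(segment.toString())) {
            return true;
        }
    }
    return false;
}
```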

Tests:
- AbstractExtractorTest: new coverage for getCurrentDepth /
  incrementDepth (returns NEW map, leaves input untouched) / checkDepth
  pass and fail paths.
- ArchiveExtractorSecurityTest: 15 synthetic-archive tests covering each
  threat (zip byte / entry / ratio bombs, Zip Slip, absolute paths,
  recursion-depth, CP932 filenames, tar symlink/hardlink/path-traversal,
  tar byte/entry caps and depth, lha depth).

All previously-passing extractor tests continue to pass; failures in the
suite are limited to environment-dependent tests (Docker for SMB / S3 /
GCS / Storage clients, LibreOffice for JodExtractor) which are unchanged
by this commit.
- LhaExtractor now stages the input archive through copyBounded with a
  configurable maxInputBytes cap (default 1 GiB) so a hostile producer
  cannot fill local storage before LhaFile is opened. (A sketch of such
  a bounded copy follows this list.)
- The per-entry size cap is enforced against bytes actually decompressed
  (mirroring Zip/Tar) instead of the attacker-controlled
  LhaHeader#getOriginalSize. Entries are now buffered into a bounded
  ByteArrayOutputStream and forwarded to downstream extractors via
  ByteArrayInputStream, dropping the unbounded IgnoreCloseInputStream
  hand-off.
- Down-scale test_perEntryCapEnforced (300 MiB -> 2 MiB payload with a
  1 MiB cap) so it stays cheap on parallel/low-memory CI, and add
  test_lha_maxInputBytes_capsStaging covering the new input staging
  cap.
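
A sketch of what a bounded staging copy like copyBounded could look like; only
the name and the maxInputBytes cap come from this commit, the body is
illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Copies at most maxInputBytes and aborts (reusing the exception type from
// the depth sketch above) instead of letting a hostile producer fill disk.
static long copyBounded(final InputStream in, final OutputStream out,
        final long maxInputBytes) throws IOException {
    final byte[] buffer = new byte[8192];
    long total = 0;
    int n;
    while ((n = in.read(buffer)) != -1) {
        total += n;
        if (total > maxInputBytes) {
            throw new MaxLengthExceededException(
                    "Staged input exceeds maxInputBytes: " + maxInputBytes);
        }
        out.write(buffer, 0, n);
    }
    return total;
}
```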
… into read budget

Addresses PR #161 review feedback:

1) Unsupported entries (no registered Extractor) are now skipped without
   buffering, mirroring the pre-defence behaviour. A large irrelevant
   entry (e.g. a video alongside a small .txt) no longer consumes the
   per-entry / total caps that should be reserved for entries the
   crawler actually wants to extract.

2) The legacy maxContentSize field is folded into the per-entry read
   budget (alongside maxBytes and maxBytesPerEntry). A small legacy cap
   (e.g. 10 MiB) now stops the buffer growing past it, instead of being
   checked only after up to 256 MiB+1 had been buffered. (See the
   budget sketch below.)

For Zip, the compressed-bytes anchor is also advanced past skipped /
rejected entries, so the next supported entry's compression ratio is
computed against its own compressed bytes rather than against those of
the skipped entries as well.
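
A minimal sketch of the folded budget from item 2. entryBudget and
totalBytesRead are illustrative names; maxContentSize, maxBytesPerEntry and
maxBytes are the fields named above:

```java
long entryBudget = maxBytesPerEntry;
if (maxContentSize > 0) {
    // The legacy cap now bounds the buffer up front instead of being
    // checked only after the full per-entry buffer had been filled.
    entryBudget = Math.min(entryBudget, maxContentSize);
}
// Never read past what is left of the archive-wide total cap either.
entryBudget = Math.min(entryBudget, maxBytes - totalBytesRead);
```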

Tests: updates test_perEntryCapEnforced to use a supported (.txt) entry
and adds three regressions:
- test_zip_unsupportedEntryDoesNotConsumeCaps
- test_tar_unsupportedEntryDoesNotConsumeCaps
- test_zip_maxContentSize_capsBufferBeforePerEntryCap
