
fix(extractor): defend against archive bomb, recursion, and Zip Slip #161

Open
marevol wants to merge 4 commits into master from
fix/extractor-archive-bomb-defense

Conversation


marevol (Contributor) commented May 4, 2026

Summary

Hardens ZipExtractor, TarExtractor and LhaExtractor against
content-driven attacks. The input stream is treated as untrusted; the
extractor params map remains admin-configured / trusted.

Protections added

  • Recursion-depth limit — new extractorDepth param tracked through
    AbstractExtractor (EXTRACTOR_DEPTH_KEY, getCurrentDepth,
    incrementDepth, checkDepth). Each archive extractor calls
    checkDepth on entry and incrementDepth when delegating to a nested
    extractor. Default maxArchiveDepth = 10. (These helpers are
    sketched after this list.)
  • Total uncompressed-byte cap (maxBytes, default 2 GiB) measured
    from the bytes actually read out of each entry stream — not from the
    declared entry.getSize() which can be -1 or fabricated.
  • Maximum entry count (maxEntries, default 100,000) to defend
    against many-zero-byte-entries DoS.
  • Compression-ratio threshold (maxCompressionRatio, default 100:1
    for entries > 1 MiB) using the central-directory compressed size where
    present and the bytes consumed from a CountingInputStream when the
    local header carries a data descriptor. (See the ratio-check sketch
    after this list.)
  • Zip Slip path-traversal rejection — entry names are normalised via
    Path.normalize() and rejected when they escape the conceptual
    extraction root, are absolute, or contain .. segments.
  • Tar symlink / hardlink skipping — entries with LF_SYMLINK /
    LF_LINK are skipped with a debug log so they cannot leak content
    from outside the archive.
  • Configurable filename encoding (filenameEncoding, default
    UTF-8; supports CP932 / MS932 for Japanese archives).
  • Early ZIP signature validation so a clearly non-ZIP blob is
    reported as ExtractException instead of returning empty content
    silently.
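
A minimal sketch of how the depth helpers above could fit together. The
signatures are assumptions; only the names, the default of 10, and the
MaxLengthExceededException failure mode come from this PR:

```java
import java.util.HashMap;
import java.util.Map;

// Package assumed; adjust to where MaxLengthExceededException actually lives.
import org.codelibs.fess.crawler.exception.MaxLengthExceededException;

public abstract class AbstractExtractor {

    public static final String EXTRACTOR_DEPTH_KEY = "extractorDepth";

    // Overridable via setMaxArchiveDepth(int); default per this PR.
    protected int maxArchiveDepth = 10;

    // Parses the depth param; null/blank/garbage all count as depth 0.
    protected int getCurrentDepth(final Map<String, String> params) {
        if (params == null) {
            return 0;
        }
        final String value = params.get(EXTRACTOR_DEPTH_KEY);
        if (value == null || value.isBlank()) {
            return 0;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (final NumberFormatException e) {
            return 0;
        }
    }

    // Called on entry to each archive extractor; fails at/above the limit.
    protected void checkDepth(final Map<String, String> params) {
        if (getCurrentDepth(params) >= maxArchiveDepth) {
            throw new MaxLengthExceededException(
                    "Archive nesting depth exceeded: " + maxArchiveDepth);
        }
    }

    // Returns a NEW map with the bumped depth; the caller's map is untouched.
    protected Map<String, String> incrementDepth(final Map<String, String> params) {
        final Map<String, String> copy =
                params == null ? new HashMap<>() : new HashMap<>(params);
        copy.put(EXTRACTOR_DEPTH_KEY, String.valueOf(getCurrentDepth(params) + 1));
        return copy;
    }

    public void setMaxArchiveDepth(final int maxArchiveDepth) {
        this.maxArchiveDepth = maxArchiveDepth;
    }
}
```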
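
And a minimal sketch of the ratio rule. checkCompressionRatio and its
parameter names are illustrative; the 1 MiB floor, the central-directory-first
fallback, and maxCompressionRatio are from the list above:

```java
// Inside ZipExtractor; assumes a maxCompressionRatio field (default 100).
private void checkCompressionRatio(final long uncompressedBytes,
        final long declaredCompressedSize, final long countedCompressedBytes) {
    if (uncompressedBytes <= 1024L * 1024L) {
        return; // ratio is only enforced for entries larger than 1 MiB
    }
    // Prefer the central-directory compressed size; fall back to the bytes
    // consumed from the CountingInputStream when a data descriptor is used.
    final long compressed = declaredCompressedSize > 0
            ? declaredCompressedSize
            : countedCompressedBytes;
    if (compressed > 0 && uncompressedBytes / compressed > maxCompressionRatio) {
        throw new MaxLengthExceededException("Compression ratio "
                + uncompressedBytes / compressed + ":1 exceeds "
                + maxCompressionRatio + ":1");
    }
}
```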

How to override

Field setters via DI (these are admin-configured, not params):

Setter                                       Default
setMaxArchiveDepth(int) (AbstractExtractor)  10
setMaxBytes(long)                            2 GiB
setMaxEntries(int)                           100_000
setMaxCompressionRatio(long) (Zip)           100
setFilenameEncoding(String) (Zip)            UTF-8
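
For example, tightening the defaults programmatically (setter names as in the
table above; the values are illustrative):

```java
final ZipExtractor zipExtractor = new ZipExtractor();
zipExtractor.setMaxArchiveDepth(5);             // default 10
zipExtractor.setMaxBytes(512L * 1024 * 1024);   // default 2 GiB
zipExtractor.setMaxEntries(10_000);             // default 100_000
zipExtractor.setMaxCompressionRatio(50);        // default 100
zipExtractor.setFilenameEncoding("MS932");      // default UTF-8
```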

Param key (caller-controlled, used for recursive calls):

  • extractorDepth — current depth value, parsed as integer.

Threat model

  • Untrusted: archive content stream
  • Trusted: extractor params map (admin-configured / internal)

Test plan

  • AbstractExtractorTest: depth helper coverage (null/blank/garbage,
    incrementDepth returns NEW map, checkDepth at/above limit).
  • ArchiveExtractorSecurityTest (new): 15 synthetic-archive tests:
    • test_zipBomb_byteLimit
    • test_zipBomb_entryLimit
    • test_zipSlip_pathTraversal, test_zipSlip_absolutePath
    • test_recursionDepth_exceeded,
      test_recursionDepth_belowLimit_succeeds
    • test_cp932Filename
    • test_tar_symlinkSkipped, test_tar_hardlinkSkipped,
      test_tar_pathTraversal
    • test_compressionRatioExceeded
    • test_tarBomb_byteLimit, test_tarBomb_entryLimit,
      test_tar_recursionDepth_exceeded
    • test_lha_recursionDepth_exceeded
  • All previously-passing archive tests still pass
    (ZipExtractorTest, TarExtractorTest, LhaExtractorTest,
    ArchiveExtractorErrorHandlingTest).
  • No regressions in the wider test suite — only env-dependent
    failures remain (Docker for SMB / S3 / GCS / Storage clients,
    LibreOffice for JodExtractor).

marevol and others added 4 commits May 5, 2026 07:31
Adds defensive limits to ZipExtractor, TarExtractor and LhaExtractor so
that a malicious content stream can no longer cause unbounded resource
consumption or write outside its conceptual extraction root.

Threat model assumption: the InputStream is untrusted; the params Map is
admin-configured / trusted.

Protections:
- AbstractExtractor: extractorDepth param + checkDepth/getCurrentDepth/
  incrementDepth helpers, plus configurable maxArchiveDepth (default 10).
  Recursive archive-in-archive expansions now propagate an incremented
  depth and abort with MaxLengthExceededException at the threshold.
- ZipExtractor: per-archive caps (maxBytes 2 GiB, maxEntries 100k,
  maxCompressionRatio 100:1 for entries > 1 MiB), Zip Slip path-traversal
  rejection via Path.normalize(), early ZIP signature validation, and
  configurable filenameEncoding (UTF-8 / CP932 / MS932) for non-UTF8
  archive names. Compression ratio uses the central directory size when
  available and falls back to bytes consumed from a CountingInputStream.
  The path check is sketched after this list.
- TarExtractor: maxBytes / maxEntries caps, depth check, path-traversal
  rejection, and explicit skipping of symbolic-link and hard-link entries
  to avoid leaking content from outside the archive sandbox.
- LhaExtractor: maxBytes / maxEntries caps, depth check, path-traversal
  rejection.
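
A sketch of the normalise-and-reject rule described above. isTraversal is a
hypothetical helper name, not the method used in ZipExtractor:

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

static boolean isTraversal(final String entryName) {
    if (entryName == null || entryName.isEmpty()) {
        return true;
    }
    final Path path;
    try {
        path = Paths.get(entryName);
    } catch (final InvalidPathException e) {
        return true; // unparseable names are rejected outright
    }
    if (path.isAbsolute()) {
        return true; // e.g. "/etc/cron.d/evil"
    }
    // After normalize(), any surviving ".." segment escapes the conceptual
    // extraction root (e.g. "../../outside.txt" normalises to itself).
    for (final Path segment : path.normalize()) {
        if ("..".equals(segment.toString())) {
            return true;
        }
    }
    return false;
}
```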

Tests:
- AbstractExtractorTest: new coverage for getCurrentDepth /
  incrementDepth (returns NEW map, leaves input untouched) / checkDepth
  pass and fail paths.
- ArchiveExtractorSecurityTest: 15 synthetic-archive tests covering each
  threat (zip byte / entry / ratio bombs, Zip Slip, absolute paths,
  recursion-depth, CP932 filenames, tar symlink/hardlink/path-traversal,
  tar byte/entry caps and depth, lha depth).

All previously-passing extractor tests continue to pass; failures in the
suite are limited to environment-dependent tests (Docker for SMB / S3 /
GCS / Storage clients, LibreOffice for JodExtractor) which are unchanged
by this commit.
- LhaExtractor now stages the input archive through copyBounded with a
  configurable maxInputBytes cap (default 1 GiB) so a hostile producer
  cannot fill local storage before LhaFile is opened. (A sketch of such
  a bounded copy follows this list.)
- The per-entry size cap is enforced against bytes actually decompressed
  (mirroring Zip/Tar) instead of the attacker-controlled
  LhaHeader#getOriginalSize. Entries are now buffered into a bounded
  ByteArrayOutputStream and forwarded to downstream extractors via
  ByteArrayInputStream, dropping the unbounded IgnoreCloseInputStream
  hand-off.
- Down-scale test_perEntryCapEnforced (300 MiB -> 2 MiB payload with a
  1 MiB cap) so it stays cheap on parallel/low-memory CI, and add
  test_lha_maxInputBytes_capsStaging covering the new input staging
  cap.
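
A sketch of what a bounded staging copy like copyBounded could look like; only
the name and the maxInputBytes cap come from this commit, the body is
illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Copies at most maxInputBytes and aborts (reusing the exception type from
// the depth sketch above) instead of letting a hostile producer fill disk.
static long copyBounded(final InputStream in, final OutputStream out,
        final long maxInputBytes) throws IOException {
    final byte[] buffer = new byte[8192];
    long total = 0;
    int n;
    while ((n = in.read(buffer)) != -1) {
        total += n;
        if (total > maxInputBytes) {
            throw new MaxLengthExceededException(
                    "Staged input exceeds maxInputBytes: " + maxInputBytes);
        }
        out.write(buffer, 0, n);
    }
    return total;
}
```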
… into read budget

Addresses PR #161 review feedback:

1) Unsupported entries (no registered Extractor) are now skipped without
   buffering, mirroring the pre-defence behaviour. A large irrelevant
   entry (e.g. a video alongside a small .txt) no longer consumes the
   per-entry / total caps that should be reserved for entries the
   crawler actually wants to extract.

2) The legacy maxContentSize field is folded into the per-entry read
   budget (alongside maxBytes and maxBytesPerEntry). A small legacy cap
   (e.g. 10 MiB) now stops the buffer growing past it, instead of being
   checked only after up to 256 MiB+1 had been buffered. (See the
   budget sketch below.)

For Zip, the compressed-bytes anchor is also advanced past skipped /
rejected entries, so the next supported entry's compression ratio is
computed against its own compressed bytes rather than against those of
the skipped entries as well.
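
A minimal sketch of the folded budget from item 2. entryBudget and
totalBytesRead are illustrative names; maxContentSize, maxBytesPerEntry and
maxBytes are the fields named above:

```java
long entryBudget = maxBytesPerEntry;
if (maxContentSize > 0) {
    // The legacy cap now bounds the buffer up front instead of being
    // checked only after the full per-entry buffer had been filled.
    entryBudget = Math.min(entryBudget, maxContentSize);
}
// Never read past what is left of the archive-wide total cap either.
entryBudget = Math.min(entryBudget, maxBytes - totalBytesRead);
```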

Tests: updates test_perEntryCapEnforced to use a supported (.txt) entry
and adds three regressions:
- test_zip_unsupportedEntryDoesNotConsumeCaps
- test_tar_unsupportedEntryDoesNotConsumeCaps
- test_zip_maxContentSize_capsBufferBeforePerEntryCap
