fix(extractor): decode RFC 2047 headers and bound EML recursion/bytes #166
Open
EML content is untrusted. The new bounds defend against deeply nested or massively multi-part messages that could exhaust memory.

- Decode RFC 2047 encoded-word headers (Subject, From, To, Cc, Bcc, Reply-To) via MimeUtility.decodeText for the new normalized metadata keys (subject, from, to, cc, bcc, replyTo).
- Add maxRecursionDepth (default 10) for nested message/rfc822 and multipart parts; throw MaxLengthExceededException when exceeded.
- Add maxParts (default 1000) and maxBodyBytes (default 50 MiB) DoS guards.
- Expose attachmentNames (multivalue metadata) without extracting binary content.
- Set common metadata: subject, from, to, cc, bcc, replyTo, sentDate, receivedDate, messageId.
- Preserve previous behavior: text alternatives prefer text/plain, legacy headers (Subject, From, To, ...) remain available.

Adds tests for body extraction, RFC 2047 decoding (Subject and From display name), attachment filename collection, recursion bomb, max parts, body byte truncation, and multipart/alternative preference.
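To illustrate what RFC 2047 decoding does to a header value such as Subject or From, here is a minimal JDK-only sketch. The PR itself delegates to MimeUtility.decodeText; the class EncodedWordSketch and its decode method are hypothetical names used only for illustration, and the sketch handles just the basic B and Q encodings (no RFC 2231 language suffixes, no recovery from malformed input).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the PR relies on MimeUtility.decodeText.
class EncodedWordSketch {
    // An encoded-word has the shape =?charset?encoding?encoded-text?=
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?\\s]+)\\?([bBqQ])\\?([^?\\s]*)\\?=");

    static String decode(String header) {
        Matcher m = ENCODED_WORD.matcher(header);
        StringBuilder out = new StringBuilder();
        int last = 0;
        boolean previousWasEncoded = false;
        while (m.find()) {
            String gap = header.substring(last, m.start());
            // Whitespace that only separates two encoded-words is dropped.
            if (!(previousWasEncoded && gap.trim().isEmpty())) {
                out.append(gap);
            }
            Charset charset = Charset.forName(m.group(1)); // throws on unknown charsets
            byte[] raw = m.group(2).equalsIgnoreCase("B")
                    ? Base64.getDecoder().decode(m.group(3))
                    : decodeQ(m.group(3));
            out.append(new String(raw, charset));
            last = m.end();
            previousWasEncoded = true;
        }
        return out.append(header.substring(last)).toString();
    }

    // Q encoding: '_' stands for a space, "=XX" for the byte with hex value XX.
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');
            } else if (c == '=' && i + 2 < text.length()) {
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c);
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(decode("=?UTF-8?B?SGVsbG8=?=")); // prints "Hello"
    }
}
```

Text outside encoded-words passes through unchanged, so a From value keeps its address part while only the display name is decoded.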
…boundary walk

The previous CharsetEncoder approach allocated a ByteBuffer sized to the entire remaining maxBodyBytes budget (50 MiB by default) on every appendBody call, even for small text parts. Under concurrent multipart EML processing this multiplied to gigabytes of throwaway allocations.

Encode the text once with String.getBytes(UTF_8) (memory proportional to input, not budget) and walk back over UTF-8 continuation bytes to land on a code-point boundary when truncation is needed.

Adds test_maxBodyBytes_truncatesAtUtf8CodePointBoundary verifying the boundary walk-back never produces a U+FFFD replacement char.
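The walk-back described above can be sketched as follows. This is an illustration under assumptions, not the PR's code: the class Utf8TruncationSketch and the method truncateUtf8 are hypothetical names, and the sketch returns a truncated String rather than appending to a body buffer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of truncating at a UTF-8 code-point boundary.
class Utf8TruncationSketch {
    /** Returns the longest prefix of text whose UTF-8 encoding fits in maxBytes. */
    static String truncateUtf8(String text, int maxBytes) {
        // Encode once: memory is proportional to the input, not to the budget.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return text;
        }
        int cutoff = Math.max(maxBytes, 0);
        // bytes[cutoff] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut would split a code point, so
        // step back until the dropped region starts at a lead byte.
        while (cutoff > 0 && (bytes[cutoff] & 0xC0) == 0x80) {
            cutoff--;
        }
        return new String(bytes, 0, cutoff, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "h\u00e9llo" encodes as 68 C3 A9 6C 6C 6F; a 2-byte cap would split C3 A9.
        System.out.println(truncateUtf8("h\u00e9llo", 2)); // prints "h"
    }
}
```

Because the cut always lands before a lead byte, decoding the kept prefix never encounters a partial sequence and so never yields U+FFFD.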
Three audit findings on PR #166:

- multipart/alternative previously charged only the chosen child to ctx.partCount, so an attacker could bypass maxParts by stuffing thousands of unused alternatives. Now charges count - 1 for skipped alternatives (the chosen one is counted via its own extractBody call), re-checking the cap before recursion.
- text/* parts were fully decoded into a String via Part.getContent() before any maxBodyBytes check, peaking heap at multiples of the part size. Replaced with a streaming read from Part.getInputStream() capped at 4 * remaining-UTF-8-budget + 16 bytes (enough to fill any UTF-8 cap regardless of source charset, but bounded relative to maxBodyBytes rather than to the part size).
- appendBody appended a trailing space even when the encoded text exactly filled the remaining budget, exceeding maxBodyBytes by 1. Reserve the separator byte before taking the fit branch and guard cutoff < bytes.length when walking back continuation bytes.

Adds regression tests for the alternative-bypass and the strict cap.
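The third finding (the off-by-one on the separator) can be sketched like this. The class BoundedBodySketch and its members are hypothetical; only the names appendBody and maxBodyBytes come from the description above, and the real method's signature and return type are not shown in this PR text, so treat the shape below as an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a body buffer that never exceeds maxBodyBytes.
class BoundedBodySketch {
    private final int maxBodyBytes;
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    BoundedBodySketch(int maxBodyBytes) {
        this.maxBodyBytes = maxBodyBytes;
    }

    /** Appends text and a one-byte separator; returns false if either did not fit. */
    boolean appendBody(String text) {
        int remaining = maxBodyBytes - out.size();
        if (remaining <= 0) {
            return false;
        }
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // Fit branch: reserve the separator byte up front, so text that exactly
        // fills the budget does not push the total one byte over the cap.
        if (bytes.length + 1 <= remaining) {
            out.write(bytes, 0, bytes.length);
            out.write(' ');
            return true;
        }
        // Truncation branch: no separator is written. The cutoff < bytes.length
        // guard covers the case where the text itself fits exactly.
        int cutoff = Math.min(remaining, bytes.length);
        while (cutoff > 0 && cutoff < bytes.length && (bytes[cutoff] & 0xC0) == 0x80) {
            cutoff--;
        }
        out.write(bytes, 0, cutoff);
        return false;
    }

    int size() {
        return out.size();
    }

    String text() {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        BoundedBodySketch body = new BoundedBodySketch(5);
        body.appendBody("abcde");
        System.out.println(body.size()); // prints 5, not 6
    }
}
```

Without the reserved byte, the "abcde" case would take the fit branch and write six bytes against a five-byte cap, which is the overflow the finding describes.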
Summary
- Decode RFC 2047 encoded-word headers (Subject, From, To, Cc, Bcc, Reply-To) via MimeUtility.decodeText.
- Add maxRecursionDepth (default 10) for nested message/rfc822 parts.
- Add maxParts (default 1000) and maxBodyBytes (default 50 MiB) DoS guards.
- Expose attachmentNames (multivalue metadata) without extracting binary content.
- Set common metadata: subject, from, to, cc, sentDate, receivedDate, messageId.

Threat model
EML content is untrusted. The new bounds defend against deeply nested or massively multi-part messages that could exhaust memory.
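A minimal sketch of how depth and part-count bounds of this kind can stop a hostile message tree. Everything here is an assumption for illustration: Node, Context, LimitExceededException, and walk are hypothetical stand-ins (the PR works on mail parts and throws MaxLengthExceededException), and the exact boundary semantics (root at depth 0, failure once depth exceeds the limit) are chosen for the sketch rather than taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are the project's own.
class PartWalkSketch {
    /** Stand-in for a MIME part; a leaf has no children. */
    static class Node {
        final List<Node> children = new ArrayList<>();
    }

    static class LimitExceededException extends RuntimeException {
        LimitExceededException(String message) {
            super(message);
        }
    }

    /** Limits plus the running part counter shared across the whole walk. */
    static class Context {
        final int maxRecursionDepth;
        final int maxParts;
        int partCount;

        Context(int maxRecursionDepth, int maxParts) {
            this.maxRecursionDepth = maxRecursionDepth;
            this.maxParts = maxParts;
        }
    }

    static void walk(Node node, int depth, Context ctx) {
        // Check both limits before doing any work on the part.
        if (depth > ctx.maxRecursionDepth) {
            throw new LimitExceededException("nesting deeper than " + ctx.maxRecursionDepth);
        }
        if (++ctx.partCount > ctx.maxParts) {
            throw new LimitExceededException("more than " + ctx.maxParts + " parts");
        }
        for (Node child : node.children) {
            walk(child, depth + 1, ctx);
        }
    }

    /** Builds a chain of nested nodes with the given number of levels. */
    static Node chain(int levels) {
        Node root = new Node();
        Node current = root;
        for (int i = 1; i < levels; i++) {
            Node next = new Node();
            current.children.add(next);
            current = next;
        }
        return root;
    }

    public static void main(String[] args) {
        Context ctx = new Context(10, 1000);
        walk(chain(5), 0, ctx);
        System.out.println(ctx.partCount); // prints 5
    }
}
```

Keeping the counter in a shared context rather than per call is what lets the part limit apply to the whole message instead of to each multipart container separately.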
Tests
Test plan