Optimize token storage with offset-based windows instead of strings#1147
Open
Soner (shyim) wants to merge 6 commits into
Open
Optimize token storage with offset-based windows instead of strings#1147Soner (shyim) wants to merge 6 commits into
Soner (shyim) wants to merge 6 commits into
Conversation
9f94895 to
cb71d4a
Compare
Attribute values were boxed into the Node interface one malloc at a time when appended to ElementNode.Attributes. Allocation profiling of the parse benchmarks showed this single append site as the largest remaining allocator (~39% of all objects), the last value-typed node in an otherwise pointer-based AST. Slab-allocate attributes like RawNode/ElementNode/TemplateExpressionNode and store *Attribute in the NodeList, and give Attribute.Dump a pointer receiver so the whole AST is uniformly pointer-based. The slab is pre-sized from the measured ~1-attribute-per-32-tokens ratio and grows on demand. Parsing a large concatenated corpus is ~18% faster with ~15% fewer allocations; the small-file corpus is ~20% faster. Callers in internal/verifier that type-assert or construct attributes are updated to the pointer form. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
…arena Two allocation hotspots surfaced by profiling the parse benchmarks after the attribute change: - The lexer pre-sized its token slice to len(src)/6, but the measured token density on real templates is ~5.7 source bytes per token, so the estimate fell just short and forced exactly one geometric grow — reallocating and write-barrier-copying the entire (~70k-entry) token array on every large parse. The code drifted from its own comment, which already prescribed len(src)/5. Restoring /5 removes that grow: BenchmarkParseLarge drops from ~11.7ms to ~6.5ms (~44% faster) and its allocated bytes from 11MB to 6.5MB. - collect() allocated one exact-size NodeList per child list (every element's children, every block/if body) — the largest object allocator once attributes were slab-backed. Pack them end to end in a shared node arena and return capped subslices instead; a consumer appending to node.Children reallocates rather than clobbering the next list. Cuts BenchmarkParseLarge allocations ~46% (12009 to 6459) at neutral CPU and equal memory, easing GC pressure when formatting many files. Fixtures, fuzz, and race checks all pass unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
posTracker.advance counted newlines with strings.Count on every call, but profiling showed most calls advance by only 1–2 bytes (skipping a delimiter, bracket, or single character). At that size the SIMD-accelerated strings.Count is dominated by its own call and dispatch overhead — counting newlines in a one-byte string is almost all overhead. Scan spans of 16 bytes or fewer with an inline byte loop and reserve strings.Count for the larger runs (raw text, expression and comment bodies) where the SIMD path pays off. Line numbers are computed identically, so all fixtures, fuzz, and race checks pass unchanged. BenchmarkParseLarge improves from ~6.6ms to ~5.8ms (~12% faster, 60→69 MB/s); advance drops from ~18% to ~8% of parse CPU. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
Once the token buffer was right-sized, profiling showed the remaining lexer cost was GC-related: the token stream is one large []token, and because token held two string fields (Lit, Raw) the whole buffer was full of pointers — so the GC scanned all ~4.5MB of it every cycle and every emit append carried write barriers (bulkBarrierPreWriteSrcOnly was ~24% of parse CPU at one point). Every lexed literal is a substring of the source — even the whitespace-trimmed bodies, since strings.TrimSpace/TrimRight/TrimSuffix return subslices — so a [offset,len) window loses no information. Store Lit and Raw as int32 offsets and recover the strings via Lit(src)/Raw(src) accessors. token is now pointer-free and shrinks from 64 to 48 bytes, so the buffer is never GC-scanned and appends need no write barriers. BenchmarkLexCorpus improves ~15% (63→73 MB/s) with 27% less memory; parse allocates ~20% less overall (ParseLarge 6.55MB→5.27MB) and emit falls from ~10% to ~2% of CPU. All fixtures stay byte-identical; race and both fuzz targets (FuzzLexer, FuzzParser) pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
An 8.5MB html.test build artifact had been committed by accident; ignore *.test so it cannot recur. The binary itself is removed from this branch's history in the same push. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
Trim the explanatory comments added with the parsing optimizations down to what the code needs — drop the embedded profiling narrative and restate the invariants concisely. Also simplify the tokTwigIdent emit guard: identEnd > identStart || identEnd > wsStart reduces to identEnd > wsStart since wsStart <= identStart. No behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
cb71d4a to
99dafcd
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Refactors the lexer's token representation to store literal and raw text as
[offset, length)windows into the source string rather than as separate string allocations. This reduces memory overhead and GC pressure while maintaining the same functionality.Key Changes
Token Storage Optimization
tokenstruct from storingLitandRawas strings to storing them aslitOff,litLen,rawOff,rawLenint32 fieldsLit(src)andRaw(src)accessor methods to recover strings from offsetsLitLen()andRawLen()helper methodsLexer Refactoring
trimSpaceWindow()function to compute trimmed text windows using Unicode semantics (matchingstrings.TrimSpacebehavior) without allocating new stringsmkTok()helper to construct tokens from offset windowsspan()helper for the common case where Lit and Raw are identicallen(src)/6+16tolen(src)/5+16for better sizingParser Updates
nodeArenafield to pack all collected child node lists end-to-end, replacing individualmake()calls per listattrSlabandnewAttrNode()for attribute allocation poolingcollect()to append scratch entries to the shared arena instead of allocating new slicesLit(p.source)andRaw(p.source)accessorsPosition Tracking Optimization
advance()to count newlines inline for small spans (≤16 bytes) before falling back tostrings.Count()for larger spans, reducing call overheadType Changes
Attributeto be stored as pointers in NodeLists (matching other node types)*Attributetype assertions and pointer constructionTest Updates
Lit(src)accessor instead of accessing string field directlyImplementation Details
trimSpaceWindow()function precisely mirrorsstrings.TrimSpacesemantics using Unicode character classificationhttps://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu