Optimize token storage with offset-based windows instead of strings by shyim · Pull Request #1147 · shopware/shopware-cli

Soner (shyim) · 2026-07-01T06:48:56Z

Summary

Refactors the lexer's token representation to store literal and raw text as [offset, length) windows into the source string rather than as separate string allocations. This reduces memory overhead and GC pressure while maintaining the same functionality.

Key Changes

Token Storage Optimization

tokens.go: Changed token struct from storing Lit and Raw as strings to storing them as litOff, litLen, rawOff, rawLen int32 fields
- Added Lit(src) and Raw(src) accessor methods to recover strings from offsets
- Added LitLen() and RawLen() helper methods
- Tokens are now pointer-free, eliminating GC scanning and write barriers on token slice appends

Lexer Refactoring

lexer.go:
- Added trimSpaceWindow() function to compute trimmed text windows using Unicode semantics (matching strings.TrimSpace behavior) without allocating new strings
- Added mkTok() helper to construct tokens from offset windows
- Added span() helper for the common case where Lit and Raw are identical
- Updated all token emission sites to use offset-based construction instead of string slicing
- Adjusted token slice pre-allocation from len(src)/6+16 to len(src)/5+16 for better sizing

Parser Updates

parser.go:
- Added nodeArena field to pack all collected child node lists end-to-end, replacing individual make() calls per list
- Added attrSlab and newAttrNode() for attribute allocation pooling
- Updated collect() to append scratch entries to the shared arena instead of allocating new slices
- Pre-allocate arena and attribute slab based on token count estimates
- Updated all token field accesses to use Lit(p.source) and Raw(p.source) accessors

Position Tracking Optimization

pos.go: Optimized advance() to count newlines inline for small spans (≤16 bytes) before falling back to strings.Count() for larger spans, reducing call overhead

Type Changes

ast.go: Changed Attribute to be stored as pointers in NodeLists (matching other node types)
Updated all fixer and checker code to use *Attribute type assertions and pointer construction

Test Updates

lexer_test.go: Updated test assertions to call Lit(src) accessor instead of accessing string field directly

Implementation Details

All lexed literals are substrings of the source, so no information is lost by storing offsets
The trimSpaceWindow() function precisely mirrors strings.TrimSpace semantics using Unicode character classification
Token slices remain GC-efficient since tokens are now pointer-free (no write barriers needed)
Node arena packing reduces allocations for typical templates with many small child lists
Backward compatible: accessor methods provide the same string interface as before

https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu

Attribute values were boxed into the Node interface one malloc at a time when appended to ElementNode.Attributes. Allocation profiling of the parse benchmarks showed this single append site as the largest remaining allocator (~39% of all objects), the last value-typed node in an otherwise pointer-based AST. Slab-allocate attributes like RawNode/ElementNode/TemplateExpressionNode and store *Attribute in the NodeList, and give Attribute.Dump a pointer receiver so the whole AST is uniformly pointer-based. The slab is pre-sized from the measured ~1-attribute-per-32-tokens ratio and grows on demand. Parsing a large concatenated corpus is ~18% faster with ~15% fewer allocations; the small-file corpus is ~20% faster. Callers in internal/verifier that type-assert or construct attributes are updated to the pointer form. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu

…arena Two allocation hotspots surfaced by profiling the parse benchmarks after the attribute change: - The lexer pre-sized its token slice to len(src)/6, but the measured token density on real templates is ~5.7 source bytes per token, so the estimate fell just short and forced exactly one geometric grow — reallocating and write-barrier-copying the entire (~70k-entry) token array on every large parse. The code drifted from its own comment, which already prescribed len(src)/5. Restoring /5 removes that grow: BenchmarkParseLarge drops from ~11.7ms to ~6.5ms (~44% faster) and its allocated bytes from 11MB to 6.5MB. - collect() allocated one exact-size NodeList per child list (every element's children, every block/if body) — the largest object allocator once attributes were slab-backed. Pack them end to end in a shared node arena and return capped subslices instead; a consumer appending to node.Children reallocates rather than clobbering the next list. Cuts BenchmarkParseLarge allocations ~46% (12009 to 6459) at neutral CPU and equal memory, easing GC pressure when formatting many files. Fixtures, fuzz, and race checks all pass unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu

posTracker.advance counted newlines with strings.Count on every call, but profiling showed most calls advance by only 1–2 bytes (skipping a delimiter, bracket, or single character). At that size the SIMD-accelerated strings.Count is dominated by its own call and dispatch overhead — counting newlines in a one-byte string is almost all overhead. Scan spans of 16 bytes or fewer with an inline byte loop and reserve strings.Count for the larger runs (raw text, expression and comment bodies) where the SIMD path pays off. Line numbers are computed identically, so all fixtures, fuzz, and race checks pass unchanged. BenchmarkParseLarge improves from ~6.6ms to ~5.8ms (~12% faster, 60→69 MB/s); advance drops from ~18% to ~8% of parse CPU. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu

Once the token buffer was right-sized, profiling showed the remaining lexer cost was GC-related: the token stream is one large []token, and because token held two string fields (Lit, Raw) the whole buffer was full of pointers — so the GC scanned all ~4.5MB of it every cycle and every emit append carried write barriers (bulkBarrierPreWriteSrcOnly was ~24% of parse CPU at one point). Every lexed literal is a substring of the source — even the whitespace-trimmed bodies, since strings.TrimSpace/TrimRight/TrimSuffix return subslices — so a [offset,len) window loses no information. Store Lit and Raw as int32 offsets and recover the strings via Lit(src)/Raw(src) accessors. token is now pointer-free and shrinks from 64 to 48 bytes, so the buffer is never GC-scanned and appends need no write barriers. BenchmarkLexCorpus improves ~15% (63→73 MB/s) with 27% less memory; parse allocates ~20% less overall (ParseLarge 6.55MB→5.27MB) and emit falls from ~10% to ~2% of CPU. All fixtures stay byte-identical; race and both fuzz targets (FuzzLexer, FuzzParser) pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu

An 8.5MB html.test build artifact had been committed by accident; ignore *.test so it cannot recur. The binary itself is removed from this branch's history in the same push. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu

Trim the explanatory comments added with the parsing optimizations down to what the code needs — drop the embedded profiling narrative and restate the invariants concisely. Also simplify the tokTwigIdent emit guard: identEnd > identStart || identEnd > wsStart reduces to identEnd > wsStart since wsStart <= identStart. No behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu

Soner (shyim) force-pushed the claude/pointer-based-ast-html-bm2jhg branch 2 times, most recently from 9f94895 to cb71d4a Compare July 1, 2026 07:50

Claude (claude) added 6 commits July 1, 2026 11:05

Soner (shyim) force-pushed the claude/pointer-based-ast-html-bm2jhg branch from cb71d4a to 99dafcd Compare July 1, 2026 11:06

Soner (shyim) marked this pull request as ready for review July 1, 2026 11:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimize token storage with offset-based windows instead of strings#1147

Optimize token storage with offset-based windows instead of strings#1147
Soner (shyim) wants to merge 6 commits into
nextfrom
claude/pointer-based-ast-html-bm2jhg

Soner (shyim) commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Soner (shyim) commented Jul 1, 2026

Summary

Key Changes

Token Storage Optimization

Lexer Refactoring

Parser Updates

Position Tracking Optimization

Type Changes

Test Updates

Implementation Details

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants