
Optimize token storage with offset-based windows instead of strings #1147

Open
Soner (shyim) wants to merge 6 commits into next from claude/pointer-based-ast-html-bm2jhg


@shyim


Summary

Refactors the lexer's token representation to store literal and raw text as (offset, length) windows into the source string rather than as separate string fields. This makes tokens pointer-free, which reduces memory overhead and GC pressure without changing behavior.

Key Changes

Token Storage Optimization

  • tokens.go: Changed token struct from storing Lit and Raw as strings to storing them as litOff, litLen, rawOff, rawLen int32 fields
    • Added Lit(src) and Raw(src) accessor methods to recover strings from offsets
    • Added LitLen() and RawLen() helper methods
    • Tokens are now pointer-free, eliminating GC scanning and write barriers on token slice appends

Lexer Refactoring

  • lexer.go:
    • Added trimSpaceWindow() function to compute trimmed text windows using Unicode semantics (matching strings.TrimSpace behavior) without allocating new strings
    • Added mkTok() helper to construct tokens from offset windows
    • Added span() helper for the common case where Lit and Raw are identical
    • Updated all token emission sites to use offset-based construction instead of string slicing
    • Adjusted token slice pre-allocation from len(src)/6+16 to len(src)/5+16 for better sizing
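
One way a function like trimSpaceWindow() could be written is sketched below. The signature (source, start, end in; new start, end out) is an assumption; only the name and the requirement to match strings.TrimSpace come from this description.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// trimSpaceWindow narrows src[start:end] to the window that remains after
// removing leading and trailing Unicode white space. It returns new bounds
// instead of a new string. The signature is hypothetical.
func trimSpaceWindow(src string, start, end int) (int, int) {
	// Advance start past leading white space, one rune at a time.
	for start < end {
		r, size := utf8.DecodeRuneInString(src[start:end])
		if !unicode.IsSpace(r) {
			break
		}
		start += size
	}
	// Pull end back past trailing white space.
	for start < end {
		r, size := utf8.DecodeLastRuneInString(src[start:end])
		if !unicode.IsSpace(r) {
			break
		}
		end -= size
	}
	return start, end
}

func main() {
	src := "<p> \u00a0héllo\u2003\n</p>"
	bodyStart, bodyEnd := 3, len(src)-4 // the text between <p> and </p>
	s, e := trimSpaceWindow(src, bodyStart, bodyEnd)
	// The window should select the same text strings.TrimSpace returns.
	fmt.Println(src[s:e] == strings.TrimSpace(src[bodyStart:bodyEnd]))
}
```

An all-whitespace window collapses to an empty one (start equals end), which matches strings.TrimSpace returning the empty string.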

Parser Updates

  • parser.go:
    • Added nodeArena field to pack all collected child node lists end-to-end, replacing individual make() calls per list
    • Added attrSlab and newAttrNode() for attribute allocation pooling
    • Updated collect() to append scratch entries to the shared arena instead of allocating new slices
    • Pre-allocate arena and attribute slab based on token count estimates
    • Updated all token field accesses to use Lit(p.source) and Raw(p.source) accessors
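
The arena packing can be sketched as follows. `Node`, `NodeList`, and the `parser` fields here are simplified stand-ins, and the use of a full slice expression to cap capacity is an assumption about how the capped sub-slices mentioned in the commit message are produced.

```go
package main

import "fmt"

// Node and NodeList are simplified stand-ins for the AST types.
type Node interface{}
type NodeList []Node

type parser struct {
	nodeArena []Node // every collected child list, packed end to end
}

// collect copies scratch into the shared arena and returns a sub-slice whose
// capacity equals its length, so a later append by a consumer must allocate
// a new array instead of overwriting the list stored next in the arena.
func (p *parser) collect(scratch []Node) NodeList {
	if len(scratch) == 0 {
		return nil
	}
	start := len(p.nodeArena)
	p.nodeArena = append(p.nodeArena, scratch...)
	end := len(p.nodeArena)
	return NodeList(p.nodeArena[start:end:end]) // full slice expression caps cap
}

func main() {
	p := &parser{nodeArena: make([]Node, 0, 8)}
	a := p.collect([]Node{"a1", "a2"})
	b := p.collect([]Node{"b1"})
	a = append(a, "a3") // reallocates; b is untouched
	fmt.Println(a, b)
}
```

Without the third index in the slice expression, `append(a, "a3")` would write into the arena slot that holds `b[0]`.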

Position Tracking Optimization

  • pos.go: Optimized advance() to count newlines inline for small spans (≤16 bytes) before falling back to strings.Count() for larger spans, reducing call overhead

Type Changes

  • ast.go: Changed Attribute to be stored as pointers in NodeLists (matching other node types)
  • Updated all fixer and checker code to use *Attribute type assertions and pointer construction

Test Updates

  • lexer_test.go: Updated test assertions to call Lit(src) accessor instead of accessing string field directly

Implementation Details

  • All lexed literals are substrings of the source, so no information is lost by storing offsets
  • The trimSpaceWindow() function precisely mirrors strings.TrimSpace semantics using Unicode character classification
  • The token slice is never scanned by the GC and appends need no write barriers, because tokens are now pointer-free
  • Node arena packing reduces allocations for typical templates with many small child lists
  • Behavior is unchanged: the accessor methods return the same strings the old fields held, although call sites must now pass the source string

https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu

Soner (shyim) force-pushed the claude/pointer-based-ast-html-bm2jhg branch 2 times, most recently from 9f94895 to cb71d4a on July 1, 2026 07:50
Attribute values were boxed into the Node interface one malloc at a time
when appended to ElementNode.Attributes. Allocation profiling of the parse
benchmarks showed this single append site as the largest remaining
allocator (~39% of all objects), the last value-typed node in an otherwise
pointer-based AST.

Slab-allocate attributes like RawNode/ElementNode/TemplateExpressionNode
and store *Attribute in the NodeList, and give Attribute.Dump a pointer
receiver so the whole AST is uniformly pointer-based. The slab is pre-sized
from the measured ~1-attribute-per-32-tokens ratio and grows on demand.
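
A rough sketch of slab allocation for attributes, under stated assumptions: the `Attribute` fields, the `newParser` helper, and the grow policy (start a fresh, larger chunk when the current one is full) are illustrative and not taken from the PR. Only `attrSlab`, `newAttrNode`, and the 1-per-32-tokens pre-size come from the text.

```go
package main

import "fmt"

// Attribute is a simplified stand-in; the real type has more to it.
type Attribute struct{ Name, Value string }

type parser struct {
	attrSlab []Attribute
}

// newParser pre-sizes the slab from the token count, using the
// roughly one-attribute-per-32-tokens ratio quoted in the commit message.
func newParser(tokenCount int) *parser {
	return &parser{attrSlab: make([]Attribute, 0, tokenCount/32+1)}
}

// newAttrNode hands out a pointer to the next free slot. When the chunk is
// full it starts a new chunk rather than append-growing, so pointers handed
// out earlier keep referring to the nodes the AST already holds.
func (p *parser) newAttrNode() *Attribute {
	if len(p.attrSlab) == cap(p.attrSlab) {
		n := 2 * cap(p.attrSlab)
		if n < 16 {
			n = 16
		}
		p.attrSlab = make([]Attribute, 0, n)
	}
	p.attrSlab = p.attrSlab[:len(p.attrSlab)+1]
	return &p.attrSlab[len(p.attrSlab)-1]
}

func main() {
	p := newParser(64) // slab starts with room for 3 attributes
	first := p.newAttrNode()
	first.Name = "class"
	for i := 0; i < 10; i++ {
		p.newAttrNode().Name = "data-x" // forces at least one new chunk
	}
	fmt.Println(first.Name)
}
```

One allocation now serves a whole chunk of attributes, instead of one allocation per attribute when a value is boxed into the Node interface.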

Parsing a large concatenated corpus is ~18% faster with ~15% fewer
allocations; the small-file corpus is ~20% faster. Callers in
internal/verifier that type-assert or construct attributes are updated to
the pointer form.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
…arena

Two allocation hotspots surfaced by profiling the parse benchmarks after
the attribute change:

- The lexer pre-sized its token slice to len(src)/6, but the measured token
  density on real templates is ~5.7 source bytes per token, so the estimate
  fell just short and forced exactly one geometric grow — reallocating and
  write-barrier-copying the entire (~70k-entry) token array on every large
  parse. The code drifted from its own comment, which already prescribed
  len(src)/5. Restoring /5 removes that grow: BenchmarkParseLarge drops from
  ~11.7ms to ~6.5ms (~44% faster) and its allocated bytes from 11MB to 6.5MB.

- collect() allocated one exact-size NodeList per child list (every element's
  children, every block/if body) — the largest object allocator once
  attributes were slab-backed. Pack them end to end in a shared node arena and
  return capped subslices instead; a consumer appending to node.Children
  reallocates rather than clobbering the next list. Cuts BenchmarkParseLarge
  allocations ~46% (12009 to 6459) at neutral CPU and equal memory, easing GC
  pressure when formatting many files.

Fixtures, fuzz, and race checks all pass unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
posTracker.advance counted newlines with strings.Count on every call, but
profiling showed most calls advance by only 1–2 bytes (skipping a delimiter,
bracket, or single character). At that size the SIMD-accelerated strings.Count
is dominated by its own call and dispatch overhead — counting newlines in a
one-byte string is almost all overhead.

Scan spans of 16 bytes or fewer with an inline byte loop and reserve
strings.Count for the larger runs (raw text, expression and comment bodies)
where the SIMD path pays off. Line numbers are computed identically, so all
fixtures, fuzz, and race checks pass unchanged.
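
The size cutoff can be sketched as below. `countNewlines` and `inlineScanMax` are hypothetical names; the real change lives inside posTracker.advance and also updates position state that is omitted here.

```go
package main

import (
	"fmt"
	"strings"
)

// inlineScanMax is the cutoff from the commit message: spans up to this
// length are scanned with a plain loop.
const inlineScanMax = 16

// countNewlines counts '\n' bytes in s, avoiding the strings.Count call for
// short spans where call and dispatch overhead would dominate.
func countNewlines(s string) int {
	if len(s) <= inlineScanMax {
		n := 0
		for i := 0; i < len(s); i++ {
			if s[i] == '\n' {
				n++
			}
		}
		return n
	}
	return strings.Count(s, "\n")
}

func main() {
	fmt.Println(countNewlines("}"), countNewlines("a\nb\n"),
		countNewlines(strings.Repeat("line\n", 10)))
}
```

Both paths count the same byte, so the result is identical regardless of which branch runs.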

BenchmarkParseLarge improves from ~6.6ms to ~5.8ms (~12% faster, 60→69 MB/s);
advance drops from ~18% to ~8% of parse CPU.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
Once the token buffer was right-sized, profiling showed the remaining lexer
cost was GC-related: the token stream is one large []token, and because token
held two string fields (Lit, Raw) the whole buffer was full of pointers — so
the GC scanned all ~4.5MB of it every cycle and every emit append carried write
barriers (bulkBarrierPreWriteSrcOnly was ~24% of parse CPU at one point).

Every lexed literal is a substring of the source, including the
whitespace-trimmed bodies, since strings.TrimSpace/TrimRight/TrimSuffix return
substrings that share their input's backing memory. An (offset, length) window
therefore loses no information. Store Lit and Raw as int32 offsets and lengths
and recover the strings via Lit(src)/Raw(src) accessors. token is now
pointer-free and shrinks from 64 to 48 bytes, so the buffer is never GC-scanned
and appends need no write barriers.

BenchmarkLexCorpus improves ~15% (63→73 MB/s) with 27% less memory; parse
allocates ~20% less overall (ParseLarge 6.55MB→5.27MB) and emit falls from ~10%
to ~2% of CPU. All fixtures stay byte-identical; race and both fuzz targets
(FuzzLexer, FuzzParser) pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
An 8.5MB html.test build artifact had been committed by accident; ignore
*.test so it cannot recur. The binary itself is removed from this branch's
history in the same push.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
Trim the explanatory comments added with the parsing optimizations down to
what the code needs — drop the embedded profiling narrative and restate the
invariants concisely. Also simplify the tokTwigIdent emit guard:
identEnd > identStart || identEnd > wsStart reduces to identEnd > wsStart
since wsStart <= identStart. No behavior change.
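
The guard simplification can be checked by brute force. The function names below are hypothetical; the two conditions and the `wsStart <= identStart` invariant are taken from the message above. If `identEnd > identStart` holds, then `identEnd > wsStart` holds as well, so the first disjunct never adds a case.

```go
package main

import "fmt"

// oldGuard is the original two-part condition.
func oldGuard(wsStart, identStart, identEnd int) bool {
	return identEnd > identStart || identEnd > wsStart
}

// newGuard is the simplified condition.
func newGuard(wsStart, identEnd int) bool {
	return identEnd > wsStart
}

// guardsAgree compares both guards for every position triple in [0, n)
// that satisfies the invariant wsStart <= identStart.
func guardsAgree(n int) bool {
	for ws := 0; ws < n; ws++ {
		for is := ws; is < n; is++ {
			for ie := 0; ie < n; ie++ {
				if oldGuard(ws, is, ie) != newGuard(ws, ie) {
					return false
				}
			}
		}
	}
	return true
}

func main() {
	fmt.Println(guardsAgree(12))
}
```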

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014hFsBmAJdopn5FqsejcGJu
Soner (shyim) force-pushed the claude/pointer-based-ast-html-bm2jhg branch from cb71d4a to 99dafcd on July 1, 2026 11:06
Soner (shyim) marked this pull request as ready for review on July 1, 2026 11:22