perf(html): speed up twig/HTML parser (~40% faster on large files)#1137
Merged
Conversation
Profiling showed the parser was allocation-bound: a single-threaded parse
spent ~half its wall-clock in GC. These changes cut allocations across the
lexer and parser without changing parse output (verified by the existing
round-trip tests and ~16M fuzz executions).
Lexer:
- Pre-size the token slice from measured token density (~1 token/6 bytes),
avoiding the geometric reallocations that dominated lexer allocations.
- Embed posTracker by value (drops a per-parse heap alloc) and count
newlines in bulk via strings.Count instead of byte-by-byte.
- Shrink the token struct 72->64 bytes by dropping Column from Pos
(24->16); column is derived lazily from the offset on the cold error
path only. The token stream is copied on every peek/advance/emit, so a
smaller token directly cuts CPU.
- Jump to the next '<'/'{' delimiter via strings.IndexAny (SIMD) instead
of inspecting every byte.
Parser:
- Lock-free lookupTag: the tag registry is populated entirely from init()
before any parse, so reads need no RWMutex; store *TagSpec to avoid a
per-lookup struct copy/escape.
- Drop strings.Fields in block-name parsing (allocated a slice to read
fields[0]); scan for the first word directly.
- rawSpan: accumulate raw text as zero-copy src[start:end] spans instead
of a strings.Builder, falling back to a Builder only on the (rare)
non-contiguous append.
- Per-type node slabs (RawNode/ElementNode/TemplateExpressionNode) sized
once from the token count, turning ~one mallocgc per node into a slab
bump.
- Scratch-stack node-list building: parseNodesUntil builds each list on a
single reused stack (mark/rewind) and collect() copies out an
exact-size NodeList, replacing one append-grown slice per list.
Benchmarks (single-threaded, benchstat n=8, vs baseline):
ParseLarge: -39% time, -60% B/op, -69% allocs/op
ParseCorpus: -19% time, -47% allocs/op
LexCorpus: -28% time, -84% allocs/op
Adds perf_bench_test.go (BenchmarkParseCorpus / LexCorpus / ParseLarge).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cd1a5c9750
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The Twig/HTML parser in
internal/htmlwas allocation-bound: a single-threaded parse spent roughly half its wall-clock in GC. This PR cuts allocations across the lexer and parser, making large-file parsing ~40% faster with no change to parse output.Methodology note: the multi-core benchmark hid the GC cost (GC ran on spare cores). Measuring with
GOMAXPROCS=1revealed GC was ~50% of wall-clock — so allocation reduction, not micro-optimizing the byte loops, was the real lever. Every change below was kept only if it moved the benchmark; a few experiments were measured and rejected (see bottom).Benchmarks
Single-threaded (
GOMAXPROCS=1),benchstatn=8, vs the pre-PR baseline. Corpus is built fromtestdata(BenchmarkParseLargeis a ~400 KB concatenation that reflects per-byte cost without small-file call overhead).All
p=0.000. Run them with:Optimizations
Lexer
posTrackerby value (drops a per-parse heap alloc);advance()counts newlines in bulk withstrings.Countinstead of a byte-by-byte loop.token72 → 64 bytes by droppingColumnfromPos(24 → 16 bytes). Column is derived lazily from the offset (Pos.ColumnIn) on the cold error path only. The token stream is copied on everypeek/advance/emit, so a smaller token directly cuts CPU.</{viastrings.IndexAnyinstead of inspecting every byte.Parser
lookupTag: the tag registry is populated entirely frominit()before any parse, so reads need noRWMutex; storing*TagSpecalso avoids a per-lookup struct copy + escape.strings.Fieldsin block-name parsing (it allocated a slice just to readfields[0]); scan for the first word directly.rawSpan: accumulate raw text as zero-copysrc[start:end]spans instead of astrings.Builder, falling back to a Builder only on the rare non-contiguous append.RawNode/ElementNode/TemplateExpressionNode) sized once from the token count — turns ~onemallocgcper node into a slab bump.parseNodesUntilbuilds each list on a single reused stack (mark/rewind) andcollect()copies out an exact-sizeNodeList, replacing one append-grown slice per list (the largest remaining allocator in the profile).Correctness
internal/htmltests pass (these include round-trip / formatter tests).internal/verifier/...(admin + storefront Twig linters).go vet ./internal/html/...clean.FuzzParserandFuzzLexerwith no crashes or round-trip mismatches.Note:
Posis internal to the package (not referenced elsewhere), so droppingPos.Columnhas no external API impact.Experiments measured and rejected
fereidani/arenaLocalArena[T]— benchmarked head-to-head; statistically tied on large files (p=0.065) and +27% allocs on small files, for a third-party dependency. It's a chunked struct allocator — the same idea as the hand-rolled slabs — and the remaining bottleneck is[]Nodeslice allocation, which an arena-of-structs doesn't address.Remaining ceiling (out of scope)
There's still a ~2× gap to the GC-off floor single-threaded. It's now fundamental to the pointer-based AST — every node is a heap pointer the GC must scan. Closing it would require a flat, index-based node representation (no
Nodeinterface, no per-node pointers), a large rewrite touchingformat.go,TraverseNode, and all linters. Flagged for a future effort.