perf(html): speed up twig/HTML parser (~40% faster on large files)#1137

Merged
Soner (shyim) merged 2 commits into next from perf/twig-parser-optimizations
Jun 30, 2026
@shyim

Summary

The Twig/HTML parser in internal/html was allocation-bound: a single-threaded parse spent roughly half its wall-clock in GC. This PR cuts allocations across the lexer and parser, making large-file parsing ~40% faster with no change to parse output.

Methodology note: the multi-core benchmark hid the GC cost (GC ran on spare cores). Measuring with GOMAXPROCS=1 revealed GC was ~50% of wall-clock — so allocation reduction, not micro-optimizing the byte loops, was the real lever. Every change below was kept only if it moved the benchmark; a few experiments were measured and rejected (see bottom).

Benchmarks

Single-threaded (GOMAXPROCS=1), benchstat n=8, vs the pre-PR baseline. Corpus is built from testdata (BenchmarkParseLarge is a ~400 KB concatenation that reflects per-byte cost without small-file call overhead).

| Benchmark | Time | Throughput | Bytes/op | Allocs/op |
|---|---|---|---|---|
| ParseLarge | −39% | 36 → 60 MB/s | −60% (26 → 10.4 MB) | −69% (46.2k → 14.2k) |
| ParseCorpus (small files) | −19% | +23% | −10% | −47% (1445 → 770) |
| LexCorpus | −28% | +39% | −40% | −84% (435 → 71) |

All p=0.000. Run them with:

go test ./internal/html -run='^$' -bench='BenchmarkParseLarge|BenchmarkParseCorpus|BenchmarkLexCorpus' -benchmem
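
For readers who want to reproduce the single-threaded measurement outside `go test`, here is a minimal sketch of the idea. It is not the PR's perf_bench_test.go: `parse`, `benchParse`, and `sink` are hypothetical names, and the parser is replaced by a trivial stand-in. The point is only that pinning GOMAXPROCS to 1 makes GC time show up in wall-clock, while `SetBytes` and `ReportAllocs` produce the MB/s, B/op, and allocs/op columns.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"testing"
)

// sink keeps the compiler from discarding the parse result.
var sink int

// parse is a hypothetical stand-in for the real parser; it only counts '<' bytes.
func parse(src string) int { return strings.Count(src, "<") }

// benchParse benchmarks parse over src with GOMAXPROCS=1, so GC work
// cannot run on spare cores and is visible in the measured time.
func benchParse(src string) testing.BenchmarkResult {
	testing.Init() // needed when calling testing.Benchmark outside "go test"
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(len(src))) // enables the MB/s figure
		b.ReportAllocs()            // enables B/op and allocs/op
		for i := 0; i < b.N; i++ {
			sink = parse(src)
		}
	})
}

func main() {
	src := strings.Repeat("<div>{{ name }}</div>\n", 1000)
	res := benchParse(src)
	fmt.Println(res.String(), res.MemString())
}
```

Under `go test` the same effect is usually obtained by setting the GOMAXPROCS environment variable or passing `-cpu=1`.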

Optimizations

Lexer

  • Pre-size the token slice from measured token density (~1 token / 6 source bytes) — eliminates the geometric reallocations that dominated lexer allocations.
  • Embed posTracker by value (drops a per-parse heap alloc); advance() counts newlines in bulk with strings.Count instead of a byte-by-byte loop.
  • Shrink token 72 → 64 bytes by dropping Column from Pos (24 → 16 bytes). Column is derived lazily from the offset (Pos.ColumnIn) on the cold error path only. The token stream is copied on every peek/advance/emit, so a smaller token directly cuts CPU.
  • SIMD delimiter scan: jump to the next < or { via strings.IndexAny instead of inspecting every byte.
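
The position-tracking bullets can be sketched as follows. This is an illustration under assumptions, not the package's code: the names `Pos`, `ColumnIn`, `posTracker`, and `advance` come from the description above, but the field layout and the 1-based byte-column convention are guesses.

```go
package main

import (
	"fmt"
	"strings"
)

// Pos stores only the byte offset and line. The column is not stored;
// it is recomputed on demand (assumed layout, not the real struct).
type Pos struct {
	Offset int
	Line   int // 1-based
}

// ColumnIn derives the 1-based byte column by scanning back to the
// previous newline. It is intended for the cold error path only.
func (p Pos) ColumnIn(src string) int {
	return p.Offset - strings.LastIndexByte(src[:p.Offset], '\n')
}

// posTracker is small enough to embed by value in a lexer struct.
type posTracker struct {
	src string
	pos Pos
}

// advance moves forward n bytes and counts the skipped newlines in bulk.
func (t *posTracker) advance(n int) {
	t.pos.Line += strings.Count(t.src[t.pos.Offset:t.pos.Offset+n], "\n")
	t.pos.Offset += n
}

func main() {
	src := "{% block a %}\n  <div>\n"
	t := posTracker{src: src, pos: Pos{Line: 1}}
	t.advance(16) // skip to the '<' on the second line
	fmt.Printf("%d:%d\n", t.pos.Line, t.pos.ColumnIn(src)) // prints "2:3"
}
```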

Parser

  • Lock-free lookupTag: the tag registry is populated entirely from init() before any parse, so reads need no RWMutex; storing *TagSpec also avoids a per-lookup struct copy + escape.
  • Drop strings.Fields in block-name parsing (it allocated a slice just to read fields[0]); scan for the first word directly.
  • rawSpan: accumulate raw text as zero-copy src[start:end] spans instead of a strings.Builder, falling back to a Builder only on the rare non-contiguous append.
  • Per-type node slabs (RawNode/ElementNode/TemplateExpressionNode) sized once from the token count — turns ~one mallocgc per node into a slab bump.
  • Scratch-stack node-list building: parseNodesUntil builds each list on a single reused stack (mark/rewind) and collect() copies out an exact-size NodeList, replacing one append-grown slice per list (the largest remaining allocator in the profile).
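
The scratch-stack idea is easiest to see in isolation. Below is a simplified sketch, assuming a `builder` type with `mark`, `push`, and `collect` methods; only `collect` and the mark/rewind terminology come from the description, and `Node`/`NodeList` are reduced stand-ins for the real AST types.

```go
package main

import "fmt"

// Node and NodeList are simplified stand-ins for the package's AST types.
type Node interface{}
type NodeList []Node

// Element is a hypothetical node with children, used only for the demo.
type Element struct {
	Name     string
	Children NodeList
}

// builder owns one scratch stack that every nesting level reuses.
type builder struct{ scratch []Node }

// mark records where the list being built starts on the stack.
func (b *builder) mark() int { return len(b.scratch) }

// push appends a node to the list currently being built.
func (b *builder) push(n Node) { b.scratch = append(b.scratch, n) }

// collect copies scratch[mark:] into an exact-size NodeList and rewinds
// the stack to mark, so the parent level can keep appending.
func (b *builder) collect(mark int) NodeList {
	n := len(b.scratch) - mark
	if n == 0 {
		return nil
	}
	out := make(NodeList, n)
	copy(out, b.scratch[mark:])
	for i := mark; i < len(b.scratch); i++ {
		b.scratch[i] = nil // drop references so rewound slots retain nothing
	}
	b.scratch = b.scratch[:mark]
	return out
}

func main() {
	var b builder
	outer := b.mark()
	b.push("text")

	inner := b.mark() // start of a nested list
	b.push("a")
	b.push("b")
	el := &Element{Name: "div", Children: b.collect(inner)}

	b.push(el)
	top := b.collect(outer)
	fmt.Println(len(top), len(el.Children), cap(el.Children)) // prints "2 2 2"
}
```

Each finished list costs exactly one allocation of the right size, while the scratch stack itself grows only to the deepest combined list length seen so far.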

Correctness

  • Existing internal/html tests pass (these include round-trip / formatter tests).
  • Consumers pass: internal/verifier/... (admin + storefront Twig linters).
  • go vet ./internal/html/... clean.
  • ~16M fuzz executions across FuzzParser and FuzzLexer with no crashes or round-trip mismatches.

Note: Pos is internal to the package (not referenced elsewhere), so dropping Pos.Column has no external API impact.

Experiments measured and rejected

  • Chunked node arena (fixed 64-node chunks) — tripled bytes on small files (allocated full chunks for tiny inputs). Replaced with token-count-sized slabs.
  • fereidani/arena LocalArena[T] — benchmarked head-to-head; statistically tied on large files (p=0.065) and +27% allocs on small files, for a third-party dependency. It's a chunked struct allocator — the same idea as the hand-rolled slabs — and the remaining bottleneck is []Node slice allocation, which an arena-of-structs doesn't address.
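
For comparison with the rejected arenas, a token-count-sized slab can be sketched roughly like this. It is a guess at the shape, not the PR's implementation: the generic `slab` type, `newSlab`, `alloc`, and the overflow policy are all assumptions made for illustration (the PR uses per-type slabs, which may well be hand-written without generics).

```go
package main

import "fmt"

// RawNode is a simplified stand-in for one of the per-type nodes.
type RawNode struct{ Text string }

// slab hands out pointers into one pre-sized backing array, so a node
// allocation is normally just a length bump.
type slab[T any] struct{ buf []T }

// newSlab sizes the backing array once, e.g. from the token count.
func newSlab[T any](capHint int) *slab[T] {
	return &slab[T]{buf: make([]T, 0, capHint)}
}

// alloc returns a pointer to a fresh zero-valued T.
func (s *slab[T]) alloc() *T {
	if len(s.buf) == cap(s.buf) {
		// The estimate was too low: start a new backing array. Pointers
		// handed out earlier keep the old array alive and stay valid.
		s.buf = make([]T, 0, 2*cap(s.buf)+8)
	}
	s.buf = s.buf[:len(s.buf)+1]
	return &s.buf[len(s.buf)-1]
}

func main() {
	tokenCount := 4 // would come from the lexer in a real parser
	raws := newSlab[RawNode](tokenCount)

	a := raws.alloc()
	a.Text = "hello"
	b := raws.alloc()
	b.Text = "world"
	fmt.Println(a.Text, b.Text, cap(raws.buf))
}
```

Unlike fixed 64-node chunks, a tiny input with a handful of tokens only pays for a handful of node slots.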

Remaining ceiling (out of scope)

There's still a ~2× gap to the GC-off floor single-threaded. It's now fundamental to the pointer-based AST — every node is a heap pointer the GC must scan. Closing it would require a flat, index-based node representation (no Node interface, no per-node pointers), a large rewrite touching format.go, TraverseNode, and all linters. Flagged for a future effort.
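
To make the "flat, index-based" alternative concrete, here is one possible shape, sketched purely for illustration. Nothing below exists in the codebase: `NodeID`, `FlatNode`, `Tree`, and their methods are hypothetical. Nodes live in a single slice and refer to each other by index, and text is stored as byte offsets into the source, so the node slice itself holds no pointers for the GC to trace.

```go
package main

import "fmt"

// NodeID indexes into Tree.Nodes; NoNode means "none".
type NodeID int32

const NoNode NodeID = -1

type Kind uint8

const (
	KindRaw Kind = iota
	KindElement
)

// FlatNode contains no pointers: links are indices, text is a byte span.
type FlatNode struct {
	Kind                  Kind
	Start, End            int32 // span in Tree.Src
	FirstChild, LastChild NodeID
	NextSibling           NodeID
}

// Tree owns the source and every node in one contiguous slice.
type Tree struct {
	Src   string
	Nodes []FlatNode
}

// Add appends a node and returns its index.
func (t *Tree) Add(k Kind, start, end int32) NodeID {
	t.Nodes = append(t.Nodes, FlatNode{
		Kind: k, Start: start, End: end,
		FirstChild: NoNode, LastChild: NoNode, NextSibling: NoNode,
	})
	return NodeID(len(t.Nodes) - 1)
}

// AppendChild links child as the last child of parent.
func (t *Tree) AppendChild(parent, child NodeID) {
	p := &t.Nodes[parent]
	if p.FirstChild == NoNode {
		p.FirstChild = child
	} else {
		t.Nodes[p.LastChild].NextSibling = child
	}
	p.LastChild = child
}

// Text returns the source text covered by a node.
func (t *Tree) Text(id NodeID) string {
	n := t.Nodes[id]
	return t.Src[n.Start:n.End]
}

// Children lists the direct children of a node in order.
func (t *Tree) Children(id NodeID) []NodeID {
	var out []NodeID
	for c := t.Nodes[id].FirstChild; c != NoNode; c = t.Nodes[c].NextSibling {
		out = append(out, c)
	}
	return out
}

func main() {
	tree := &Tree{Src: "<p>hi<b>x</b></p>"}
	p := tree.Add(KindElement, 0, 17)
	tree.AppendChild(p, tree.Add(KindRaw, 3, 5))
	tree.AppendChild(p, tree.Add(KindElement, 5, 13))
	for _, c := range tree.Children(p) {
		fmt.Println(tree.Text(c))
	}
}
```

The cost noted above is visible even here: every consumer has to go through the tree and an index instead of holding a node pointer, which is why the formatter, traversal, and linters would all need changes.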

Profiling showed the parser was allocation-bound: a single-threaded parse
spent ~half its wall-clock in GC. These changes cut allocations across the
lexer and parser without changing parse output (verified by the existing
round-trip tests and ~16M fuzz executions).

Lexer:
- Pre-size the token slice from measured token density (~1 token/6 bytes),
  avoiding the geometric reallocations that dominated lexer allocations.
- Embed posTracker by value (drops a per-parse heap alloc) and count
  newlines in bulk via strings.Count instead of byte-by-byte.
- Shrink the token struct 72->64 bytes by dropping Column from Pos
  (24->16); column is derived lazily from the offset on the cold error
  path only. The token stream is copied on every peek/advance/emit, so a
  smaller token directly cuts CPU.
- Jump to the next '<'/'{' delimiter via strings.IndexAny (SIMD) instead
  of inspecting every byte.
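
A rough sketch of the pre-sizing and delimiter-jump items together. The `token` struct and `lex` function are hypothetical and far simpler than the real lexer; only the ~1 token per 6 bytes density and the use of strings.IndexAny come from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a toy token: a byte span plus a flag for delimiter bytes.
type token struct {
	start, end int
	delim      bool
}

// bytesPerToken is the measured density quoted above (~1 token / 6 bytes).
const bytesPerToken = 6

// lex splits src into text runs and single-byte '<' / '{' delimiters.
func lex(src string) []token {
	// One up-front allocation instead of repeated append growth.
	toks := make([]token, 0, len(src)/bytesPerToken+1)

	i := 0
	for i < len(src) {
		// Jump straight to the next delimiter instead of looping per byte.
		j := strings.IndexAny(src[i:], "<{")
		if j < 0 {
			toks = append(toks, token{start: i, end: len(src)})
			break
		}
		if j > 0 {
			toks = append(toks, token{start: i, end: i + j})
		}
		toks = append(toks, token{start: i + j, end: i + j + 1, delim: true})
		i += j + 1
	}
	return toks
}

func main() {
	src := "<div>{{ x }}"
	for _, t := range lex(src) {
		fmt.Printf("%q delim=%v\n", src[t.start:t.end], t.delim)
	}
}
```

If the density estimate is too low for a given input, append still grows the slice as usual; the estimate only removes the common-case reallocations.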

Parser:
- Lock-free lookupTag: the tag registry is populated entirely from init()
  before any parse, so reads need no RWMutex; store *TagSpec to avoid a
  per-lookup struct copy/escape.
- Drop strings.Fields in block-name parsing (allocated a slice to read
  fields[0]); scan for the first word directly.
- rawSpan: accumulate raw text as zero-copy src[start:end] spans instead
  of a strings.Builder, falling back to a Builder only on the (rare)
  non-contiguous append.
- Per-type node slabs (RawNode/ElementNode/TemplateExpressionNode) sized
  once from the token count, turning ~one mallocgc per node into a slab
  bump.
- Scratch-stack node-list building: parseNodesUntil builds each list on a
  single reused stack (mark/rewind) and collect() copies out an
  exact-size NodeList, replacing one append-grown slice per list.
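
The rawSpan item can be illustrated with a small sketch. Only the name rawSpan and the span-then-Builder fallback come from the text above; the fields and the `add`/`String` methods are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// rawSpan accumulates raw text as a single src[start:end] span while the
// appended pieces are contiguous, and copies only when they are not.
type rawSpan struct {
	src        string
	start, end int
	active     bool
	spill      *strings.Builder // nil while the text is one contiguous span
}

// add appends the source range [start, end) to the accumulated text.
func (r *rawSpan) add(start, end int) {
	switch {
	case r.spill != nil:
		r.spill.WriteString(r.src[start:end])
	case !r.active:
		r.start, r.end, r.active = start, end, true
	case start == r.end:
		r.end = end // contiguous: just extend the span, no copy
	default:
		// Rare non-contiguous append: fall back to a Builder.
		r.spill = &strings.Builder{}
		r.spill.WriteString(r.src[r.start:r.end])
		r.spill.WriteString(r.src[start:end])
	}
}

// String returns the accumulated text; in the contiguous case it is a
// substring of src and shares its memory.
func (r *rawSpan) String() string {
	if r.spill != nil {
		return r.spill.String()
	}
	return r.src[r.start:r.end]
}

func main() {
	src := "hello world"

	a := rawSpan{src: src}
	a.add(0, 5)
	a.add(5, 11)
	fmt.Printf("%q copied=%v\n", a.String(), a.spill != nil)

	b := rawSpan{src: src}
	b.add(0, 5)
	b.add(6, 11) // skips the space, so the span is no longer contiguous
	fmt.Printf("%q copied=%v\n", b.String(), b.spill != nil)
}
```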

Benchmarks (single-threaded, benchstat n=8, vs baseline):
  ParseLarge:  -39% time, -60% B/op, -69% allocs/op
  ParseCorpus: -19% time, -47% allocs/op
  LexCorpus:   -28% time, -84% allocs/op

Adds perf_bench_test.go (BenchmarkParseCorpus / LexCorpus / ParseLarge).
@shyim Soner (shyim) marked this pull request as ready for review June 30, 2026 02:45
@shyim Soner (shyim) merged commit 44451af into next Jun 30, 2026
2 checks passed
@shyim Soner (shyim) deleted the perf/twig-parser-optimizations branch June 30, 2026 02:45

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd1a5c9750

Comment thread internal/html/tags.go