perf(html): speed up twig/HTML parser (~40% faster on large files) by shyim · Pull Request #1137 · shopware/shopware-cli

Soner (shyim) · 2026-06-29T07:32:46Z

Summary

The Twig/HTML parser in internal/html was allocation-bound: a single-threaded parse spent roughly half its wall-clock in GC. This PR cuts allocations across the lexer and parser, making large-file parsing ~40% faster with no change to parse output.

Methodology note: the multi-core benchmark hid the GC cost (GC ran on spare cores). Measuring with GOMAXPROCS=1 revealed GC was ~50% of wall-clock — so allocation reduction, not micro-optimizing the byte loops, was the real lever. Every change below was kept only if it moved the benchmark; a few experiments were measured and rejected (see bottom).
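The effect described in the methodology note can be reproduced outside the parser. The following is a minimal sketch, not the PR's measurement harness: `allocHeavy` and `timeWith` are hypothetical names, and the linked-list workload merely stands in for an AST builder. It pins the program to one P with `runtime.GOMAXPROCS(1)` and times the same work with the collector enabled and disabled, so the difference approximates GC cost in wall-clock terms.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// node is a small pointer-carrying object, similar in spirit to an AST node.
type node struct {
	next *node
	val  int
}

// allocHeavy is a hypothetical stand-in for the parser: it builds a linked
// list of n nodes and returns the sum of their values.
func allocHeavy(n int) int {
	var head *node
	for i := 0; i < n; i++ {
		head = &node{next: head, val: i}
	}
	sum := 0
	for p := head; p != nil; p = p.next {
		sum += p.val
	}
	return sum
}

// timeWith runs work under the given GC percent (-1 disables the collector)
// and returns the elapsed wall-clock time. The previous setting is restored.
func timeWith(gcPercent int, work func()) time.Duration {
	old := debug.SetGCPercent(gcPercent)
	defer debug.SetGCPercent(old)
	runtime.GC() // start each run from a freshly collected heap
	start := time.Now()
	work()
	return time.Since(start)
}

func main() {
	// With a single P, GC work cannot hide on spare cores.
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	work := func() {
		for i := 0; i < 20; i++ {
			allocHeavy(200000)
		}
	}
	withGC := timeWith(100, work)
	noGC := timeWith(-1, work)
	fmt.Printf("GC on: %v, GC off: %v\n", withGC, noGC)
}
```

Timings vary by machine, so the sketch prints both numbers rather than asserting a ratio.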

Benchmarks

Single-threaded (GOMAXPROCS=1), benchstat n=8, vs the pre-PR baseline. Corpus is built from testdata (BenchmarkParseLarge is a ~400 KB concatenation that reflects per-byte cost without small-file call overhead).

Benchmark	Time	Throughput	Bytes/op	Allocs/op
ParseLarge	−39%	36 → 60 MB/s	−60% (26 → 10.4 MB)	−69% (46.2k → 14.2k)
ParseCorpus (small files)	−19%	+23%	−10%	−47% (1445 → 770)
LexCorpus	−28%	+39%	−40%	−84% (435 → 71)

All p=0.000. Run them with:

go test ./internal/html -run='^$' -bench='BenchmarkParseLarge|BenchmarkParseCorpus|BenchmarkLexCorpus' -benchmem
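For readers unfamiliar with how the Throughput, Bytes/op, and Allocs/op columns arise, here is a small sketch of a Go benchmark that reports them. It is an assumption about the general shape, not a copy of the PR's perf_bench_test.go: `countTags` is a hypothetical stand-in for the parser entry point, and `testing.Benchmark` is used so the example runs as an ordinary program.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the compiler from discarding the benchmarked call.
var sink int

// countTags is a hypothetical stand-in for the real parse function.
func countTags(src string) int {
	return strings.Count(src, "{%")
}

// benchParse benchmarks countTags over src. SetBytes is what makes the
// framework report MB/s; ReportAllocs adds B/op and allocs/op.
func benchParse(src string) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		b.SetBytes(int64(len(src)))
		for i := 0; i < b.N; i++ {
			sink = countTags(src)
		}
	})
}

func main() {
	src := strings.Repeat("<div>{% block a %}x{% endblock %}</div>\n", 1000)
	r := benchParse(src)
	fmt.Println(r.String(), r.MemString())
}
```

Inside a `_test.go` file the same body would live in a `func BenchmarkX(b *testing.B)` and be selected with the `-bench` pattern shown above.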

Optimizations

Lexer

Pre-size the token slice from measured token density (~1 token / 6 source bytes) — eliminates the geometric reallocations that dominated lexer allocations.
Embed posTracker by value (drops a per-parse heap alloc); advance() counts newlines in bulk with strings.Count instead of a byte-by-byte loop.
Shrink token 72 → 64 bytes by dropping Column from Pos (24 → 16 bytes). Column is derived lazily from the offset (Pos.ColumnIn) on the cold error path only. The token stream is copied on every peek/advance/emit, so a smaller token directly cuts CPU.
SIMD delimiter scan: jump to the next '<' or '{' via strings.IndexAny instead of inspecting every byte.
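The lexer items above can be sketched together in a few lines. This is an illustration under assumptions, not the code from internal/html: the names `pos`, `token`, `columnIn`, `advance`, `nextDelim`, and `newTokenBuf` are hypothetical, columns are counted in bytes, and the one-token-per-six-bytes density is simply the figure quoted in the list.

```go
package main

import (
	"fmt"
	"strings"
)

// pos is a hypothetical compact position with no stored column.
type pos struct {
	Offset int // byte offset into the source
	Line   int // 1-based line number
}

// columnIn derives the 1-based byte column from the offset on demand,
// so only the (cold) error path pays for it.
func (p pos) columnIn(src string) int {
	return p.Offset - strings.LastIndexByte(src[:p.Offset], '\n')
}

// advance moves past chunk, counting its newlines in one bulk call.
func (p pos) advance(chunk string) pos {
	p.Offset += len(chunk)
	p.Line += strings.Count(chunk, "\n")
	return p
}

// token is a hypothetical minimal token.
type token struct {
	Kind  int
	Start pos
}

// newTokenBuf pre-sizes the token slice from an assumed density of
// roughly one token per six source bytes.
func newTokenBuf(src string) []token {
	return make([]token, 0, len(src)/6+1)
}

// nextDelim returns the index of the next '<' or '{' at or after from,
// or len(src) if there is none.
func nextDelim(src string, from int) int {
	if i := strings.IndexAny(src[from:], "<{"); i >= 0 {
		return from + i
	}
	return len(src)
}

func main() {
	src := "hello\n  <div>{{ x }}</div>"
	toks := newTokenBuf(src)

	i := nextDelim(src, 0)
	p := pos{Line: 1}.advance(src[:i])
	toks = append(toks, token{Start: p})

	fmt.Println("delimiter at", i, "line", p.Line, "column", p.columnIn(src))
	fmt.Println("tokens:", len(toks), "capacity:", cap(toks))
}
```

Dropping the stored column trades a little work on error reporting for a smaller struct on every token copy, which is the trade the list describes.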

Parser

Lock-free lookupTag: the tag registry is populated entirely from init() before any parse, so reads need no RWMutex; storing *TagSpec also avoids a per-lookup struct copy + escape.
Drop strings.Fields in block-name parsing (it allocated a slice just to read fields[0]); scan for the first word directly.
rawSpan: accumulate raw text as zero-copy src[start:end] spans instead of a strings.Builder, falling back to a Builder only on the rare non-contiguous append.
Per-type node slabs (RawNode/ElementNode/TemplateExpressionNode) sized once from the token count — turns ~one mallocgc per node into a slab bump.
Scratch-stack node-list building: parseNodesUntil builds each list on a single reused stack (mark/rewind) and collect() copies out an exact-size NodeList, replacing one append-grown slice per list (the largest remaining allocator in the profile).
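Two of the parser items, the per-type slab and the scratch-stack list building, can be sketched as follows. This is a simplified illustration under assumptions rather than the PR's implementation: `rawSlab`, `scratch`, `mark`, `push`, and the one-method `Node` interface are hypothetical, and only `RawNode` and `collect` borrow names from the description.

```go
package main

import "fmt"

// Node is a hypothetical minimal node interface.
type Node interface{ kind() string }

// RawNode is a simplified raw-text node.
type RawNode struct{ Text string }

func (*RawNode) kind() string { return "raw" }

// rawSlab hands out *RawNode values from one pre-sized backing array,
// so most allocations become a slice bump instead of a heap allocation.
type rawSlab struct{ buf []RawNode }

func newRawSlab(n int) *rawSlab {
	return &rawSlab{buf: make([]RawNode, 0, n)}
}

func (s *rawSlab) alloc(text string) *RawNode {
	if len(s.buf) == cap(s.buf) {
		// Slab exhausted: start a fresh array rather than growing in place,
		// so pointers handed out earlier keep referring to live nodes.
		s.buf = make([]RawNode, 0, 2*cap(s.buf)+1)
	}
	s.buf = append(s.buf, RawNode{Text: text})
	return &s.buf[len(s.buf)-1]
}

// scratch is a single reused stack on which every node list is built.
type scratch struct{ stack []Node }

// mark records where the current list begins.
func (s *scratch) mark() int { return len(s.stack) }

func (s *scratch) push(n Node) { s.stack = append(s.stack, n) }

// collect copies everything pushed since mark into an exact-size slice
// and rewinds the stack to mark.
func (s *scratch) collect(mark int) []Node {
	out := make([]Node, len(s.stack)-mark)
	copy(out, s.stack[mark:])
	for i := mark; i < len(s.stack); i++ {
		s.stack[i] = nil // do not keep nodes alive via the scratch stack
	}
	s.stack = s.stack[:mark]
	return out
}

func main() {
	slab := newRawSlab(4)
	var s scratch

	outer := s.mark()
	s.push(slab.alloc("a"))
	inner := s.mark() // a nested list starts here
	s.push(slab.alloc("b"))
	s.push(slab.alloc("c"))
	children := s.collect(inner)
	s.push(slab.alloc("d"))
	top := s.collect(outer)

	fmt.Println(len(children), cap(children), len(top), cap(top))
}
```

Because nested lists share one stack, the only per-list allocation left is the exact-size copy made in `collect`.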

Correctness

Existing internal/html tests pass (these include round-trip / formatter tests).
Consumers pass: internal/verifier/... (admin + storefront Twig linters).
go vet ./internal/html/... clean.
~16M fuzz executions across FuzzParser and FuzzLexer with no crashes or round-trip mismatches.

Note: Pos is internal to the package (not referenced elsewhere), so dropping Pos.Column has no external API impact.

Experiments measured and rejected

Chunked node arena (fixed 64-node chunks) — tripled bytes on small files (allocated full chunks for tiny inputs). Replaced with token-count-sized slabs.
fereidani/arena LocalArena[T] — benchmarked head-to-head; statistically tied on large files (p=0.065) and +27% allocs on small files, for a third-party dependency. It's a chunked struct allocator — the same idea as the hand-rolled slabs — and the remaining bottleneck is []Node slice allocation, which an arena-of-structs doesn't address.

Remaining ceiling (out of scope)

There's still a ~2× gap to the GC-off floor single-threaded. It's now fundamental to the pointer-based AST — every node is a heap pointer the GC must scan. Closing it would require a flat, index-based node representation (no Node interface, no per-node pointers), a large rewrite touching format.go, TraverseNode, and all linters. Flagged for a future effort.

Profiling showed the parser was allocation-bound: a single-threaded parse spent ~half its wall-clock in GC. These changes cut allocations across the lexer and parser without changing parse output (verified by the existing round-trip tests and ~16M fuzz executions).

Lexer:
- Pre-size the token slice from measured token density (~1 token/6 bytes), avoiding the geometric reallocations that dominated lexer allocations.
- Embed posTracker by value (drops a per-parse heap alloc) and count newlines in bulk via strings.Count instead of byte-by-byte.
- Shrink the token struct 72->64 bytes by dropping Column from Pos (24->16); column is derived lazily from the offset on the cold error path only. The token stream is copied on every peek/advance/emit, so a smaller token directly cuts CPU.
- Jump to the next '<'/'{' delimiter via strings.IndexAny (SIMD) instead of inspecting every byte.

Parser:
- Lock-free lookupTag: the tag registry is populated entirely from init() before any parse, so reads need no RWMutex; store *TagSpec to avoid a per-lookup struct copy/escape.
- Drop strings.Fields in block-name parsing (allocated a slice to read fields[0]); scan for the first word directly.
- rawSpan: accumulate raw text as zero-copy src[start:end] spans instead of a strings.Builder, falling back to a Builder only on the (rare) non-contiguous append.
- Per-type node slabs (RawNode/ElementNode/TemplateExpressionNode) sized once from the token count, turning ~one mallocgc per node into a slab bump.
- Scratch-stack node-list building: parseNodesUntil builds each list on a single reused stack (mark/rewind) and collect() copies out an exact-size NodeList, replacing one append-grown slice per list.

Benchmarks (single-threaded, benchstat n=8, vs baseline):
ParseLarge: -39% time, -60% B/op, -69% allocs/op
ParseCorpus: -19% time, -47% allocs/op
LexCorpus: -28% time, -84% allocs/op

Adds perf_bench_test.go (BenchmarkParseCorpus / LexCorpus / ParseLarge).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd1a5c9750


Soner (shyim) added 2 commits June 29, 2026 09:32

fix(html): collapse else-if in lexer boundary scan (gocritic)

cd1a5c9

Soner (shyim) marked this pull request as ready for review June 30, 2026 02:45

Soner (shyim) merged commit 44451af into next Jun 30, 2026
2 checks passed

Soner (shyim) deleted the perf/twig-parser-optimizations branch June 30, 2026 02:45

chatgpt-codex-connector Bot reviewed Jun 30, 2026


Comment thread internal/html/tags.go
