
feat(tailcalls): opt-in return_call codegen via --experimental-tailcalls#6043

Open
ggreif wants to merge 39 commits into master from gabor/wasm-exts-sync

Conversation

@ggreif ggreif commented Apr 22, 2026

Summary

End-to-end opt-in support for wasm tail calls (return_call / return_call_indirect AST in wasm-exts; TailCallPrim IR; producer pass; codegen for direct calls), gated behind the new --experimental-tailcalls flag. With the flag on, all tail calls — both self-recursive and cross-function — lower to wasm return_call, giving bounded stack for mutually-recursive / VM-dispatcher-shaped code, plus a small cycle-count improvement over the existing self-tail-call → loop rewrite. With the flag off, current behaviour is preserved exactly.

This PR was originally scoped as a broader wasm-exts sync to upstream wasm-2.0.2 (SIMD, reference types, multi-memory, …). After confirming that tail calls are not in wasm-2.0.2 (the OCaml interpreter at that release predates the proposal merge — see .claude/plans/wasm-exts-update.md), the tail-call work was carved out as its own self-contained slice on top of the existing wasm-exts. The broader 2.0.2 sync remains future work.

What's delivered

| layer | commit | what |
| --- | --- | --- |
| wasm-exts | f2c69f3e7 | `ReturnCall` / `ReturnCallIndirect` AST variants, smart constructors, encoder (opcodes 0x12 / 0x13), decoder (opcodes removed from the illegal-list) |
| CLI | a7ebc369d | `--experimental-tailcalls` flag wiring in `flags.ml` + `moc.ml` (no behaviour change yet) |
| IR | dd2d8337a | `TailCallPrim of Type.typ list` declaration + exhaustive plumbing through interpreter, type-checker, effect inference, async lowering, both backends |
| codegen | ef2c7fb0c | `compile_classical.ml` and `compile_enhanced.ml` direct-call arm: `is_tail` derived from the prim, emits `ReturnCall <fi>` and skips the trailing `FakeMultiVal.load` when the prim is `TailCallPrim` |
| producer | 4dbdb1712 | `tailcall.ml` gains a third arm: in tail position, when the self-recursion loop-rewrite doesn't apply and `--experimental-tailcalls` is set, emit `TailCallPrim` instead of `CallPrim` |
| Shared-fn fix | 6491d605c | `tailcall.ml` `FuncE` arm gates `tail_pos = true` on `s = Type.Local`. Shared bodies are descended with `tail_pos = false`, so the producer cannot label the wrapper-to-body call as tail-position. Fixes a cleanup-bypass trap that surfaced once `gauss` exercised the post-update-message path. Also passes `--enable-tail-call` to wasm-validate in the test-runner (`test-runner/src/run_test.rs`) |
| flag-gated loop elision | e84769da5 | `tailcall.ml` self-tail-call → loop rewrite is now gated on `not !Flags.experimental_tailcalls`. With the flag on, self-tail calls fall through to `TailCallPrim` and become `return_call`; with the flag off, the legacy loop rewrite still fires |
| bench | fb38c1b24, e24ebc003, 6491d605c, e84769da5 | `test/bench/tailcall.mo`: Hutton/Bahr-style stack VM running `fak` (mutual TCO) + naïve self-recursive `foldLeft` running 5-Year-Old Gauss (loop-rewrite vs `return_call`). Self-documents via `//MOC-FLAG --experimental-tailcalls` |

Empirical (EOP, --experimental-tailcalls on)

| benchmark | shape | cycles |
| --- | --- | --- |
| hutton | mutual-tail-recursive VM dispatcher, fak(10) × 1_000 | 25_416_088 |
| gauss | self-tail-recursive foldLeft (+) 0 [1..100] × 10_000 | 101_400_270 |

Comparisons (same tree, same EOP setting, only the flag flipped):

  • hutton (mutual): −636_000 cycles (~2.5%) vs call. Pure cost-table delta (Call=5 → ReturnCall=3 × dispatch count). Also bounded-stack — VM dispatchers can run forever without growing the wasm stack.
  • gauss (self-tail): −9_120_000 cycles (~8.2%) vs the loop rewrite. The loop rewrite has its own overhead (mutable arg-temps + loop/block wrapper) that return_call skips. Removing the loop rewrite under the flag is cheaper, not just stack-bounded.
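As a sanity check, the quoted percentages follow from the raw counters above (a quick arithmetic sketch; the cycle numbers are copied from the tables, the percentages are derived):

```rust
fn main() {
    // hutton: same tree, flag off vs flag on (counters from the comparison above)
    let (off, on): (i64, i64) = (26_052_088, 25_416_088);
    assert_eq!(off - on, 636_000);
    println!("hutton delta: {} (~{:.2}%)", off - on, 100.0 * (off - on) as f64 / off as f64);

    // gauss: loop rewrite (flag off) vs return_call (flag on)
    let (off, on): (i64, i64) = (110_520_270, 101_400_270);
    assert_eq!(off - on, 9_120_000);
    println!("gauss delta: {} (~{:.2}%)", off - on, 100.0 * (off - on) as f64 / off as f64);
}
```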

Design notes

  • Pipeline order matters. tailcall_optimization runs after async_lowering (pipeline.ml:796-797), so by the time the producer arm sees a CallPrim, awaitable / IC calls have already been desugared. The producer therefore cannot mistakenly tag a shared call as a tail call.
  • Codegen scope: direct calls only. TailCallPrim is honoured in the SR.Const Const.Fun (..., mk_fi, _) arm (known function index → emit ReturnCall <fi>). Closure calls (Type.Local via call_indirect) and shared calls fall through to the regular path even if the prim is TailCallPrim. Extending to return_call_indirect for computed tail-calls is described in the plan as a follow-up.
  • Validation argument. Wasm return_call requires the callee's wasm result type to match the enclosing function's. Motoko's all-multival-via-side-channel ABI gives every wasm function the result type [] (or a single i64), so the constraint holds trivially for direct calls between Motoko functions.
  • Shared-function bodies excluded. A Motoko public func foo() : async () = body compiles to a wasm wrapper of the form message_start ; user-body ; message_cleanup (state-machine transition + GC). The body is not in tail position from the wasm wrapper's perspective. Letting return_call escape the body would skip the cleanup, leaving the lifecycle stuck at InUpdate — the next update message then traps with 'internal error: unexpected state entering InUpdate'. The fix is to descend Shared FuncE bodies with tail_pos = false.
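For concreteness, the wire-level difference between a direct call and a direct tail call is a single opcode byte before the LEB128-encoded function index: 0x10 for `call`, 0x12 for `return_call` (and 0x13 for `return_call_indirect` next to 0x11), matching the encoder commit above. A minimal encoding sketch, with the unsigned-LEB128 helper written out inline for illustration:

```rust
// Emit the byte sequence for a direct wasm call instruction.
// `tail` selects return_call (0x12) over plain call (0x10).
fn encode_direct_call(func_index: u32, tail: bool) -> Vec<u8> {
    let mut bytes = vec![if tail { 0x12 } else { 0x10 }];
    // unsigned LEB128: 7 bits per byte, high bit = continuation
    let mut n = func_index;
    loop {
        let b = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            bytes.push(b);
            break;
        }
        bytes.push(b | 0x80);
    }
    bytes
}

fn main() {
    assert_eq!(encode_direct_call(5, false), vec![0x10, 0x05]);
    assert_eq!(encode_direct_call(5, true), vec![0x12, 0x05]);
    assert_eq!(encode_direct_call(300, true), vec![0x12, 0xac, 0x02]);
}
```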

Future work (scoped out, kept in the plan)

  • Broader wasm-exts 2.0.2 sync — the original ambition of this branch (SIMD, reference types, multi-memory, exception handling, …). Tail calls are not in 2.0.2 anyway, so the slice was carved out independently.
  • Source-level annotation (with tailcall) — per-call surface knob (presence-form via record-punning on a let tailcall = true) with a compile-time Bool-const constraint and a tail-position warning. Useful workaround for mo:core recursive algorithms.
  • return_call_indirect for computed tail-calls — VM dispatchers that index into a handler table, CPS continuations, dynamic dispatch in interpreters. Requires extending Closure.call_closure. The wasm-exts AST/codec already supports it; codegen lowering does not.

See .claude/plans/wasm-exts-update.md for the full design treatment, including open design questions for the annotation and the IC pricing measurements that informed the empirical comparison.

Related

🤖 Generated with Claude Code

@ggreif ggreif self-assigned this Apr 22, 2026

github-actions Bot commented Apr 22, 2026

Comparing from f8ad7b0 to a3ff621:
The produced WebAssembly code seems to be completely unchanged.
In terms of gas, no changes are observed in 5 tests.
In terms of size, no changes are observed in 5 tests.

@ggreif ggreif changed the title Syncing of wasm-exts to provide new instructions (e.g. SIMD!) chore: syncing of wasm-exts to provide new instructions (e.g. SIMD!) Apr 22, 2026
ggreif added a commit that referenced this pull request Apr 22, 2026
Moved to PR #6043 (wasm-exts sync), which will be the natural home for
constant-tracking reasoning as new instructions come in. No content
loss — the file is reproduced verbatim on gabor/wasm-exts-sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the plan file here from PR #5961's branch where it sat as a
leftover companion doc. Outlines the forthcoming `wasm-exts` sync
work — pulling upstream instruction support forward so codegen can
emit SIMD, newer ref-types, etc. This PR is the natural home.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggreif ggreif force-pushed the gabor/wasm-exts-sync branch from c3b35dc to 35f58d0 Compare April 22, 2026 11:17
ggreif added a commit that referenced this pull request Apr 22, 2026
Moved to PR #6043 (wasm-exts sync), which is the natural home for the
upcoming instruction-catchup work. No content loss — file is
reproduced verbatim on gabor/wasm-exts-sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ggreif and others added 11 commits May 4, 2026 17:56
Adds AST variants, smart constructors, encoder (opcodes 0x12 / 0x13),
and decoder cases for the tail-call proposal. The corresponding slots
are no longer in the illegal-opcode list of the binary decoder.

Unblocks `Tailcall.transform`'s standing TODO at
`src/ir_passes/tailcall.ml:13-14` ("can easily be extended to
non-self tail calls, once supported by wasm") — the AST level can
now express what the optimiser would emit. `pocket-ic` runtime
support is the remaining open question (filed in PR #6043 thread).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ine)

A small stack VM whose dispatcher is mutually tail-recursive: `step`
matches the opcode and tail-calls a per-opcode handler (`opPush`,
`opMul`, …); each handler tail-calls `step` again. Today's
`Tailcall` IR pass only rewrites *self* tail-calls into loops, so each
cross-function hop currently allocates a fresh frame.

This is the textbook shape for general TCO: first-order, no closures,
mutually tail-recursive. Once the optimiser is extended to emit
`return_call` for non-self tail calls (TODO at \`tailcall.ml:13-14\`,
unblocked by wasm-exts \`ReturnCall\` landing on this branch), the
diff against the committed cycle count will be the measurement.

Baseline (1000 iterations of fak(10)): 15_439_607 cycles ≈ 15.4k/iter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
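The dispatcher shape the bench exercises can be sketched outside Motoko (here in Rust; `step`, `op_push`, `op_mul` mirror the handler names described above, but this is an illustrative reconstruction, not the bench itself). Without tail-call elimination, every `step` → handler → `step` hop leaves a frame behind:

```rust
// A toy stack VM whose dispatcher and handlers are mutually
// tail-recursive: step -> handler -> step -> ...
#[derive(Clone, Copy)]
enum Op { Push(i64), Mul }

fn step(code: &[Op], pc: usize, stack: &mut Vec<i64>) -> i64 {
    match code.get(pc) {
        None => stack.pop().unwrap_or(0),                  // program done: top of stack
        Some(Op::Push(n)) => op_push(code, pc, stack, *n), // tail call into a handler
        Some(Op::Mul) => op_mul(code, pc, stack),          // tail call into a handler
    }
}

fn op_push(code: &[Op], pc: usize, stack: &mut Vec<i64>, n: i64) -> i64 {
    stack.push(n);
    step(code, pc + 1, stack) // tail call back into the dispatcher
}

fn op_mul(code: &[Op], pc: usize, stack: &mut Vec<i64>) -> i64 {
    let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
    stack.push(a * b);
    step(code, pc + 1, stack) // tail call back into the dispatcher
}

fn main() {
    // fak(10) as straight-line VM code: push 1..=10, then 9 multiplies
    let mut code: Vec<Op> = (1..=10).map(Op::Push).collect();
    code.extend(std::iter::repeat(Op::Mul).take(9));
    let mut stack = Vec::new();
    assert_eq!(step(&code, 0, &mut stack), 3_628_800);
}
```

With wasm `return_call` each of those hops reuses the current frame, which is what makes the dispatcher stack-bounded.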
Just the flag, currently a no-op. IR primitive (`TailCallPrim`) and
backend codegen (emitting `return_call` / `return_call_indirect`)
land in follow-up commits — wasm-exts already grew the AST-level
support in `f2c69f3e7`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new `prim` constructor `TailCallPrim of Type.typ list` as a
sibling of `CallPrim`, marking a function call that occurs in tail
position (cf. wasm `return_call`). For now nothing produces it; this
commit only wires the variant exhaustively through the consumers so
the IR type and all passes remain total:

- `ir.ml`: declaration + `t_typ` traversal
- `arrange_ir.ml`: pretty-print as "TailCallPrim"
- `ir_effect.ml`, `check_ir.ml`, `interpret_ir.ml`: handled
  identically to `CallPrim` via or-patterns
- `async.ml`: tail-calls to awaitable functions fall into the same
  async-lowering arm (defensive — a producer should not emit them,
  but if it did, the lowering would still be correct)
- `compile_classical.ml`, `compile_enhanced.ml`: both backends
  treat `TailCallPrim` like `CallPrim`; backend specialisation to
  emit `return_call` lands next.

The `tailcall.ml` IR pass needs no change yet — its catch-all
`PrimE (p, es)` arm handles `TailCallPrim` correctly (children
descended in non-tail context, no rewrite). The producer extension
that converts non-self tail-positioned `CallPrim` into
`TailCallPrim` (gated by `--experimental-tailcalls`) is the next
step, alongside the backend codegen.

`test/bench/tailcall.mo` baseline cycles unchanged (verified
against the committed .ok), confirming this is a pure plumbing
commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
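The or-pattern plumbing style is easy to picture outside OCaml too. A hedged Rust analogue (`Prim` and `arity` are invented names for illustration; the real pass lives in the OCaml files listed above):

```rust
// Hypothetical IR primitive: TailCallPrim is a sibling of CallPrim
// carrying the same payload, so consumers that don't care about
// tail-ness bind both variants with one match arm.
#[derive(Debug, PartialEq)]
enum Prim {
    CallPrim(Vec<String>),     // type instantiation, simplified to names
    TailCallPrim(Vec<String>),
}

fn arity(prim: &Prim) -> usize {
    match prim {
        // or-pattern: handled identically to CallPrim
        Prim::CallPrim(insts) | Prim::TailCallPrim(insts) => insts.len(),
    }
}

fn main() {
    assert_eq!(arity(&Prim::CallPrim(vec!["Nat".into()])), 1);
    assert_eq!(arity(&Prim::TailCallPrim(vec!["Nat".into()])), 1);
}
```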
Both backends now branch on the prim variant in the call arm. When the
prim is `TailCallPrim` and the callee is a known function index
(`SR.Const Const.Fun (..., mk_fi, _)`), emit `ReturnCall <fi>` and
omit the trailing `FakeMultiVal.load` (control never returns to that
point). Otherwise — closure calls (`Type.Local` via `call_indirect`)
and shared/IC calls — fall through to the regular call path even when
the prim is `TailCallPrim`; the producer should not emit those, but
falling through is semantics-preserving if it does.

Validation note: wasm `return_call` requires the callee's result type
to match the enclosing function's. Motoko's all-multival-via-side-channel
ABI gives every wasm function the result type `[]`, so the constraint
is trivially satisfied for direct calls between Motoko functions.

Dormant for now: the `tailcall.ml` IR pass does not yet produce
`TailCallPrim`. The bench cycle count is unchanged (verified against
the committed .ok). The producer extension — emit `TailCallPrim` for
non-self tail-positioned calls when `--experimental-tailcalls` is set
— is the final piece.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the `Tailcall` pass with a third arm. Behaviour:

- Self tail-call with matching type instantiation → loop rewrite
  (existing path; strictly cheaper than `return_call` for self).
- Otherwise, in tail position, with `--experimental-tailcalls` set →
  emit `PrimE (TailCallPrim insts, ...)`. The codegen path already
  in place lowers it to wasm `return_call`.
- Else → ordinary `CallPrim` (no behaviour change).

The pipeline order `async_lowering ; tailcall_optimization`
(`pipeline.ml:796-797`) means awaitable calls have already been
desugared by the time we see them, so this arm cannot tag a shared
or local-async call as `TailCallPrim`.

Measured against `test/bench/tailcall.mo` (Hutton VM, fak 10 ×1000):

  baseline (no flag):            15_439_607 cycles
  with --experimental-tailcalls: 25_416_088 cycles

So this is currently a *regression* on the IC instruction counter
(~65% more cycles per mutual dispatch hop). The codegen is correct
(`fak10 = 3_628_800` in both). The cause is not `moc`-side: the
emitted `return_call` replaces a `call` with no surrounding code
change, yet the IC's wasmtime metering charges `return_call` more
heavily than `call` followed by an implicit return. So the
infrastructure is ready, but the actual win has to wait on either
runtime/metering improvements or selective producer heuristics.

Default behaviour is unchanged — the flag is opt-in. The committed
bench .ok stays at the baseline number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "What We Do NOT Need (initially)" bullet for tail calls is now
out of date — the slice was delivered ahead of the broader 2.0.2
sync. Flips the bullet and adds a § *Tail-call instructions
(delivered)* covering the commit chain, the design choices
(pipeline ordering, direct-call-only scope, validation argument,
why self-tail still loops), and the IC instruction-counter
finding (`return_call` is ~65% pricier per hop, so the flag is a
bounded-stack opt-in rather than a perf knob).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Records the design idea for a per-call source-level opt-in. Captures:

- Motivation: per-call granularity matches the property
  `return_call` provides (bounded stack at metered cost); equally
  valuable as a declarative diagnostic à la Scala `@tailrec`. Near-
  term workaround for `mo:core` recursive algorithms while the IC's
  per-instruction `return_call` cost remains elevated.
- Surface syntax: both `(with tailcall = true)` and the punning
  `(with tailcall)` form, neither requiring grammar changes.
- Typecheck constraint: must be a compile-time-known Bool.
- Lowering: bypass the flag-gated producer arm, go straight to
  `TailCallPrim` and reuse existing codegen.
- Four open design questions: flag interaction, where to run the
  tail-position warning, self-recursion behaviour, indirect calls.

Marked *proposed* (not delivered). Existing "Future work" bullets
restructured so the annotation is the headline next step and
"producer heuristics" is explicitly subsumed by it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ail-calls

Reframes the indirect-tail-call follow-up around its actual use case —
*computed* tail-calls, where the callee is a value chosen at runtime
rather than a statically-known name. That's the complement of the
delivered direct-call path: VM dispatchers that index into a handler
table, CPS-transformed continuations, dynamic-dispatch interpreters.

Section covers:

- Motivation: what direct `return_call` *can't* do.
- Codegen plumbing: `Closure.call_closure` gains an `is_tail`
  parameter; both backends touched symmetrically.
- Wasm validation precondition: the type-table entry's result type
  must match the enclosing function's. Motoko's uniform ABI satisfies
  this in practice, but emission must enforce it explicitly.
- Bench coverage: a sibling `tailcall-computed.mo` storing opcode
  handlers as closures isolates the indirect-call premium against the
  existing direct-call bench.
- Cycle-cost expectation: unknown without measurement; `call_indirect`
  is already pricier than `call`, so the relative tail-call premium
  may differ.
- Interaction with `(with tailcall)`: the annotation's "direct only"
  warning becomes unnecessary once this lands.

Existing "Future work" bullet collapses to a forward reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ggreif and others added 6 commits May 5, 2026 11:22
…lls`

Folds the experimental-tailcalls flag directly into the source so the
bench unconditionally measures the TCO codegen path — what it's named
for. Re-accepts the .ok against the EOP+TCO baseline:

  cycles = 25_416_088   (was 15_439_607)

The earlier 15.4M number came from running before the producer +
backend codegen landed AND with run-test's classical-persistence
default; the new 25.4M is EOP + TCO, matching what `make tailcall.only`
in this repo's bench Makefile actually exercises.

Side-effect of TCO codegen: wasm-validate doesn't know `return_call`
(opcode 0x12), so its expected output is the error itself (committed
as `tailcall.valid.ok` / `.valid.ret.ok`). Could pass
`--enable-tail-call` to wasm-validate in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Empirical: IC instruction-counter cost" section said the flag
caused a ~65% cycle regression. That was a methodology error: the
"baseline" 15.4M came from a committed .ok captured before the
producer + backend landed, and `run-test`'s default persistence
mode (classical) differs from the bench Makefile's (EOP).

Honest same-tree, same-EOP-setting comparison:

| build | cycles |
| --- | --- |
| no `--experimental-tailcalls` | 26_052_088 |
| `--experimental-tailcalls` | 25_416_088 |
| delta | -636_000 (TCO cheaper, ~2.5%) |

Reframes the section: TCO is mildly *cheaper* on the cycle axis, in
agreement with the cost table (Call=5, ReturnCall=3). The flag's
primary value remains bounded-stack guarantees for mutual recursion
/ VM dispatchers / CPS, with the cycle reduction as a small bonus.

Bench .ok was relocked to the with-flag EOP baseline in `e24ebc003`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…` bench

The `FuncE` arm in `tailcall.ml` set `tail_pos = true` for every
function body unconditionally. For Shared functions (post async-
lowering: `Shared+Replies` for awaitables, `Shared+Returns` for
one-shot oneway updates) that's wrong: the wasm-level wrapper has
`message_start ; user-body ; message_cleanup` (state-machine
transition + GC) — cleanup runs *below* the body, so the body is
not in tail position from the wasm function's perspective.

With `--experimental-tailcalls` set, the producer arm would emit
`TailCallPrim` for the body's last call and codegen would lower
to `return_call $$lambda.N`, bypassing the cleanup. The lifecycle
state stays `InUpdate`, and the next message traps with
"internal error: unexpected state entering InUpdate" when the
runtime tries `Idle → InUpdate` again.

Fix: gate `tail_pos` on `s = Type.Local`. Shared bodies are
descended with `tail_pos = false` (via `exp` instead of
`tailexp`), so the producer can no longer label the wrapper-to-
body call as tail-position. The existing self-tail-recursion → loop
rewrite for Local functions is preserved unchanged.

Also adds the `gauss` bench: naïve self-recursive `foldLeft (+) 0`
over [1..100] × 10_000. Complements the existing VM bench by
exercising the loop-rewrite path (`tailcall.ml:185-200`) — today
both flag-on and flag-off compile foldLeft to a wasm `loop`. When
the loop-rewrite is later removed in favour of uniform
`return_call` codegen, the cycle delta on `gauss` will be the
load-bearing measurement.

Passes `--enable-tail-call` to `wasm-validate` in the test runner so
benches that emit `return_call` no longer trip on the validator
side. Drops the `tailcall.valid.ok` / `.valid.ret.ok` files that
were capturing the validator's `unexpected opcode: 0x12` error —
no longer needed.

Bench numbers (EOP, `--experimental-tailcalls`):

  go    (VM, mutual TCO):           25_416_088 cycles / 1_000 iter
  gauss (loop-rewritten foldLeft): 110_520_270 cycles / 10_000 iter

The `gauss` bench is what surfaced the codegen bug above — without
it the trap would not have fired (the existing `go` alone runs as
the *first* update message and hits the cleanup-bypass invisibly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tailcalls`

When the flag is on, the existing self-tail-call → loop rewrite is
skipped, so self-recursive tail calls fall through to the
`TailCallPrim` arm and codegen emits `return_call` instead of the
`loop { … local.set; br 0 }` machinery. When the flag is off, the
loop rewrite still fires (current default behaviour preserved).

Measured on `test/bench/tailcall.mo` `gauss` (naïve self-recursive
`foldLeft (+) 0 [1..100]` × 10_000):

  loop-rewrite (flag off / today): 110_520_270 cycles
  return_call  (flag on, new):     101_400_270 cycles
  delta:                            -9_120_000 (~8.2% cheaper)

The loop rewrite has its own overhead — it copies args into mutable
temps and adds a `loop` / `block` wrapper — and per the IC's cost
table, `return_call` (3) plus the wasm-level arg-passing it would do
anyway beats that overhead in this benchmark. So removing the loop
rewrite in favour of uniform `return_call` codegen is genuinely
cheaper, not just stack-bounded.

The flag now means: "use `return_call` for ALL tail calls, both
self-tail and cross-function." The loop rewrite remains as the legacy
fallback for flag-off.

Bench .ok re-locked at `gauss = 101_400_270` (with-flag baseline);
`go` unchanged at `25_416_088`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggreif ggreif changed the title chore: syncing of wasm-exts to provide new instructions (e.g. SIMD!) feat(tailcalls): opt-in return_call codegen via --experimental-tailcalls May 5, 2026
ggreif and others added 2 commits May 5, 2026 13:01
…numbers

Marks the tail-call slice on `gabor/wasm-exts-sync` as complete:

- Stack table: adds the Shared-fn fix + gauss-bench commit
  (`6491d605c`) and the flag-gated loop-elision commit
  (`e84769da5`).
- Design notes: replaces "self-recursion still loops" with the
  Shared-bodies-excluded fix and the flag-gated self-tail
  loop-rewrite story.
- Empirical: adds the `gauss` table — self-tail-recursion under
  the flag is **8.2% cheaper** than the loop rewrite. Combined
  with mutual TCO's 2.5% (`go`), the flag is a strict win on
  every measured shape, contradicting the intuition that "loop
  rewrite is cheaper than return_call for self-tail."
- Future work: drops the now-superseded "producer heuristics"
  bullet, adds a "default-on `--experimental-tailcalls`" item
  (flip the default + remove the loop rewrite once IC support is
  universal — the empirical data already favours it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Names the bench after what it actually exercises — the Hutton/Bahr
stack VM. `hutton` (mutual-tail-recursive VM) and `gauss`
(self-tail-recursive foldLeft) now form a coherent named pair.

Plan doc updated to track the rename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ggreif commented May 5, 2026

@crusso — pinging you on the tail-call slice. Self-contained and opt-in behind --experimental-tailcalls; the PR description and .claude/plans/wasm-exts-update.md carry the full design. Two empirical wins on the IC instruction counter under the flag:

| bench | shape | flag-off → flag-on | delta |
| --- | --- | --- | --- |
| hutton | mutually-tail-recursive Hutton/Bahr stack VM, fak(10) × 1k | 26.05M → 25.42M | −2.5% |
| gauss | self-tail-recursive foldLeft (+) 0 [1..100] × 10k | 110.52M → 101.40M | −8.2% |

So return_call beats both ordinary call (mutual case) and the existing self-tail → loop rewrite (gauss case) on cycles, contradicting the intuition that "loop rewrite is strictly cheaper than return_call for self-tail."

Two concrete future directions worth noting alongside what's already in the plan (§ Source-level annotation (with tailcall) and § return_call_indirect for computed tail-calls):

  1. return_call_indirect for tail-position invocation of dynamic closures. Today the codegen specialises only the direct-call arm (SR.Const Const.Fun (..., mk_fi, _) → ReturnCall <fi>). Closure-call sites (_, Type.Local via Closure.call_closure) silently degrade to non-tail call_indirect even when the IR carries TailCallPrim. Extending Closure.call_closure to take an is_tail parameter and conditionally emit return_call_indirect would close that gap — the wasm-exts AST/codec already grew the variant, only the lowering is missing. Now also unblocked on the IC side: "fix: instructions for Operator::ReturnCallIndirect" (dfinity/ic#10086) just merged, cutting the Operator::ReturnCallIndirect cost from 60 to 6 (a 10× reduction), so once a pocket-ic release picks it up, indirect tail-calls should measure at parity with ordinary call_indirect.

  2. Destination-passing style for TCMC (Tail Call Modulo Cons). Functions like map f (x :: xs) = f x :: map f xs are not in tail position under standard TCO (the recursive call sits inside a cons constructor), and currently stack-overflow on long lists. The OCaml-5 [@tail_mod_cons] / Koka / Lean trick: allocate the constructor ?(f(x), <hole>) eagerly on the heap, pass the address of <hole> to the recursive call as a destination argument, and have the recursive call write its result into the destination slot before tail-calling itself. The recursive call thereby becomes tail (modulo a heap write), and return_call can apply uniformly. Implementation would touch the IR (a pre-pass recognising tail-mod-cons patterns), the calling convention (extra destination operand), and the allocator (exposing unfilled slots through the write barrier). Complementary to (1): (1) handles "tail call to a computed callee," (2) handles "tail call to self with a constructor in the way."
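The destination-passing idea in (2) can be sketched in safe Rust (an illustrative reconstruction, not Motoko codegen: `Node` and `map_dps` are invented names, and the recursive call below stands in for the `return_call` a TCMC-aware compiler would emit — the point is that the cons cell is allocated eagerly with an empty `next` hole that the recursive call fills in):

```rust
// Singly linked list with an Option<Box<...>> "hole" in each cell.
struct Node { val: i64, next: Option<Box<Node>> }

// map in destination-passing style: the caller hands us the slot
// (the "hole") our result must be written into.
fn map_dps(f: &impl Fn(i64) -> i64, xs: &[i64], dest: &mut Option<Box<Node>>) {
    if let Some((&x, rest)) = xs.split_first() {
        // allocate the cons eagerly with an unfilled next slot
        *dest = Some(Box::new(Node { val: f(x), next: None }));
        // recursive call with the fresh hole as destination — this call
        // is now in tail position (modulo the heap write just done), so
        // return_call could apply uniformly
        map_dps(f, rest, &mut dest.as_mut().unwrap().next);
    }
}

fn main() {
    let mut head = None;
    map_dps(&|x| x * 2, &[1, 2, 3], &mut head);
    let mut out = Vec::new();
    let mut cur = head;
    while let Some(node) = cur {
        out.push(node.val);
        cur = node.next;
    }
    assert_eq!(out, vec![2, 4, 6]);
}
```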

Together with the (with tailcall) annotation idea from the plan, the long-term shape is: explicit user intent → producer pass classifies every recursive call site (direct, indirect, or modulo-cons) → uniform return_call / return_call_indirect lowering → bounded stack everywhere, default-on once IC support is universal.

Happy to chase any of these as separate follow-up PRs if there's interest.

ggreif and others added 2 commits May 5, 2026 14:40
Pulls the conditional inside `G.i` so we just pick the constructor
(`ReturnCall` vs `Call`) without duplicating the surrounding
`G.i (...)` wrapper. The trailing `FakeMultiVal.load` is now
unconditional: after `return_call` it sits in unreachable code,
which the wasm validator accepts via the post-terminator
polymorphic-stack rule. For arity ≤ 1 (the common case)
`FakeMultiVal.load` is `G.nop`, so this changes nothing on the
wire; for arity > 1 it adds a few `global.get` bytes that never
execute.

Per @ggreif's review on PR #6043.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test-runner's `extract_directive` (`test-runner/src/run_test.rs`)
does a substring-find on `//MOC-FLAG` per line, with no line-start
anchor, so the prose mention `//MOC-FLAG --experimental-tailcalls` in
the top-of-file comment was being parsed as a second directive — the
trailing backtick after the matched substring then landed inside
moc's flag string, and moc rejected `` --experimental-tailcalls` ``
as an unknown option.

Cheapest fix: rephrase the prose so the literal `//MOC-FLAG`
substring no longer appears. The proper fix is to anchor
`extract_directive` on line-start; left as a follow-up since this
PR shouldn't change the test-runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
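The proposed follow-up is simple to sketch (hedged: `extract_directive`'s real signature in `run_test.rs` may differ; this only shows the line-start anchoring):

```rust
/// Return the arguments of a directive, but only when the directive
/// starts the line (modulo leading whitespace) — prose that merely
/// mentions the literal substring no longer matches.
fn extract_directive<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    line.trim_start().strip_prefix(key).map(str::trim)
}

fn main() {
    assert_eq!(
        extract_directive("//MOC-FLAG --experimental-tailcalls", "//MOC-FLAG"),
        Some("--experimental-tailcalls")
    );
    // a prose mention mid-line is ignored
    assert_eq!(
        extract_directive("self-documents via its MOC FLAG comment", "//MOC-FLAG"),
        None
    );
}
```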
@ggreif ggreif added performance Affects only gas usage or code size labels May 5, 2026

crusso commented May 5, 2026

Description all sounds sensible but haven't looked at code at all. Impressive!


ggreif commented May 5, 2026

Description all sounds sensible but haven't looked at code at all. Impressive!

The code is surprisingly straightforward too!

cycle axis, so the migration is a strict improvement plus
bounded-stack.

## Source-level annotation: `(with tailcall)` — *proposed*

As soon as the cycle counts are stabilised and the tail-call instructions are consistently cheaper than the stack-eaters, we'll flip the default for --experimental-tailcalls. Then we'll remove the flag from the algorithm (and with it the loop emulation) and reuse the flag for TCMC and other clever stuff.

sensible recommendation for the `mo:core` library or only for niche
deep-recursion cases.

### Interaction with `(with tailcall)` annotation

@ggreif ggreif May 5, 2026


When the return_call* is cheaper (and better) in all regards, (with tailcall) won't make sense.


It might make sense to understand it as a directive: error out if the tail-call is not realisable by the compiler mid-end transform.

@ggreif ggreif left a comment


Do we want to document the flag to the user?

ggreif and others added 6 commits May 5, 2026 21:52
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`#6043` shipped tail-call codegen for direct `call`s only — closure-
dispatched calls (anything reached via `call_indirect`) silently
degraded to non-tail dispatch. mccarthy-style mutually tail-recursive
`async* Bool` chains stack-overflow at ~1k hops because every `await*`
goes through the closure dispatch path.

Three tightly coupled changes flip indirect tail-calls on:

1. `Closure.call_closure` (both backends) gains an optional
   `~is_tail` parameter and emits `ReturnCallIndirect` when set.
2. The `_, Type.Local` arm in `compile_*.ml` (the closure-call
   producer in `CallPrim _ | TailCallPrim _ as cp`) threads
   `is_tail` through.
3. **The linker.** `linkModule.ml`'s `rename_types` pass renumbers
   type-table indices when merging RTS + user module — and its
   instruction matcher handled `CallIndirect` but NOT
   `ReturnCallIndirect`, so the new tail-call sites kept stale
   pre-merge type indices and the binary failed wasmtime's
   validator with bogus type signatures (e.g. `(func (result i64))`
   on a 4-i64-no-result call site).

Validates and runs end-to-end on the IC: the new `mccarthy` row in
`test/bench/tailcall.mo` (mutually tail-recursive `isEven`/`isOdd`
in `async* Bool`, chained via `await*`) executes 100k mutual hops
in ~547 cycles each, no stack growth.

`tailcall.valid.{ok,ret.ok}` capture wabt's stricter (and arguably
buggy) reading of `return_call_indirect` under `table64`: wabt
accepts an i64 table-index for `call_indirect` but expects i32 for
`return_call_indirect`. wasmtime treats them symmetrically and
accepts our binary. Tracking these messages keeps the test honest
to the toolchain we ship with — if wabt fixes the asymmetry the
.ok files will need re-acceptance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the indirect-tail-call work out of "future work" / "proposed"
and into the delivered table. Update the codegen-scope note to
reflect that closure dispatch now also emits `ReturnCallIndirect`,
and record the linker-side fix (the bug that hid the codegen change
behind a wasmtime validation error).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ggreif commented May 6, 2026

Update: extending TCO to indirect calls + linker fix

Spotted a real gap while drafting a follow-up bench: the original PR's TCO covered direct call only — closure-dispatched calls (anything reached via call_indirect) silently degraded to non-tail dispatch. Mutually tail-recursive async* Bool chains stack-overflowed at ~1k hops because every await* goes through the closure dispatch path (the lambda async-lowering produces from [v_ret; v_fail; v_clean] -->* call user_body krb).

Fixed in commit 113d56dc9 plus plan housekeeping in c3fb49ed3. Three coupled changes:

  1. Closure.call_closure (both backends) gains an optional ~is_tail parameter and emits ReturnCallIndirect when set.
  2. The _, Type.Local arm in compile_*.ml (the closure-call producer in CallPrim _ | TailCallPrim _ as cp) threads is_tail through.
  3. The linker. linkModule.ml's rename_types pass renumbers type-table indices when merging RTS + user module — and its instruction matcher handled CallIndirect but not ReturnCallIndirect, so the new tail-call sites kept stale pre-merge type indices and the binary failed wasmtime's validator with bogus type signatures (e.g. (func (result i64)) on a 4-i64-no-result call site). One-line fix.
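The linker bug in (3) is the subtle one. As a minimal sketch (stand-in types only, not the real `linkModule.ml`): a type-index renumbering pass has to rewrite every instruction that embeds a type-table index, and `ReturnCallIndirect` carries one exactly like `CallIndirect` does, so omitting it from the match silently preserves stale pre-merge indices.

```ocaml
(* Illustrative only: a toy instruction type standing in for the real
   wasm-exts AST. The fixed pass must map the type index in *both*
   indirect-call forms; the bug was that only CallIndirect was covered. *)
type instr =
  | CallIndirect of int        (* type-table index *)
  | ReturnCallIndirect of int  (* same shape; new tail-call form *)
  | Other

let rename_types f = function
  | CallIndirect ty -> CallIndirect (f ty)
  | ReturnCallIndirect ty -> ReturnCallIndirect (f ty)  (* the one-line fix *)
  | Other -> Other

let () =
  (* After merging RTS + user module, indices shift; both forms must follow. *)
  let shift ty = ty + 10 in
  assert (rename_types shift (CallIndirect 3) = CallIndirect 13);
  assert (rename_types shift (ReturnCallIndirect 3) = ReturnCallIndirect 13);
  print_endline "ok"
```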

New mccarthy bench

Added a third row to test/bench/tailcall.mo: mutually tail-recursive isEven/isOdd in async* Bool, chained via await*.

| Bench | Mode | Cycles |
| --- | --- | --- |
| hutton | direct `call` → `return_call` | 25,416,088 |
| gauss | self-recursive `foldLeft` → `return_call` | 101,400,270 |
| mccarthy | indirect `call_indirect` → `return_call_indirect` (new) | 54,700,107 |

100k mutual hops, no stack growth. ~547 cycles per hop.
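For flavour, here is the shape of the bench as a hedged OCaml sketch. The real bench is Motoko `async* Bool` chained via `await*` (which is what forces closure dispatch); this only illustrates the mutual-tail-recursion pattern that needs genuine cross-function tail calls to run in bounded stack.

```ocaml
(* Mutually tail-recursive pair, analogous to the bench's isEven/isOdd.
   OCaml, like wasm with return_call, executes these calls in constant
   stack; without tail calls, 100k hops would grow 100k frames. *)
let rec is_even n = if n = 0 then true else is_odd (n - 1)
and is_odd n = if n = 0 then false else is_even (n - 1)

let () =
  assert (is_even 100_000);
  assert (not (is_odd 100_000));
  print_endline "ok"
```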

tailcall.valid.{ok,ret.ok} quirk

wabt's wasm-validate rejects return_call_indirect with an i64 table-index under table64 even though it accepts the same i64 index for plain call_indirect. wasmtime treats them symmetrically and accepts our binary; the canister runs end-to-end. The new valid.ok files capture wabt's error messages so the test harness stays honest about that toolchain asymmetry — if wabt closes the gap upstream we'll re-accept.

ggreif and others added 2 commits May 6, 2026 11:22
Lift the var-construction (`nr (mk_fi ())` / `nr table_index, nr ty`)
out of both branches so the constructor choice is the only thing
the `if` decides. Style nit suggested mid-review; same emitted code,
no functional change.

Applied to all four sites: direct + indirect, classical + enhanced.
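For illustration, the post-refactor shape, with hypothetical stand-ins for `nr` / `mk_fi` (the real sites live in `compile_classical.ml` / `compile_enhanced.ml`):

```ocaml
(* Stand-in for the source-region wrapper the compiler uses. *)
type 'a phrase = { it : 'a }
let nr x = { it = x }

type instr = Call of int phrase | ReturnCall of int phrase

(* The operand is built once, hoisted above the branch, so the `if`
   decides only which constructor to apply — same emitted code. *)
let compile_call ~is_tail fi =
  let fi' = nr fi in
  if is_tail then ReturnCall fi' else Call fi'

let () =
  assert (compile_call ~is_tail:true 7 = ReturnCall (nr 7));
  assert (compile_call ~is_tail:false 7 = Call (nr 7));
  print_endline "ok"
```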

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ggreif commented May 6, 2026

Filed the upstream fix for the wabt return_call_indirect + table64 asymmetry referenced in 113d56dc9: WebAssembly/wabt#2744. Three-line patch + a test/typecheck/ regression.

Once that lands and we bump our pinned wabt, the test/bench/ok/tailcall.valid.{ok,ret.ok} files added in this PR can be removed (or re-accepted to a clean validate), since they only exist to document wabt's stricter-than-wasmtime reading of return_call_indirect under table64.

`return_call_indirect` under `table64` rejects an i64 table-index in
upstream wabt, while the matching `call_indirect` accepts it — a
clean asymmetry in the validator. The patched wabt mirrors the
working `OnCallIndirect` plumbing in `OnReturnCallIndirect` (4 lines
+ a regression test). With the patch, our `tailcall` bench's
`return_call_indirect` sites validate clean and the temporary
`tailcall.valid.{ok,ret.ok}` files this PR introduced are no longer
needed — drop them.

Patch is fetched from the open upstream PR via `fetchpatch`. Drop
this overlay (and the two-line `__intentionallyOverridingVersion`
shim) once the fix lands and nixpkgs picks it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/exes/moc.ml Outdated
…ackage.nix`

Replaces the local `super.wabt.overrideAttrs + fetchpatch` overlay with a
single-file flake input pinned at NixOS/nixpkgs#517726 (wabt 1.0.41) and a
trivial `super.callPackage wabt-package-src {}` overlay.

`type = "file"` fetches just the ~5 KB `package.nix` expression (narHash
pinned in flake.lock) — no second nixpkgs tarball, no second nixpkgs
evaluation. `super.callPackage` injects stdenv, cmake, python3, gtest,
fetchFromGitHub, … from the host nixpkgs, so the resulting derivation is
identical to what NixOS/nixpkgs#517726 would produce upstream.

Wins over `5a8df27df`'s fetchpatch:
- `wabt --version` prints `1.0.41` (was `1.0.40-pre-2744`); drops the
  `__intentionallyOverridingVersion = true` shim.
- No fetchpatch hash drift to chase.
- Cleanup at end-of-life is mechanical: drop the input, drop the
  overlay, no version-string unwind.

`nix eval` confirms `pkgs.wabt.version == "1.0.41"`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ggreif and others added 5 commits May 8, 2026 21:25
…erge commit

PR was merged 2026-05-08 14:41 UTC. Switching the `type = "file"`
input from the fork (`ggreif/nixpkgs@d26416cf`) to the upstream
merge commit (`NixOS/nixpkgs@5a610f5b4f`). The narHash is
identical — the merged file is byte-for-byte the fork's content,
confirming the PR landed exactly as proposed.

Verified locally: `nix develop --command wasm-validate --version`
and `wat2wasm --version` both report `1.0.41`.

Next cleanup (separate commit, when nixpkgs-unstable rolls past
the merge): drop the input + the overlay entirely, falling back
to plain `pkgs.wabt`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master switched to `nixpkgs-unstable` (PR #6105 / #6104) which carries
wabt 1.0.41 by default, so the `wabt-package-src` flake input and the
`super.callPackage` overlay introduced for #517726-bridging are no
longer needed.

Removes:
- `wabt-package-src` input + outputs arg in `flake.nix`
- `super.callPackage wabt-package-src {}` overlay in `nix/pkgs.nix`
- corresponding entry from `flake.lock`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

feature New feature or request performance Affects only gas usage or code size

2 participants