
feat(tailcalls): opt-in return_call codegen via --experimental-tailcalls#6043

Open
ggreif wants to merge 39 commits into master from gabor/wasm-exts-sync

Conversation

@ggreif ggreif commented Apr 22, 2026

Summary

End-to-end opt-in support for wasm tail calls (return_call / return_call_indirect AST in wasm-exts; TailCallPrim IR; producer pass; codegen for direct calls), gated behind the new --experimental-tailcalls flag. With the flag on, all tail calls — both self-recursive and cross-function — lower to wasm return_call, giving bounded stack for mutually-recursive / VM-dispatcher-shaped code, plus a small cycle-count improvement over the existing self-tail-call → loop rewrite. With the flag off, current behaviour is preserved exactly.

This PR was originally scoped as a broader wasm-exts sync to upstream wasm-2.0.2 (SIMD, reference types, multi-memory, …). After confirming that tail calls are not in wasm-2.0.2 (the OCaml interpreter at that release predates the proposal merge — see .claude/plans/wasm-exts-update.md), the tail-call work was carved out as its own self-contained slice on top of the existing wasm-exts. The broader 2.0.2 sync remains future work.

What's delivered

| layer | commit | what |
| --- | --- | --- |
| wasm-exts | f2c69f3e7 | `ReturnCall` / `ReturnCallIndirect` AST variants, smart constructors, encoder (opcodes 0x12 / 0x13), decoder (opcodes removed from the illegal-list) |
| CLI | a7ebc369d | `--experimental-tailcalls` flag wiring in `flags.ml` + `moc.ml` (no behaviour change yet) |
| IR | dd2d8337a | `TailCallPrim of Type.typ list` declaration + exhaustive plumbing through interpreter, type-checker, effect inference, async lowering, both backends |
| codegen | ef2c7fb0c | `compile_classical.ml` and `compile_enhanced.ml` direct-call arm: `is_tail` derived from the prim, emits `ReturnCall <fi>` and skips the trailing `FakeMultiVal.load` when the prim is `TailCallPrim` |
| producer | 4dbdb1712 | `tailcall.ml` gains a third arm: in tail position, when the self-recursion loop-rewrite doesn't apply and `--experimental-tailcalls` is set, emit `TailCallPrim` instead of `CallPrim` |
| Shared-fn fix | 6491d605c | `tailcall.ml` `FuncE` arm gates `tail_pos = true` on `s = Type.Local`. Shared bodies are descended with `tail_pos = false`, so the producer cannot label the wrapper-to-body call as tail-position. Fixes a cleanup-bypass trap that surfaced once `gauss` exercised the post-update-message path. Also passes `--enable-tail-call` to wasm-validate in the test-runner (`test-runner/src/run_test.rs`) |
| flag-gated loop elision | e84769da5 | `tailcall.ml` self-tail-call → loop rewrite is now gated on `not !Flags.experimental_tailcalls`. With the flag on, self-tail calls fall through to `TailCallPrim` and become `return_call`; with the flag off, the legacy loop rewrite still fires |
| bench | fb38c1b24, e24ebc003, 6491d605c, e84769da5 | `test/bench/tailcall.mo`: Hutton/Bahr-style stack VM running `fak` (mutual TCO) + naïve self-recursive `foldLeft` running 5-Year-Old Gauss (loop-rewrite vs `return_call`). Self-documents via `//MOC-FLAG --experimental-tailcalls` |

Empirical (EOP, --experimental-tailcalls on)

| benchmark | shape | cycles |
| --- | --- | --- |
| hutton | mutual-tail-recursive VM dispatcher, fak(10) × 1_000 | 25_416_088 |
| gauss | self-tail-recursive foldLeft (+) 0 [1..100] × 10_000 | 101_400_270 |

Comparisons (same tree, same EOP setting, only the flag flipped):

  • hutton (mutual): −636_000 cycles (~2.5%) vs call. Pure cost-table delta (Call=5 → ReturnCall=3 × dispatch count). Also bounded-stack — VM dispatchers can run forever without growing the wasm stack.
  • gauss (self-tail): −9_120_000 cycles (~8.2%) vs the loop rewrite. The loop rewrite has its own overhead (mutable arg-temps + loop/block wrapper) that return_call skips. Removing the loop rewrite under the flag is cheaper, not just stack-bounded.
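As a sanity check, the quoted percentages follow from the raw counters above (a quick arithmetic sketch; the cycle numbers are copied from the tables, the percentages are derived):

```rust
fn main() {
    // hutton: same tree, flag off vs flag on (counters from the comparison above)
    let (off, on): (i64, i64) = (26_052_088, 25_416_088);
    assert_eq!(off - on, 636_000);
    println!("hutton delta: {} (~{:.2}%)", off - on, 100.0 * (off - on) as f64 / off as f64);

    // gauss: loop rewrite (flag off) vs return_call (flag on)
    let (off, on): (i64, i64) = (110_520_270, 101_400_270);
    assert_eq!(off - on, 9_120_000);
    println!("gauss delta: {} (~{:.2}%)", off - on, 100.0 * (off - on) as f64 / off as f64);
}
```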

Design notes

  • Pipeline order matters. tailcall_optimization runs after async_lowering (pipeline.ml:796-797), so by the time the producer arm sees a CallPrim, awaitable / IC calls have already been desugared. The producer therefore cannot mistakenly tag a shared call as a tail call.
  • Codegen scope: direct calls only. TailCallPrim is honoured in the SR.Const Const.Fun (..., mk_fi, _) arm (known function index → emit ReturnCall <fi>). Closure calls (Type.Local via call_indirect) and shared calls fall through to the regular path even if the prim is TailCallPrim. Extending to return_call_indirect for computed tail-calls is described in the plan as a follow-up.
  • Validation argument. Wasm return_call requires the callee's wasm result type to match the enclosing function's. Motoko's all-multival-via-side-channel ABI gives every wasm function the result type [] (or a single i64), so the constraint holds trivially for direct calls between Motoko functions.
  • Shared-function bodies excluded. A Motoko public func foo() : async () = body compiles to a wasm wrapper of the form message_start ; user-body ; message_cleanup (state-machine transition + GC). The body is not in tail position from the wasm wrapper's perspective. Letting return_call escape the body would skip the cleanup, leaving the lifecycle stuck at InUpdate — the next update message then traps with 'internal error: unexpected state entering InUpdate'. The fix is to descend Shared FuncE bodies with tail_pos = false.
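For concreteness, the wire-level difference between a direct call and a direct tail call is a single opcode byte before the LEB128-encoded function index: 0x10 for `call`, 0x12 for `return_call` (and 0x13 for `return_call_indirect` next to 0x11), matching the encoder commit above. A minimal encoding sketch, with the unsigned-LEB128 helper written out inline for illustration:

```rust
// Emit the byte sequence for a direct wasm call instruction.
// `tail` selects return_call (0x12) over plain call (0x10).
fn encode_direct_call(func_index: u32, tail: bool) -> Vec<u8> {
    let mut bytes = vec![if tail { 0x12 } else { 0x10 }];
    // unsigned LEB128: 7 bits per byte, high bit = continuation
    let mut n = func_index;
    loop {
        let b = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            bytes.push(b);
            break;
        }
        bytes.push(b | 0x80);
    }
    bytes
}

fn main() {
    assert_eq!(encode_direct_call(5, false), vec![0x10, 0x05]);
    assert_eq!(encode_direct_call(5, true), vec![0x12, 0x05]);
    assert_eq!(encode_direct_call(300, true), vec![0x12, 0xac, 0x02]);
}
```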

Future work (scoped out, kept in the plan)

  • Broader wasm-exts 2.0.2 sync — the original ambition of this branch (SIMD, reference types, multi-memory, exception handling, …). Tail calls are not in 2.0.2 anyway, so the slice was carved out independently.
  • Source-level annotation (with tailcall) — per-call surface knob (presence-form via record-punning on a let tailcall = true) with a compile-time Bool-const constraint and a tail-position warning. Useful workaround for mo:core recursive algorithms.
  • return_call_indirect for computed tail-calls — VM dispatchers that index into a handler table, CPS continuations, dynamic dispatch in interpreters. Requires extending Closure.call_closure. The wasm-exts AST/codec already supports it; codegen lowering does not.

See .claude/plans/wasm-exts-update.md for the full design treatment, including open design questions for the annotation and the IC pricing measurements that informed the empirical comparison.

Related

🤖 Generated with Claude Code

@ggreif ggreif self-assigned this Apr 22, 2026

github-actions Bot commented Apr 22, 2026

Comparing from f8ad7b0 to a3ff621:
The produced WebAssembly code seems to be completely unchanged.
In terms of gas, no changes are observed in 5 tests.
In terms of size, no changes are observed in 5 tests.

@ggreif ggreif changed the title Syncing of wasm-exts to provide new instructions (e.g. SIMD!) chore: syncing of wasm-exts to provide new instructions (e.g. SIMD!) Apr 22, 2026
ggreif added a commit that referenced this pull request Apr 22, 2026
Moved to PR #6043 (wasm-exts sync), which will be the natural home for
constant-tracking reasoning as new instructions come in. No content
loss — the file is reproduced verbatim on gabor/wasm-exts-sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the plan file here from PR #5961's branch where it sat as a
leftover companion doc. Outlines the forthcoming `wasm-exts` sync
work — pulling upstream instruction support forward so codegen can
emit SIMD, newer ref-types, etc. This PR is the natural home.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggreif ggreif force-pushed the gabor/wasm-exts-sync branch from c3b35dc to 35f58d0 Compare April 22, 2026 11:17
ggreif added a commit that referenced this pull request Apr 22, 2026
Moved to PR #6043 (wasm-exts sync), which is the natural home for the
upcoming instruction-catchup work. No content loss — file is
reproduced verbatim on gabor/wasm-exts-sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ggreif and others added 11 commits May 4, 2026 17:56
Adds AST variants, smart constructors, encoder (opcodes 0x12 / 0x13),
and decoder cases for the tail-call proposal. The corresponding slots
are no longer in the illegal-opcode list of the binary decoder.

Unblocks `Tailcall.transform`'s standing TODO at
`src/ir_passes/tailcall.ml:13-14` ("can easily be extended to
non-self tail calls, once supported by wasm") — the AST level can
now express what the optimiser would emit. `pocket-ic` runtime
support is the remaining open question (filed in PR #6043 thread).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ine)

A small stack VM whose dispatcher is mutually tail-recursive: `step`
matches the opcode and tail-calls a per-opcode handler (`opPush`,
`opMul`, …); each handler tail-calls `step` again. Today's
`Tailcall` IR pass only rewrites *self* tail-calls into loops, so each
cross-function hop currently allocates a fresh frame.

This is the textbook shape for general TCO: first-order, no closures,
mutually tail-recursive. Once the optimiser is extended to emit
`return_call` for non-self tail calls (TODO at \`tailcall.ml:13-14\`,
unblocked by wasm-exts \`ReturnCall\` landing on this branch), the
diff against the committed cycle count will be the measurement.

Baseline (1000 iterations of fak(10)): 15_439_607 cycles ≈ 15.4k/iter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
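The dispatcher shape the bench exercises can be sketched outside Motoko (here in Rust; `step`, `op_push`, `op_mul` mirror the handler names described above, but this is an illustrative reconstruction, not the bench itself). Without tail-call elimination, every `step` → handler → `step` hop leaves a frame behind:

```rust
// A toy stack VM whose dispatcher and handlers are mutually
// tail-recursive: step -> handler -> step -> ...
#[derive(Clone, Copy)]
enum Op { Push(i64), Mul }

fn step(code: &[Op], pc: usize, stack: &mut Vec<i64>) -> i64 {
    match code.get(pc) {
        None => stack.pop().unwrap_or(0),                  // program done: top of stack
        Some(Op::Push(n)) => op_push(code, pc, stack, *n), // tail call into a handler
        Some(Op::Mul) => op_mul(code, pc, stack),          // tail call into a handler
    }
}

fn op_push(code: &[Op], pc: usize, stack: &mut Vec<i64>, n: i64) -> i64 {
    stack.push(n);
    step(code, pc + 1, stack) // tail call back into the dispatcher
}

fn op_mul(code: &[Op], pc: usize, stack: &mut Vec<i64>) -> i64 {
    let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
    stack.push(a * b);
    step(code, pc + 1, stack) // tail call back into the dispatcher
}

fn main() {
    // fak(10) as straight-line VM code: push 1..=10, then 9 multiplies
    let mut code: Vec<Op> = (1..=10).map(Op::Push).collect();
    code.extend(std::iter::repeat(Op::Mul).take(9));
    let mut stack = Vec::new();
    assert_eq!(step(&code, 0, &mut stack), 3_628_800);
}
```

With wasm `return_call` each of those hops reuses the current frame, which is what makes the dispatcher stack-bounded.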
Just the flag, currently a no-op. IR primitive (`TailCallPrim`) and
backend codegen (emitting `return_call` / `return_call_indirect`)
land in follow-up commits — wasm-exts already grew the AST-level
support in `f2c69f3e7`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new `prim` constructor `TailCallPrim of Type.typ list` as a
sibling of `CallPrim`, marking a function call that occurs in tail
position (cf. wasm `return_call`). For now nothing produces it; this
commit only wires the variant exhaustively through the consumers so
the IR type and all passes remain total:

- `ir.ml`: declaration + `t_typ` traversal
- `arrange_ir.ml`: pretty-print as "TailCallPrim"
- `ir_effect.ml`, `check_ir.ml`, `interpret_ir.ml`: handled
  identically to `CallPrim` via or-patterns
- `async.ml`: tail-calls to awaitable functions fall into the same
  async-lowering arm (defensive — a producer should not emit them,
  but if it did, the lowering would still be correct)
- `compile_classical.ml`, `compile_enhanced.ml`: both backends
  treat `TailCallPrim` like `CallPrim`; backend specialisation to
  emit `return_call` lands next.

The `tailcall.ml` IR pass needs no change yet — its catch-all
`PrimE (p, es)` arm handles `TailCallPrim` correctly (children
descended in non-tail context, no rewrite). The producer extension
that converts non-self tail-positioned `CallPrim` into
`TailCallPrim` (gated by `--experimental-tailcalls`) is the next
step, alongside the backend codegen.

`test/bench/tailcall.mo` baseline cycles unchanged (verified
against the committed .ok), confirming this is a pure plumbing
commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
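The or-pattern plumbing style is easy to picture outside OCaml too. A hedged Rust analogue (`Prim` and `arity` are invented names for illustration; the real pass lives in the OCaml files listed above):

```rust
// Hypothetical IR primitive: TailCallPrim is a sibling of CallPrim
// carrying the same payload, so consumers that don't care about
// tail-ness bind both variants with one match arm.
#[derive(Debug, PartialEq)]
enum Prim {
    CallPrim(Vec<String>),     // type instantiation, simplified to names
    TailCallPrim(Vec<String>),
}

fn arity(prim: &Prim) -> usize {
    match prim {
        // or-pattern: handled identically to CallPrim
        Prim::CallPrim(insts) | Prim::TailCallPrim(insts) => insts.len(),
    }
}

fn main() {
    assert_eq!(arity(&Prim::CallPrim(vec!["Nat".into()])), 1);
    assert_eq!(arity(&Prim::TailCallPrim(vec!["Nat".into()])), 1);
}
```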
Both backends now branch on the prim variant in the call arm. When the
prim is `TailCallPrim` and the callee is a known function index
(`SR.Const Const.Fun (..., mk_fi, _)`), emit `ReturnCall <fi>` and
omit the trailing `FakeMultiVal.load` (control never returns to that
point). Otherwise — closure calls (`Type.Local` via `call_indirect`)
and shared/IC calls — fall through to the regular call path even when
the prim is `TailCallPrim`; the producer should not emit those, but
falling through is semantics-preserving if it does.

Validation note: wasm `return_call` requires the callee's result type
to match the enclosing function's. Motoko's all-multival-via-side-channel
ABI gives every wasm function the result type `[]`, so the constraint
is trivially satisfied for direct calls between Motoko functions.

Dormant for now: the `tailcall.ml` IR pass does not yet produce
`TailCallPrim`. The bench cycle count is unchanged (verified against
the committed .ok). The producer extension — emit `TailCallPrim` for
non-self tail-positioned calls when `--experimental-tailcalls` is set
— is the final piece.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the `Tailcall` pass with a third arm. Behaviour:

- Self tail-call with matching type instantiation → loop rewrite
  (existing path; strictly cheaper than `return_call` for self).
- Otherwise, in tail position, with `--experimental-tailcalls` set →
  emit `PrimE (TailCallPrim insts, ...)`. The codegen path already
  in place lowers it to wasm `return_call`.
- Else → ordinary `CallPrim` (no behaviour change).

The pipeline order `async_lowering ; tailcall_optimization`
(`pipeline.ml:796-797`) means awaitable calls have already been
desugared by the time we see them, so this arm cannot tag a shared
or local-async call as `TailCallPrim`.

Measured against `test/bench/tailcall.mo` (Hutton VM, fak 10 ×1000):

  baseline (no flag):            15_439_607 cycles
  with --experimental-tailcalls: 25_416_088 cycles

So this is currently a *regression* on the IC instruction counter
(~65% more cycles per mutual dispatch hop). The codegen is correct
(`fak10 = 3_628_800` in both). The cause is not `moc`-side: the
emitted `return_call` replaces a `call` with no surrounding code
change, yet the IC's wasmtime metering charges `return_call` more
heavily than `call` followed by an implicit return. So the
infrastructure is ready, but the actual win has to wait on either
runtime/metering improvements or selective producer heuristics.

Default behaviour is unchanged — the flag is opt-in. The committed
bench .ok stays at the baseline number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "What We Do NOT Need (initially)" bullet for tail calls is now
out of date — the slice was delivered ahead of the broader 2.0.2
sync. Flips the bullet and adds a § *Tail-call instructions
(delivered)* covering the commit chain, the design choices
(pipeline ordering, direct-call-only scope, validation argument,
why self-tail still loops), and the IC instruction-counter
finding (`return_call` is ~65% pricier per hop, so the flag is a
bounded-stack opt-in rather than a perf knob).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Records the design idea for a per-call source-level opt-in. Captures:

- Motivation: per-call granularity matches the property
  `return_call` provides (bounded stack at metered cost); equally
  valuable as a declarative diagnostic à la Scala `@tailrec`. Near-
  term workaround for `mo:core` recursive algorithms while the IC's
  per-instruction `return_call` cost remains elevated.
- Surface syntax: both `(with tailcall = true)` and the punning
  `(with tailcall)` form, neither requiring grammar changes.
- Typecheck constraint: must be a compile-time-known Bool.
- Lowering: bypass the flag-gated producer arm, go straight to
  `TailCallPrim` and reuse existing codegen.
- Four open design questions: flag interaction, where to run the
  tail-position warning, self-recursion behaviour, indirect calls.

Marked *proposed* (not delivered). Existing "Future work" bullets
restructured so the annotation is the headline next step and
"producer heuristics" is explicitly subsumed by it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ail-calls

Reframes the indirect-tail-call follow-up around its actual use case —
*computed* tail-calls, where the callee is a value chosen at runtime
rather than a statically-known name. That's the complement of the
delivered direct-call path: VM dispatchers that index into a handler
table, CPS-transformed continuations, dynamic-dispatch interpreters.

Section covers:

- Motivation: what direct `return_call` *can't* do.
- Codegen plumbing: `Closure.call_closure` gains an `is_tail`
  parameter; both backends touched symmetrically.
- Wasm validation precondition: the type-table entry's result type
  must match the enclosing function's. Motoko's uniform ABI satisfies
  this in practice, but emission must enforce it explicitly.
- Bench coverage: a sibling `tailcall-computed.mo` storing opcode
  handlers as closures isolates the indirect-call premium against the
  existing direct-call bench.
- Cycle-cost expectation: unknown without measurement; `call_indirect`
  is already pricier than `call`, so the relative tail-call premium
  may differ.
- Interaction with `(with tailcall)`: the annotation's "direct only"
  warning becomes unnecessary once this lands.

Existing "Future work" bullet collapses to a forward reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ggreif and others added 6 commits May 5, 2026 11:22
…lls`

Folds the experimental-tailcalls flag directly into the source so the
bench unconditionally measures the TCO codegen path — what it's named
for. Re-accepts the .ok against the EOP+TCO baseline:

  cycles = 25_416_088   (was 15_439_607)

The earlier 15.4M number came from running before the producer +
backend codegen landed AND with run-test's classical-persistence
default; the new 25.4M is EOP + TCO, matching what `make tailcall.only`
in this repo's bench Makefile actually exercises.

Side-effect of TCO codegen: wasm-validate doesn't know `return_call`
(opcode 0x12), so its expected output is the error itself (committed
as `tailcall.valid.ok` / `.valid.ret.ok`). Could pass
`--enable-tail-call` to wasm-validate in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Empirical: IC instruction-counter cost" section said the flag
caused a ~65% cycle regression. That was a methodology error: the
"baseline" 15.4M came from a committed .ok captured before the
producer + backend landed, and `run-test`'s default persistence
mode (classical) differs from the bench Makefile's (EOP).

Honest same-tree, same-EOP-setting comparison:

| build | cycles |
| --- | --- |
| no `--experimental-tailcalls` | 26_052_088 |
| `--experimental-tailcalls` | 25_416_088 |
| delta | -636_000 (TCO cheaper, ~2.5%) |

Reframes the section: TCO is mildly *cheaper* on the cycle axis, in
agreement with the cost table (Call=5, ReturnCall=3). The flag's
primary value remains bounded-stack guarantees for mutual recursion
/ VM dispatchers / CPS, with the cycle reduction as a small bonus.

Bench .ok was relocked to the with-flag EOP baseline in `e24ebc003`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…` bench

The `FuncE` arm in `tailcall.ml` set `tail_pos = true` for every
function body unconditionally. For Shared functions (post async-
lowering: `Shared+Replies` for awaitables, `Shared+Returns` for
one-shot oneway updates) that's wrong: the wasm-level wrapper has
`message_start ; user-body ; message_cleanup` (state-machine
transition + GC) — cleanup runs *below* the body, so the body is
not in tail position from the wasm function's perspective.

With `--experimental-tailcalls` set, the producer arm would emit
`TailCallPrim` for the body's last call and codegen would lower
to `return_call $$lambda.N`, bypassing the cleanup. The lifecycle
state stays `InUpdate`, and the next message traps with
"internal error: unexpected state entering InUpdate" when the
runtime tries `Idle → InUpdate` again.

Fix: gate `tail_pos` on `s = Type.Local`. Shared bodies are
descended with `tail_pos = false` (via `exp` instead of
`tailexp`), so the producer can no longer label the wrapper-to-
body call as tail-position. The existing self-tail-recursion → loop
rewrite for Local functions is preserved unchanged.

Also adds the `gauss` bench: naïve self-recursive `foldLeft (+) 0`
over [1..100] × 10_000. Complements the existing VM bench by
exercising the loop-rewrite path (`tailcall.ml:185-200`) — today
both flag-on and flag-off compile foldLeft to a wasm `loop`. When
the loop-rewrite is later removed in favour of uniform
`return_call` codegen, the cycle delta on `gauss` will be the
load-bearing measurement.

Passes `--enable-tail-call` to `wasm-validate` in the test runner so
benches that emit `return_call` no longer trip on the validator
side. Drops the `tailcall.valid.ok` / `.valid.ret.ok` files that
were capturing the validator's `unexpected opcode: 0x12` error —
no longer needed.

Bench numbers (EOP, `--experimental-tailcalls`):

  go    (VM, mutual TCO):           25_416_088 cycles / 1_000 iter
  gauss (loop-rewritten foldLeft): 110_520_270 cycles / 10_000 iter

The `gauss` bench is what surfaced the codegen bug above — without
it the trap would not have fired (the existing `go` alone runs as
the *first* update message and hits the cleanup-bypass invisibly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tailcalls`

When the flag is on, the existing self-tail-call → loop rewrite is
skipped, so self-recursive tail calls fall through to the
`TailCallPrim` arm and codegen emits `return_call` instead of the
`loop { … local.set; br 0 }` machinery. When the flag is off, the
loop rewrite still fires (current default behaviour preserved).

Measured on `test/bench/tailcall.mo` `gauss` (naïve self-recursive
`foldLeft (+) 0 [1..100]` × 10_000):

  loop-rewrite (flag off / today): 110_520_270 cycles
  return_call  (flag on, new):     101_400_270 cycles
  delta:                            -9_120_000 (~8.2% cheaper)

The loop rewrite has its own overhead — it copies args into mutable
temps and adds a `loop` / `block` wrapper — and per the IC's cost
table, `return_call` (3) plus the wasm-level arg-passing it would do
anyway beats that overhead in this benchmark. So removing the loop
rewrite in favour of uniform `return_call` codegen is genuinely
cheaper, not just stack-bounded.

The flag now means: "use `return_call` for ALL tail calls, both
self-tail and cross-function." The loop rewrite remains as the legacy
fallback for flag-off.

Bench .ok re-locked at `gauss = 101_400_270` (with-flag baseline);
`go` unchanged at `25_416_088`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggreif ggreif changed the title chore: syncing of wasm-exts to provide new instructions (e.g. SIMD!) feat(tailcalls): opt-in return_call codegen via --experimental-tailcalls May 5, 2026
ggreif and others added 2 commits May 5, 2026 13:01
…numbers

Marks the tail-call slice on `gabor/wasm-exts-sync` as complete:

- Stack table: adds the Shared-fn fix + gauss-bench commit
  (`6491d605c`) and the flag-gated loop-elision commit
  (`e84769da5`).
- Design notes: replaces "self-recursion still loops" with the
  Shared-bodies-excluded fix and the flag-gated self-tail
  loop-rewrite story.
- Empirical: adds the `gauss` table — self-tail-recursion under
  the flag is **8.2% cheaper** than the loop rewrite. Combined
  with mutual TCO's 2.5% (`go`), the flag is a strict win on
  every measured shape, contradicting the intuition that "loop
  rewrite is cheaper than return_call for self-tail."
- Future work: drops the now-superseded "producer heuristics"
  bullet, adds a "default-on `--experimental-tailcalls`" item
  (flip the default + remove the loop rewrite once IC support is
  universal — the empirical data already favours it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Names the bench after what it actually exercises — the Hutton/Bahr
stack VM. `hutton` (mutual-tail-recursive VM) and `gauss`
(self-tail-recursive foldLeft) now form a coherent named pair.

Plan doc updated to track the rename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ggreif commented May 5, 2026

@crusso — pinging you on the tail-call slice. Self-contained and opt-in behind --experimental-tailcalls; the PR description and .claude/plans/wasm-exts-update.md carry the full design. Two empirical wins on the IC instruction counter under the flag:

| bench | shape | flag-off → flag-on | delta |
| --- | --- | --- | --- |
| hutton | mutually-tail-recursive Hutton/Bahr stack VM, fak(10) × 1k | 26.05M → 25.42M | −2.5% |
| gauss | self-tail-recursive foldLeft (+) 0 [1..100] × 10k | 110.52M → 101.40M | −8.2% |

So return_call beats both ordinary call (mutual case) and the existing self-tail → loop rewrite (gauss case) on cycles, contradicting the intuition that "loop rewrite is strictly cheaper than return_call for self-tail."

Two concrete future directions worth noting alongside what's already in the plan (§ Source-level annotation (with tailcall) and § return_call_indirect for computed tail-calls):

  1. return_call_indirect for tail-position invocation of dynamic closures. Today the codegen specialises only the direct-call arm (SR.Const Const.Fun (..., mk_fi, _) → ReturnCall <fi>). Closure-call sites (_, Type.Local via Closure.call_closure) silently degrade to non-tail call_indirect even when the IR carries TailCallPrim. Extending Closure.call_closure to take an is_tail parameter and conditionally emit return_call_indirect would close that gap — the wasm-exts AST/codec already grew the variant, only the lowering is missing. Now also unblocked on the IC side: "fix: instructions for Operator::ReturnCallIndirect" (dfinity/ic#10086) just merged, cutting the Operator::ReturnCallIndirect cost from 60 to 6 (a 10× reduction), so once a pocket-ic release picks it up, indirect tail-calls should measure at parity with ordinary call_indirect.

  2. Destination-passing style for TCMC (Tail Call Modulo Cons). Functions like map f (x :: xs) = f x :: map f xs are not in tail position under standard TCO (the recursive call sits inside a cons constructor), and currently stack-overflow on long lists. The OCaml-5 [@tail_mod_cons] / Koka / Lean trick: allocate the constructor ?(f(x), <hole>) eagerly on the heap, pass the address of <hole> to the recursive call as a destination argument, and have the recursive call write its result into the destination slot before tail-calling itself. The recursive call thereby becomes tail (modulo a heap write), and return_call can apply uniformly. Implementation would touch the IR (a pre-pass recognising tail-mod-cons patterns), the calling convention (extra destination operand), and the allocator (exposing unfilled slots through the write barrier). Complementary to (1): (1) handles "tail call to a computed callee," (2) handles "tail call to self with a constructor in the way."
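The destination-passing idea in (2) can be sketched in safe Rust (an illustrative reconstruction, not Motoko codegen: `Node` and `map_dps` are invented names, and the recursive call below stands in for the `return_call` a TCMC-aware compiler would emit — the point is that the cons cell is allocated eagerly with an empty `next` hole that the recursive call fills in):

```rust
// Singly linked list with an Option<Box<...>> "hole" in each cell.
struct Node { val: i64, next: Option<Box<Node>> }

// map in destination-passing style: the caller hands us the slot
// (the "hole") our result must be written into.
fn map_dps(f: &impl Fn(i64) -> i64, xs: &[i64], dest: &mut Option<Box<Node>>) {
    if let Some((&x, rest)) = xs.split_first() {
        // allocate the cons eagerly with an unfilled next slot
        *dest = Some(Box::new(Node { val: f(x), next: None }));
        // recursive call with the fresh hole as destination — this call
        // is now in tail position (modulo the heap write just done), so
        // return_call could apply uniformly
        map_dps(f, rest, &mut dest.as_mut().unwrap().next);
    }
}

fn main() {
    let mut head = None;
    map_dps(&|x| x * 2, &[1, 2, 3], &mut head);
    let mut out = Vec::new();
    let mut cur = head;
    while let Some(node) = cur {
        out.push(node.val);
        cur = node.next;
    }
    assert_eq!(out, vec![2, 4, 6]);
}
```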

Together with the (with tailcall) annotation idea from the plan, the long-term shape is: explicit user intent → producer pass classifies every recursive call site (direct, indirect, or modulo-cons) → uniform return_call / return_call_indirect lowering → bounded stack everywhere, default-on once IC support is universal.

Happy to chase any of these as separate follow-up PRs if there's interest.

ggreif and others added 2 commits May 5, 2026 14:40
Pulls the conditional inside `G.i` so we just pick the constructor
(`ReturnCall` vs `Call`) without duplicating the surrounding
`G.i (...)` wrapper. The trailing `FakeMultiVal.load` is now
unconditional: after `return_call` it sits in unreachable code,
which the wasm validator accepts via the post-terminator
polymorphic-stack rule. For arity ≤ 1 (the common case)
`FakeMultiVal.load` is `G.nop`, so this changes nothing on the
wire; for arity > 1 it adds a few `global.get` bytes that never
execute.

Per @ggreif's review on PR #6043.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test-runner's `extract_directive` (`test-runner/src/run_test.rs`)
does a substring-find on `//MOC-FLAG` per line, with no line-start
anchor, so the prose mention `//MOC-FLAG --experimental-tailcalls` in
the top-of-file comment was being parsed as a second directive — the
trailing backtick after the matched substring then landed inside
moc's flag string, and moc rejected `` --experimental-tailcalls` ``
as an unknown option.

Cheapest fix: rephrase the prose so the literal `//MOC-FLAG`
substring no longer appears. The proper fix is to anchor
`extract_directive` on line-start; left as a follow-up since this
PR shouldn't change the test-runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
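The proposed follow-up is simple to sketch (hedged: `extract_directive`'s real signature in `run_test.rs` may differ; this only shows the line-start anchoring):

```rust
/// Return the arguments of a directive, but only when the directive
/// starts the line (modulo leading whitespace) — prose that merely
/// mentions the literal substring no longer matches.
fn extract_directive<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    line.trim_start().strip_prefix(key).map(str::trim)
}

fn main() {
    assert_eq!(
        extract_directive("//MOC-FLAG --experimental-tailcalls", "//MOC-FLAG"),
        Some("--experimental-tailcalls")
    );
    // a prose mention mid-line is ignored
    assert_eq!(
        extract_directive("self-documents via its MOC FLAG comment", "//MOC-FLAG"),
        None
    );
}
```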
@ggreif ggreif added performance Affects only gas usage or code size labels May 5, 2026

crusso commented May 5, 2026

Description all sounds sensible but haven't looked at code at all. Impressive!


ggreif commented May 5, 2026

Description all sounds sensible but haven't looked at code at all. Impressive!

The code is surprisingly straightforward too!

cycle axis, so the migration is a strict improvement plus
bounded-stack.

## Source-level annotation: `(with tailcall)` — *proposed*

As soon as the cycle counts are stabilised and the tail-call instructions are consistently cheaper than the stack-eaters, we'll flip the default for --experimental-tailcalls. Then we'll remove the flag from the algorithm (and with it the loop emulation) and reuse the flag for TCMC and other clever stuff.

sensible recommendation for the `mo:core` library or only for niche
deep-recursion cases.

### Interaction with `(with tailcall)` annotation

@ggreif ggreif May 5, 2026


When the return_call* is cheaper (and better) in all regards, (with tailcall) won't make sense.


It might make sense to understand it as a directive: error out if the tail-call is not realisable by the compiler mid-end transform.

@ggreif ggreif left a comment


Do we want to document the flag to the user?

ggreif and others added 6 commits May 5, 2026 21:52
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`#6043` shipped tail-call codegen for direct `call`s only — closure-
dispatched calls (anything reached via `call_indirect`) silently
degraded to non-tail dispatch. mccarthy-style mutually tail-recursive
`async* Bool` chains stack-overflow at ~1k hops because every `await*`
goes through the closure dispatch path.

Three tightly coupled changes flip indirect tail-calls on:

1. `Closure.call_closure` (both backends) gains an optional
   `~is_tail` parameter and emits `ReturnCallIndirect` when set.
2. The `_, Type.Local` arm in `compile_*.ml` (the closure-call
   producer in `CallPrim _ | TailCallPrim _ as cp`) threads
   `is_tail` through.
3. **The linker.** `linkModule.ml`'s `rename_types` pass renumbers
   type-table indices when merging RTS + user module — and its
   instruction matcher handled `CallIndirect` but NOT
   `ReturnCallIndirect`, so the new tail-call sites kept stale
   pre-merge type indices and the binary failed wasmtime's
   validator with bogus type signatures (e.g. `(func (result i64))`
   on a 4-i64-no-result call site).

Validates and runs end-to-end on the IC: the new `mccarthy` row in
`test/bench/tailcall.mo` (mutually tail-recursive `isEven`/`isOdd`
in `async* Bool`, chained via `await*`) executes 100k mutual hops
in ~547 cycles each, no stack growth.

`tailcall.valid.{ok,ret.ok}` capture wabt's stricter (and arguably
buggy) reading of `return_call_indirect` under `table64`: wabt
accepts an i64 table-index for `call_indirect` but expects i32 for
`return_call_indirect`. wasmtime treats them symmetrically and
accepts our binary. Tracking these messages keeps the test honest
to the toolchain we ship with — if wabt fixes the asymmetry the
.ok files will need re-acceptance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the indirect-tail-call work out of "future work" / "proposed"
and into the delivered table. Update the codegen-scope note to
reflect that closure dispatch now also emits `ReturnCallIndirect`,
and record the linker-side fix (the bug that hid the codegen change
behind a wasmtime validation error).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ggreif commented May 6, 2026

Update: extending TCO to indirect calls + linker fix

Spotted a real gap while drafting a follow-up bench: the original PR's TCO covered direct call only — closure-dispatched calls (anything reached via call_indirect) silently degraded to non-tail dispatch. Mutually tail-recursive async* Bool chains stack-overflowed at ~1k hops because every await* goes through the closure dispatch path (the lambda async-lowering produces from [v_ret; v_fail; v_clean] -->* call user_body krb).

Fixed in commit 113d56dc9 plus plan housekeeping in c3fb49ed3. Three coupled changes:

  1. Closure.call_closure (both backends) gains an optional ~is_tail parameter and emits ReturnCallIndirect when set.
  2. The _, Type.Local arm in compile_*.ml (the closure-call producer in CallPrim _ | TailCallPrim _ as cp) threads is_tail through.
  3. The linker. linkModule.ml's rename_types pass renumbers type-table indices when merging RTS + user module — and its instruction matcher handled CallIndirect but not ReturnCallIndirect, so the new tail-call sites kept stale pre-merge type indices and the binary failed wasmtime's validator with bogus type signatures (e.g. (func (result i64)) on a 4-i64-no-result call site). One-line fix.
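The linker bug in (3) is the subtle one. As a minimal sketch (stand-in types only, not the real `linkModule.ml`): a type-index renumbering pass has to rewrite every instruction that embeds a type-table index, and `ReturnCallIndirect` carries one exactly like `CallIndirect` does, so omitting it from the match silently preserves stale pre-merge indices.

```ocaml
(* Illustrative only: a toy instruction type standing in for the real
   wasm-exts AST. The fixed pass must map the type index in *both*
   indirect-call forms; the bug was that only CallIndirect was covered. *)
type instr =
  | CallIndirect of int        (* type-table index *)
  | ReturnCallIndirect of int  (* same shape; new tail-call form *)
  | Other

let rename_types f = function
  | CallIndirect ty -> CallIndirect (f ty)
  | ReturnCallIndirect ty -> ReturnCallIndirect (f ty)  (* the one-line fix *)
  | Other -> Other

let () =
  (* After merging RTS + user module, indices shift; both forms must follow. *)
  let shift ty = ty + 10 in
  assert (rename_types shift (CallIndirect 3) = CallIndirect 13);
  assert (rename_types shift (ReturnCallIndirect 3) = ReturnCallIndirect 13);
  print_endline "ok"
```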

New mccarthy bench

Added a third row to test/bench/tailcall.mo: mutually tail-recursive isEven/isOdd in async* Bool, chained via await*.

| Bench | Mode | Cycles |
| --- | --- | --- |
| hutton | direct `call` → `return_call` | 25,416,088 |
| gauss | self-recursive `foldLeft` → `return_call` | 101,400,270 |
| mccarthy | indirect `call_indirect` → `return_call_indirect` (new) | 54,700,107 |

100k mutual hops, no stack growth. ~547 cycles per hop.
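For flavour, here is the shape of the bench as a hedged OCaml sketch. The real bench is Motoko `async* Bool` chained via `await*` (which is what forces closure dispatch); this only illustrates the mutual-tail-recursion pattern that needs genuine cross-function tail calls to run in bounded stack.

```ocaml
(* Mutually tail-recursive pair, analogous to the bench's isEven/isOdd.
   OCaml, like wasm with return_call, executes these calls in constant
   stack; without tail calls, 100k hops would grow 100k frames. *)
let rec is_even n = if n = 0 then true else is_odd (n - 1)
and is_odd n = if n = 0 then false else is_even (n - 1)

let () =
  assert (is_even 100_000);
  assert (not (is_odd 100_000));
  print_endline "ok"
```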

tailcall.valid.{ok,ret.ok} quirk

wabt's wasm-validate rejects return_call_indirect with an i64 table-index under table64 even though it accepts the same i64 index for plain call_indirect. wasmtime treats them symmetrically and accepts our binary; the canister runs end-to-end. The new valid.ok files capture wabt's error messages so the test harness stays honest about that toolchain asymmetry — if wabt closes the gap upstream we'll re-accept.

ggreif and others added 2 commits May 6, 2026 11:22
Lift the var-construction (`nr (mk_fi ())` / `nr table_index, nr ty`)
out of both branches so the constructor choice is the only thing
the `if` decides. Style nit suggested mid-review; same emitted code,
no functional change.

Applied to all four sites: direct + indirect, classical + enhanced.
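For illustration, the post-refactor shape, with hypothetical stand-ins for `nr` / `mk_fi` (the real sites live in `compile_classical.ml` / `compile_enhanced.ml`):

```ocaml
(* Stand-in for the source-region wrapper the compiler uses. *)
type 'a phrase = { it : 'a }
let nr x = { it = x }

type instr = Call of int phrase | ReturnCall of int phrase

(* The operand is built once, hoisted above the branch, so the `if`
   decides only which constructor to apply — same emitted code. *)
let compile_call ~is_tail fi =
  let fi' = nr fi in
  if is_tail then ReturnCall fi' else Call fi'

let () =
  assert (compile_call ~is_tail:true 7 = ReturnCall (nr 7));
  assert (compile_call ~is_tail:false 7 = Call (nr 7));
  print_endline "ok"
```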

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ggreif commented May 6, 2026

Filed the upstream fix for the wabt return_call_indirect + table64 asymmetry referenced in 113d56dc9: WebAssembly/wabt#2744. Three-line patch + a test/typecheck/ regression.

Once that lands and we bump our pinned wabt, the test/bench/ok/tailcall.valid.{ok,ret.ok} files added in this PR can be removed (or re-accepted to a clean validate), since they only exist to document wabt's stricter-than-wasmtime reading of return_call_indirect under table64.

`return_call_indirect` under `table64` rejects an i64 table-index in
upstream wabt, while the matching `call_indirect` accepts it — a
clean asymmetry in the validator. The patched wabt mirrors the
working `OnCallIndirect` plumbing in `OnReturnCallIndirect` (4 lines
+ a regression test). With the patch, our `tailcall` bench's
`return_call_indirect` sites validate clean and the temporary
`tailcall.valid.{ok,ret.ok}` files this PR introduced are no longer
needed — drop them.

Patch is fetched from the open upstream PR via `fetchpatch`. Drop
this overlay (and the two-line `__intentionallyOverridingVersion`
shim) once the fix lands and nixpkgs picks it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/exes/moc.ml Outdated
…ackage.nix`

Replaces the local `super.wabt.overrideAttrs + fetchpatch` overlay with a
single-file flake input pinned at NixOS/nixpkgs#517726 (wabt 1.0.41) and a
trivial `super.callPackage wabt-package-src {}` overlay.

`type = "file"` fetches just the ~5 KB `package.nix` expression (narHash
pinned in flake.lock) — no second nixpkgs tarball, no second nixpkgs
evaluation. `super.callPackage` injects stdenv, cmake, python3, gtest,
fetchFromGitHub, … from the host nixpkgs, so the resulting derivation is
identical to what NixOS/nixpkgs#517726 would produce upstream.

Wins over `5a8df27df`'s fetchpatch:
- `wabt --version` prints `1.0.41` (was `1.0.40-pre-2744`); drops the
  `__intentionallyOverridingVersion = true` shim.
- No fetchpatch hash drift to chase.
- Cleanup at end-of-life is mechanical: drop the input, drop the
  overlay, no version-string unwind.

`nix eval` confirms `pkgs.wabt.version == "1.0.41"`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ggreif and others added 5 commits May 8, 2026 21:25
…erge commit

PR was merged 2026-05-08 14:41 UTC. Switching the `type = "file"`
input from the fork (`ggreif/nixpkgs@d26416cf`) to the upstream
merge commit (`NixOS/nixpkgs@5a610f5b4f`). The narHash is
identical — the merged file is byte-for-byte the fork's content,
confirming the PR landed exactly as proposed.

Verified locally: `nix develop --command wasm-validate --version`
and `wat2wasm --version` both report `1.0.41`.

Next cleanup (separate commit, when nixpkgs-unstable rolls past
the merge): drop the input + the overlay entirely, falling back
to plain `pkgs.wabt`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master switched to `nixpkgs-unstable` (PR #6105 / #6104) which carries
wabt 1.0.41 by default, so the `wabt-package-src` flake input and the
`super.callPackage` overlay introduced for #517726-bridging are no
longer needed.

Removes:
- `wabt-package-src` input + outputs arg in `flake.nix`
- `super.callPackage wabt-package-src {}` overlay in `nix/pkgs.nix`
- corresponding entry from `flake.lock`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

feature New feature or request performance Affects only gas usage or code size

2 participants