feat(tailcalls): opt-in return_call codegen via --experimental-tailcalls#6043
feat(tailcalls): opt-in return_call codegen via --experimental-tailcalls#6043ggreif wants to merge 39 commits into
return_call codegen via --experimental-tailcalls#6043Conversation
wasm-exts to provide new instructions (e.g. SIMD!)wasm-exts to provide new instructions (e.g. SIMD!)
Moved to PR #6043 (wasm-exts sync), which will be the natural home for constant-tracking reasoning as new instructions come in. No content loss — the file is reproduced verbatim on gabor/wasm-exts-sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the plan file here from PR #5961's branch where it sat as a leftover companion doc. Outlines the forthcoming `wasm-exts` sync work — pulling upstream instruction support forward so codegen can emit SIMD, newer ref-types, etc. This PR is the natural home. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c3b35dc to
35f58d0
Compare
Moved to PR #6043 (wasm-exts sync), which is the natural home for the upcoming instruction-catchup work. No content loss — file is reproduced verbatim on gabor/wasm-exts-sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds AST variants, smart constructors, encoder (opcodes 0x12 / 0x13),
and decoder cases for the tail-call proposal. The corresponding slots
are no longer in the illegal-opcode list of the binary decoder.
Unblocks `Tailcall.transform`'s standing TODO at
\`src/ir_passes/tailcall.ml:13-14\` ("can easily be extended to
non-self tail calls, once supported by wasm") — the AST level can
now express what the optimiser would emit. \`pocket-ic\` runtime
support is the remaining open question (filed in PR #6043 thread).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ine) A small stack VM whose dispatcher is mutually tail-recursive: `step` matches the opcode and tail-calls a per-opcode handler (`opPush`, `opMul`, …); each handler tail-calls `step` again. Today's `Tailcall` IR pass only rewrites *self* tail-calls into loops, so each cross-function hop currently allocates a fresh frame. This is the textbook shape for general TCO: first-order, no closures, mutually tail-recursive. Once the optimiser is extended to emit `return_call` for non-self tail calls (TODO at \`tailcall.ml:13-14\`, unblocked by wasm-exts \`ReturnCall\` landing on this branch), the diff against the committed cycle count will be the measurement. Baseline (1000 iterations of fak(10)): 15_439_607 cycles ≈ 15.4k/iter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Just the flag, currently a no-op. IR primitive (\`TailCallPrim\`) and backend codegen (emitting \`return_call\` / \`return_call_indirect\`) land in follow-up commits — wasm-exts already grew the AST level support in \`f2c69f3e7\`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a new \`prim\` constructor \`TailCallPrim of Type.typ list\` as a sibling of \`CallPrim\`, marking a function call that occurs in tail position (cf. wasm \`return_call\`). For now nothing produces it; this commit only wires the variant exhaustively through the consumers so the IR type and all passes remain total: - \`ir.ml\`: declaration + \`t_typ\` traversal - \`arrange_ir.ml\`: pretty-print as "TailCallPrim" - \`ir_effect.ml\`, \`check_ir.ml\`, \`interpret_ir.ml\`: handled identically to \`CallPrim\` via or-patterns - \`async.ml\`: tail-calls to awaitable functions fall into the same async-lowering arm (defensive — a producer should not emit them, but if it did, the lowering would still be correct) - \`compile_classical.ml\`, \`compile_enhanced.ml\`: both backends treat \`TailCallPrim\` like \`CallPrim\`; backend specialisation to emit \`return_call\` lands next. The \`tailcall.ml\` IR pass needs no change yet — its catch-all \`PrimE (p, es)\` arm handles \`TailCallPrim\` correctly (children descended in non-tail context, no rewrite). The producer extension that converts non-self tail-positioned \`CallPrim\` into \`TailCallPrim\` (gated by \`--experimental-tailcalls\`) is the next step, alongside the backend codegen. \`test/bench/tailcall.mo\` baseline cycles unchanged (verified against the committed .ok), confirming this is a pure plumbing commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both backends now branch on the prim variant in the call arm. When the prim is \`TailCallPrim\` and the callee is a known function index (\`SR.Const Const.Fun (..., mk_fi, _)\`), emit \`ReturnCall <fi>\` and omit the trailing \`FakeMultiVal.load\` (control never returns to that point). Otherwise — closure calls (\`Type.Local\` via \`call_indirect\`) and shared/IC calls — fall through to the regular call path even when the prim is \`TailCallPrim\`; the producer should not emit those, but falling through is semantics-preserving if it does. Validation note: wasm \`return_call\` requires the callee's result type to match the enclosing function's. Motoko's all-multival-via-side-channel ABI gives every wasm function the result type \`[]\`, so the constraint is trivially satisfied for direct calls between Motoko functions. Dormant for now: the \`tailcall.ml\` IR pass does not yet produce \`TailCallPrim\`. The bench cycle count is unchanged (verified against the committed .ok). The producer extension — emit \`TailCallPrim\` for non-self tail-positioned calls when \`--experimental-tailcalls\` is set — is the final piece. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the \`Tailcall\` pass with a third arm. Behaviour: - Self tail-call with matching type instantiation → loop rewrite (existing path; strictly cheaper than \`return_call\` for self). - Otherwise, in tail position, with \`--experimental-tailcalls\` set → emit \`PrimE (TailCallPrim insts, ...)\`. The codegen path already in place lowers it to wasm \`return_call\`. - Else → ordinary \`CallPrim\` (no behaviour change). The pipeline order \`async_lowering ; tailcall_optimization\` (\`pipeline.ml:796-797\`) means awaitable calls have already been desugared by the time we see them, so this arm cannot tag a shared or local-async call as \`TailCallPrim\`. Measured against \`test/bench/tailcall.mo\` (Hutton VM, fak 10 ×1000): baseline (no flag): 15_439_607 cycles with --experimental-tailcalls: 25_416_088 cycles So this is currently a *regression* on the IC instruction counter (~65% more cycles per mutual dispatch hop). The codegen is correct (\`fak10 = 3_628_800\` in both). The cause is not \`moc\`-side: the emitted \`return_call\` replaces a \`call\` with no surrounding code change, yet the IC's wasmtime/metering charges \`return_call\` more heavily than \`call\` followed by an implicit return. So the infrastructure is ready, but the actual win has to wait on either runtime/metering improvements or selective producer heuristics. Default behaviour is unchanged — the flag is opt-in. The committed bench .ok stays at the baseline number. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "What We Do NOT Need (initially)" bullet for tail calls is now out of date — the slice was delivered ahead of the broader 2.0.2 sync. Flips the bullet and adds a § *Tail-call instructions (delivered)* covering the commit chain, the design choices (pipeline ordering, direct-call-only scope, validation argument, why self-tail still loops), and the IC instruction-counter finding (\`return_call\` is ~65% pricier per hop, so the flag is a bounded-stack opt-in rather than a perf knob). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Records the design idea for a per-call source-level opt-in. Captures: - Motivation: per-call granularity matches the property \`return_call\` provides (bounded stack at metered cost); equally valuable as a declarative diagnostic à la Scala \`@tailrec\`. Near- term workaround for \`mo:core\` recursive algorithms while the IC's per-instruction \`return_call\` cost remains elevated. - Surface syntax: both \`(with tailcall = true)\` and the punning \`(with tailcall)\` form, neither requiring grammar changes. - Typecheck constraint: must be compile-time-known Bool. - Lowering: bypass the flag-gated producer arm, go straight to \`TailCallPrim\` and reuse existing codegen. - Four open design questions: flag interaction, where to run the tail-position warning, self-recursion behaviour, indirect calls. Marked *proposed* (not delivered). Existing "Future work" bullets restructured so the annotation is the headline next step and "producer heuristics" is explicitly subsumed by it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ail-calls Reframes the indirect-tail-call follow-up around its actual use case — *computed* tail-calls, where the callee is a value chosen at runtime rather than a statically-known name. That's the complement of the delivered direct-call path: VM dispatchers that index into a handler table, CPS-transformed continuations, dynamic-dispatch interpreters. Section covers: - Motivation: what direct \`return_call\` *can't* do. - Codegen plumbing: \`Closure.call_closure\` gains an \`is_tail\` parameter; both backends touched symmetrically. - Wasm validation precondition: type-table entry's result type must match the enclosing function's. Motoko's uniform ABI satisfies this in practice but emission must enforce it explicitly. - Bench coverage: a sibling \`tailcall-computed.mo\` storing opcode handlers as closures isolates the indirect-call premium against the existing direct-call bench. - Cycle-cost expectation: unknown without measurement; \`call_indirect\` is already pricier than \`call\`, so the relative tail-call premium may differ. - Interaction with \`(with tailcall)\`: the annotation's "direct only" warning becomes unnecessary once this lands. Existing "Future work" bullet collapses to a forward reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lls\` Folds the experimental-tailcalls flag directly into the source so the bench unconditionally measures the TCO codegen path — what it's named for. Re-accepts the .ok against the EOP+TCO baseline: cycles = 25_416_088 (was 15_439_607) The earlier 15.4M number came from running before the producer + backend codegen landed AND with run-test's classical-persistence default; the new 25.4M is EOP + TCO, matching what \`make tailcall.only\` in this repo's bench Makefile actually exercises. Side-effect of TCO codegen: wasm-validate doesn't know \`return_call\` (opcode 0x12), so its expected output is the error itself (committed as \`tailcall.valid.ok\` / \`.valid.ret.ok\`). Could pass \`--enable-tail-call\` to wasm-validate in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Empirical: IC instruction-counter cost" section said the flag caused a ~65% cycle regression. That was a methodology error: the "baseline" 15.4M came from a committed .ok captured before the producer + backend landed, and \`run-test\`'s default persistence mode (classical) differs from the bench Makefile's (EOP). Honest same-tree, same-EOP-setting comparison: | build | cycles | | --- | --- | | no \`--experimental-tailcalls\` | 26_052_088 | | \`--experimental-tailcalls\` | 25_416_088 | | delta | -636_000 (TCO cheaper, ~2.5%) | Reframes the section: TCO is mildly *cheaper* on the cycle axis, in agreement with the cost table (Call=5, ReturnCall=3). The flag's primary value remains bounded-stack guarantees for mutual recursion / VM dispatchers / CPS, with the cycle reduction as a small bonus. Bench .ok was relocked to the with-flag EOP baseline in \`e24ebc003\`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…\` bench The \`FuncE\` arm in \`tailcall.ml\` set \`tail_pos = true\` for every function body unconditionally. For Shared functions (post async- lowering: \`Shared+Replies\` for awaitables, \`Shared+Returns\` for one-shot oneway updates) that's wrong: the wasm-level wrapper has \`message_start ; user-body ; message_cleanup\` (state-machine transition + GC) — cleanup runs *below* the body, so the body is not in tail position from the wasm function's perspective. With \`--experimental-tailcalls\` set, the producer arm would emit \`TailCallPrim\` for the body's last call and codegen would lower to \`return_call \$\$lambda.N\`, bypassing the cleanup. The lifecycle state stays \`InUpdate\`, and the next message traps with "internal error: unexpected state entering InUpdate" when the runtime tries \`Idle → InUpdate\` again. Fix: gate \`tail_pos\` on \`s = Type.Local\`. Shared bodies are descended with \`tail_pos = false\` (via \`exp\` instead of \`tailexp\`), so the producer can no longer label the wrapper-to- body call as tail-position. Existing self-tail-recursion → loop rewrite for Local functions is preserved unchanged. Also adds the \`gauss\` bench: naïve self-recursive \`foldLeft (+) 0\` over [1..100] × 10_000. Complements the existing VM bench by exercising the loop-rewrite path (\`tailcall.ml:185-200\`) — today both flag-on and flag-off compile foldLeft to a wasm \`loop\`. When the loop-rewrite is later removed in favour of uniform \`return_call\` codegen, the cycle delta on \`gauss\` will be the load-bearing measurement. Pass \`--enable-tail-call\` to \`wasm-validate\` in the test runner so benches that emit \`return_call\` no longer trip on the validator side. Drops the \`tailcall.valid.ok\` / \`.valid.ret.ok\` files that were capturing the validator's \`unexpected opcode: 0x12\` error — no longer needed. Bench numbers (EOP, \`--experimental-tailcalls\`): go (VM, mutual TCO): 25_416_088 cycles / 1_000 iter gauss (loop-rewritten foldLeft): 110_520_270 cycles / 10_000 iter The \`gauss\` bench is what surfaced the codegen bug above — without it the trap would not have fired (the existing \`go\` alone runs as the *first* update message and hits the cleanup-bypass invisibly). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tailcalls\`
When the flag is on, the existing self-tail-call → loop rewrite is
skipped, so self-recursive tail calls fall through to the
\`TailCallPrim\` arm and codegen emits \`return_call\` instead of the
\`loop { … local.set; br 0 }\` machinery. When the flag is off, the
loop rewrite still fires (current default behaviour preserved).
Measured on \`test/bench/tailcall.mo\` \`gauss\` (naïve self-recursive
\`foldLeft (+) 0 [1..100]\` × 10_000):
loop-rewrite (flag off / today): 110_520_270 cycles
return_call (flag on, new): 101_400_270 cycles
delta: -9_120_000 (~8.2% cheaper)
The loop rewrite has its own overhead — it copies args into mutable
temps and adds a \`loop\` / \`block\` wrapper — and per the IC's cost
table, \`return_call\` (3) plus the wasm-level arg-passing it would do
anyway beats that overhead in this benchmark. So removing the loop
rewrite in favour of uniform \`return_call\` codegen is genuinely
cheaper, not just stack-bounded.
The flag now means: "use \`return_call\` for ALL tail calls, both
self-tail and cross-function." Loop rewrite remains as the legacy
fallback for flag-off.
Bench .ok re-locked at \`gauss = 101_400_270\` (with-flag baseline);
\`go\` unchanged at \`25_416_088\`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
wasm-exts to provide new instructions (e.g. SIMD!)return_call codegen via --experimental-tailcalls
…numbers Marks the tail-call slice on \`gabor/wasm-exts-sync\` as complete: - Stack table: adds the Shared-fn fix + gauss-bench commit (\`6491d605c\`) and the flag-gated loop-elision commit (\`e84769da5\`). - Design notes: replaces "self-recursion still loops" with the Shared-bodies-excluded fix and the flag-gated self-tail loop-rewrite story. - Empirical: adds the \`gauss\` table — self-tail-recursion under the flag is **8.2% cheaper** than the loop rewrite. Combined with mutual TCO's 2.5% (\`go\`), the flag is a strict win on every measured shape, contradicting the intuition that "loop rewrite is cheaper than return_call for self-tail." - Future work: drops the now-superseded "producer heuristics" bullet, adds a "default-on \`--experimental-tailcalls\`" item (flip default + remove loop rewrite once IC support is universal — empirical data already favours it). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Names the bench after what it actually exercises — the Hutton/Bahr stack VM. \`hutton\` (mutual-tail-recursive VM) and \`gauss\` (self-tail-recursive foldLeft) now form a coherent named pair. Plan doc updated to track the rename. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@crusso — pinging you on the tail-call slice. Self-contained and opt-in behind
So Two concrete future directions worth noting alongside what's already in the plan (§ Source-level annotation
Together with the Happy to chase any of these as separate follow-up PRs if there's interest. |
Pulls the conditional inside \`G.i\` so we just pick the constructor (\`ReturnCall\` vs \`Call\`) without duplicating the surrounding \`G.i (...)\` wrapper. The trailing \`FakeMultiVal.load\` is now unconditional: after \`return_call\` it sits in unreachable code, which the wasm validator accepts via the post-terminator polymorphic-stack rule. For arity ≤ 1 (the common case) \`FakeMultiVal.load\` is \`G.nop\` so this changes nothing on the wire; for arity > 1 it adds a few \`global.get\` bytes that never execute. Per @ggreif's review on PR #6043. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test-runner's \`extract_directive\` (\`test-runner/src/run_test.rs\`) does substring-find on \`//MOC-FLAG\` per line, no line-start anchor, so my prose mention \`//MOC-FLAG --experimental-tailcalls\` in the top-of-file comment was being parsed as a second directive — the trailing backtick after the matched substring then landed inside moc's flag string, and moc rejected \`--experimental-tailcalls\\`\` as an unknown option. Cheapest fix: rephrase the prose so the literal \`//MOC-FLAG\` substring no longer appears. Proper fix is to anchor \`extract_directive\` on line-start; left as a follow-up since this PR shouldn't change the test-runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Description all sounds sensible but haven't looked at code at all. Impressive! |
The code is surprisingly straightforward too! |
| cycle axis, so the migration is a strict improvement plus | ||
| bounded-stack. | ||
|
|
||
| ## Source-level annotation: `(with tailcall)` — *proposed* |
There was a problem hiding this comment.
As soon as the cycle counts are stabilised and the tail-call instructions are consistently cheaper than the stack-eaters, we'll flip the default for --experimental-tailcalls. Then we'll remove the flag from the algorithm (and with it the loop emulation) and reuse the flag for TCMC and other clever stuff.
| sensible recommendation for the `mo:core` library or only for niche | ||
| deep-recursion cases. | ||
|
|
||
| ### Interaction with `(with tailcall)` annotation |
There was a problem hiding this comment.
When the return_call* is cheaper (and better) in all regards, (with tailcall) won't make sense.
There was a problem hiding this comment.
It might make sense to understand it as a directive: error out if the tail-call is not realisable by the compiler mid-end transform.
ggreif
left a comment
There was a problem hiding this comment.
Do we want to document the flag to the user?
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`#6043` shipped tail-call codegen for direct `call`s only — closure- dispatched calls (anything reached via `call_indirect`) silently degraded to non-tail dispatch. mccarthy-style mutually tail-recursive `async* Bool` chains stack-overflow at ~1k hops because every `await*` goes through the closure dispatch path. Three tightly coupled changes flip indirect tail-calls on: 1. `Closure.call_closure` (both backends) gains an optional `~is_tail` parameter and emits `ReturnCallIndirect` when set. 2. The `_, Type.Local` arm in `compile_*.ml` (the closure-call producer in `CallPrim _ | TailCallPrim _ as cp`) threads `is_tail` through. 3. **The linker.** `linkModule.ml`'s `rename_types` pass renumbers type-table indices when merging RTS + user module — and its instruction matcher handled `CallIndirect` but NOT `ReturnCallIndirect`, so the new tail-call sites kept stale pre-merge type indices and the binary failed wasmtime's validator with bogus type signatures (e.g. `(func (result i64))` on a 4-i64-no-result call site). Validates and runs end-to-end on the IC: the new `mccarthy` row in `test/bench/tailcall.mo` (mutually tail-recursive `isEven`/`isOdd` in `async* Bool`, chained via `await*`) executes 100k mutual hops in ~547 cycles each, no stack growth. `tailcall.valid.{ok,ret.ok}` capture wabt's stricter (and arguably buggy) reading of `return_call_indirect` under `table64`: wabt accepts an i64 table-index for `call_indirect` but expects i32 for `return_call_indirect`. wasmtime treats them symmetrically and accepts our binary. Tracking these messages keeps the test honest to the toolchain we ship with — if wabt fixes the asymmetry the .ok files will need re-acceptance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the indirect-tail-call work out of "future work" / "proposed" and into the delivered table. Updates the codegen-scope note to reflect that closure dispatch now also emits `ReturnCallIndirect`, and records the linker-side fix (the bug that hid the codegen change behind a wasmtime validation error). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update: extending TCO to indirect calls + linker fixSpotted a real gap while drafting a follow-up bench: the original PR's TCO covered direct Fixed in commit
New
|
| Bench | Mode | Cycles |
|---|---|---|
| hutton | direct call → return_call |
25,416,088 |
| gauss | self-recursive foldLeft → return_call |
101,400,270 |
| mccarthy | indirect call_indirect → return_call_indirect (new) |
54,700,107 |
100k mutual hops, no stack growth. ~547 cycles per hop.
tailcall.valid.{ok,ret.ok} quirk
wabt's wasm-validate rejects return_call_indirect with an i64 table-index under table64 even though it accepts the same i64 index for plain call_indirect. wasmtime treats them symmetrically and accepts our binary; the canister runs end-to-end. The new valid.ok files capture wabt's error messages so the test harness stays honest about that toolchain asymmetry — if wabt closes the gap upstream we'll re-accept.
Lift the var-construction (`nr (mk_fi ())` / `nr table_index, nr ty`) out of both branches so the constructor choice is the only thing the `if` decides. Style nit suggested mid-review; same emitted code, no functional change. Applied to all four sites: direct + indirect, classical + enhanced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Filed the upstream fix for the wabt Once that lands and we bump our pinned wabt, the |
`return_call_indirect` under `table64` rejects an i64 table-index in
upstream wabt, while the matching `call_indirect` accepts it — a
clean asymmetry in the validator. The patched wabt mirrors the
working `OnCallIndirect` plumbing in `OnReturnCallIndirect` (4 lines
+ a regression test). With the patch, our `tailcall` bench's
`return_call_indirect` sites validate clean and the temporary
`tailcall.valid.{ok,ret.ok}` files this PR introduced are no longer
needed — drop them.
Patch is fetched from the open upstream PR via `fetchpatch`. Drop
this overlay (and the two-line `__intentionallyOverridingVersion`
shim) once the fix lands and nixpkgs picks it up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ackage.nix` Replaces the local `super.wabt.overrideAttrs + fetchpatch` overlay with a single-file flake input pinned at NixOS/nixpkgs#517726 (wabt 1.0.41) and a trivial `super.callPackage wabt-package-src {}` overlay. `type = "file"` fetches just the ~5 KB `package.nix` expression (narHash pinned in flake.lock) — no second nixpkgs tarball, no second nixpkgs evaluation. `super.callPackage` injects stdenv, cmake, python3, gtest, fetchFromGitHub, … from the host nixpkgs, so the resulting derivation is identical to what NixOS/nixpkgs#517726 would produce upstream. Wins over `5a8df27df`'s fetchpatch: - `wabt --version` prints `1.0.41` (was `1.0.40-pre-2744`); drops the `__intentionallyOverridingVersion = true` shim. - No fetchpatch hash drift to chase. - Cleanup at end-of-life is mechanical: drop the input, drop the overlay, no version-string unwind. `nix eval` confirms `pkgs.wabt.version == "1.0.41"`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erge commit PR was merged 2026-05-08 14:41 UTC. Switching the `type = "file"` input from the fork (`ggreif/nixpkgs@d26416cf`) to the upstream merge commit (`NixOS/nixpkgs@5a610f5b4f`). The narHash is identical — the merged file is byte-for-byte the fork's content, confirming the PR landed exactly as proposed. Verified locally: `nix develop --command wasm-validate --version` and `wat2wasm --version` both report `1.0.41`. Next cleanup (separate commit, when nixpkgs-unstable rolls past the merge): drop the input + the overlay entirely, falling back to plain `pkgs.wabt`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master switched to `nixpkgs-unstable` (PR #6105 / #6104) which carries wabt 1.0.41 by default, so the `wabt-package-src` flake input and the `super.callPackage` overlay introduced for #517726-bridging are no longer needed. Removes: - `wabt-package-src` input + outputs arg in `flake.nix` - `super.callPackage wabt-package-src {}` overlay in `nix/pkgs.nix` - corresponding entry from `flake.lock` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
End-to-end opt-in support for wasm tail calls (
return_call/return_call_indirectAST in wasm-exts;TailCallPrimIR; producer pass; codegen for direct calls), gated behind the new--experimental-tailcallsflag. With the flag on, all tail calls — both self-recursive and cross-function — lower to wasmreturn_call, giving bounded stack for mutually-recursive / VM-dispatcher-shaped code at a small cycle benefit relative to the existing self-tail-call → loop rewrite. With the flag off, current behaviour is preserved exactly.This PR was originally scoped as a broader
wasm-extssync to upstreamwasm-2.0.2(SIMD, reference types, multi-memory, …). After confirming that tail calls are not inwasm-2.0.2(the OCaml interpreter at that release predates the proposal merge — see.claude/plans/wasm-exts-update.md), the tail-call work was carved out as its own self-contained slice on top of the existing wasm-exts. The broader 2.0.2 sync remains future work.What's delivered
f2c69f3e7ReturnCall/ReturnCallIndirectAST variants, smart constructors, encoder (opcodes0x12/0x13), decoder (opcodes removed from the illegal-list)a7ebc369d--experimental-tailcallsflag wiring inflags.ml+moc.ml(no behaviour change yet)dd2d8337aTailCallPrim of Type.typ listdeclaration + exhaustive plumbing through interpreter, type-checker, effect inference, async lowering, both backendsef2c7fb0ccompile_classical.mlandcompile_enhanced.mldirect-call arm:is_tailderived from prim, emitsReturnCall <fi>and skips trailingFakeMultiVal.loadwhen the prim isTailCallPrim4dbdb1712tailcall.mlgains a third arm: in tail position, when the self-recursion loop-rewrite doesn't apply and--experimental-tailcallsis set, emitTailCallPriminstead ofCallPrim6491d605ctailcall.mlFuncEarm gatestail_pos = trueons = Type.Local. Shared bodies are descended withtail_pos = false, so the producer cannot label the wrapper-to-body call as tail-position. Fixes a cleanup-bypass trap that surfaced oncegaussexercised the post-update-message path. Also passes--enable-tail-calltowasm-validatein the test-runner (test-runner/src/run_test.rs).e84769da5tailcall.mlself-tail-call → loop rewrite is now gatednot !Flags.experimental_tailcalls. With the flag on, self-tail calls fall through toTailCallPrimand becomereturn_call; with the flag off, the legacy loop rewrite still fires.fb38c1b24,e24ebc003,6491d605c,e84769da5test/bench/tailcall.mo: Hutton/Bahr-style stack VM runningfak(mutual TCO) + naïve self-recursivefoldLeftrunning 5-Year-Old Gauss (loop-rewrite vsreturn_call). Self-documents via//MOC-FLAG --experimental-tailcalls.Empirical (EOP,
--experimental-tailcallson)huttonfak(10)× 1_000gaussfoldLeft (+) 0 [1..100]× 10_000Comparisons (same tree, same EOP setting, only the flag flipped):
hutton(mutual): −636_000 cycles (~2.5%) vscall. Pure cost-table delta (Call=5 → ReturnCall=3× dispatch count). Also bounded-stack — VM dispatchers can run forever without growing the wasm stack.gauss(self-tail): −9_120_000 cycles (~8.2%) vs the loop rewrite. The loop rewrite has its own overhead (mutable arg-temps +loop/blockwrapper) thatreturn_callskips. Removing the loop rewrite under the flag is cheaper, not just stack-bounded.Design notes
tailcall_optimizationruns afterasync_lowering(pipeline.ml:796-797), so by the time the producer arm sees aCallPrim, awaitable / IC calls have already been desugared. The producer therefore cannot mistakenly tag a shared call as a tail call.TailCallPrimis honoured in theSR.Const Const.Fun (..., mk_fi, _)arm (known function index → emitReturnCall <fi>). Closure calls (Type.Localviacall_indirect) and shared calls fall through to the regular path even if the prim isTailCallPrim. Extending toreturn_call_indirectfor computed tail-calls is described in the plan as a follow-up.return_callrequires the callee's wasm result type to match the enclosing function's. Motoko's all-multival-via-side-channel ABI gives every wasm function the result type[](or a singlei64), so the constraint holds trivially for direct calls between Motoko functions.public func foo() : async () = bodycompiles to a wasm wrapper of the formmessage_start ; user-body ; message_cleanup(state-machine transition + GC). The body is not in tail position from the wasm wrapper's perspective. Lettingreturn_callescape the body would skip the cleanup, leaving the lifecycle stuck atInUpdate— the next update message then traps with'internal error: unexpected state entering InUpdate'. The fix is to descend Shared FuncE bodies withtail_pos = false.Future work (scoped out, kept in the plan)
wasm-exts2.0.2 sync — the original ambition of this branch (SIMD, reference types, multi-memory, exception handling, …). Tail calls are not in 2.0.2 anyway, so the slice was carved out independently.(with tailcall)— per-call surface knob (presence-form via record-punning on alet tailcall = true) with a compile-timeBool-const constraint and a tail-position warning. Useful workaround formo:corerecursive algorithms.return_call_indirectfor computed tail-calls — VM dispatchers that index into a handler table, CPS continuations, dynamic dispatch in interpreters. Requires extendingClosure.call_closure. The wasm-exts AST/codec already supports it; codegen lowering does not.See
.claude/plans/wasm-exts-update.mdfor the full design treatment, including open design questions for the annotation and the IC pricing measurements that informed the empirical comparison.Related
.claude/plans/abstract-interpreter.md— background on the ConstTrack abstract interpreter (relevant when emitting new instructions).Operator::ReturnCallIndirectcost from 60 → 6 (10× cut), affecting the proposedreturn_call_indirectfollow-up but not this PR's direct-call path.🤖 Generated with Claude Code