
feat(base): per-token is_content mask for body/scaffold attribution#53

Open
snimu wants to merge 4 commits into main from sebastian/content-mask-2026-05-19

@snimu snimu commented May 19, 2026

Summary

  • Adds is_content: list[bool] to RenderedTokens — a per-token signal that generalises sampled_mask across all roles: True iff the token came from message-body bytes (caller-provided content / tool_calls / reasoning_content, or the model's sampled emission for assistant), False iff template scaffolding (role tags, closers when not sampled, separators, tool-response wraps, tools-header block, generation prompt).
  • Adds build_training_sample(..., content_sft_roles={"tool"}) so a single render produces a loss mask that combines RL on assistant tokens with SFT on tool response bodies — without supervising the surrounding <|tool_response> / role-tag specials that would interrupt a real rollout.
  • Wires the mask through every hand-coded renderer (13 of them). Token IDs are byte-identical with apply_chat_template.
  • renderers.client.generate() returns the renderer's per-token attribution as prompt_attribution: RenderedTokens, so downstream consumers (verifiers RendererClient → prime-rl) carry the body/scaffold cut to the trainer without re-rendering.

Motivation

For RL the policy loss applies only to tokens the model emitted. A useful auxiliary objective is SFT on tool response bodies — supervise the model to anticipate what tools return, without supervising the wrap. If the model learns to emit <|tool_response> itself, it can derail a rollout by short-circuiting the harness.

sampled_mask answers "would the model emit this?", which is the right cut for assistant tokens but is uniformly False on non-assistant roles. There is no way to ask "which tokens came from message-body bytes" on tool / user / system messages using sampled_mask alone.

is_content is that signal. For a tool message wrapped as <|im_start|>user\n<|tool_response>\n{body}\n<|tool_response_end|><|im_end|>\n, is_content is True only on the {body} tokens — never on the <|tool_response> specials or the inter-section newlines.

By construction is_content == sampled_mask over every assistant-attributed token; on every other role sampled_mask is uniformly False and is_content carries information sampled_mask cannot. Equivalently, every token with sampled_mask=True also has is_content=True, so the is_content=True set is a superset of (possibly equal to) the sampled_mask=True set and never contradicts it.

API

On RenderedTokens:

  • is_content: list[bool] — same length / empty policy as sampled_mask. Empty means the renderer opts out (DefaultRenderer leaves it empty for the same reason it leaves sampled_mask empty: Jinja is opaque).
  • content_token_spans_by_role() -> dict[str, list[tuple[int, int]]] — contiguous body-only token runs grouped by message role.
  • content_mask_for_roles(roles) -> list[bool] — per-token bool mask, True only on body tokens whose message role is in the supplied set.
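A minimal sketch of how the two helpers above might behave, using a hypothetical ToyRenderedTokens stand-in rather than the real class. It assumes message_indices holds one message index per token (-1 for tokens outside any message), that message_roles is indexed by message, that spans are half-open (start, end), and that an empty is_content yields an all-False mask; none of these details are confirmed by this description.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRenderedTokens:
    """Hypothetical stand-in, NOT the real renderers.base.RenderedTokens."""

    token_ids: list
    message_indices: list  # assumed: per-token message index, -1 for none
    message_roles: list    # assumed: role of each message, by message index
    is_content: list = field(default_factory=list)

    def _role_at(self, k):
        idx = self.message_indices[k]
        return self.message_roles[idx] if 0 <= idx < len(self.message_roles) else None

    def _aligned(self):
        # Both parallel lists must match token_ids before we index into them.
        n = len(self.token_ids)
        return len(self.is_content) == n and len(self.message_indices) == n

    def content_mask_for_roles(self, roles):
        if not self._aligned():
            return [False] * len(self.token_ids)  # assumed opt-out behaviour
        return [
            self.is_content[k] and self._role_at(k) in roles
            for k in range(len(self.token_ids))
        ]

    def content_token_spans_by_role(self):
        spans = {}
        if not self._aligned():
            return spans
        n = len(self.token_ids)
        run_start = None
        for k in range(n + 1):  # one extra step flushes a run that ends at n
            is_body = k < n and self.is_content[k] and self._role_at(k) is not None
            continues = (
                is_body
                and run_start is not None
                and self.message_indices[k] == self.message_indices[run_start]
            )
            if run_start is not None and not continues:
                spans.setdefault(self._role_at(run_start), []).append((run_start, k))
                run_start = None
            if is_body and run_start is None:
                run_start = k
        return spans


toy = ToyRenderedTokens(
    token_ids=list(range(8)),
    message_indices=[0, 0, 0, 1, 1, 1, 1, -1],
    message_roles=["user", "tool"],
    is_content=[False, True, False, False, True, True, False, False],
)
tool_mask = toy.content_mask_for_roles({"tool"})
spans = toy.content_token_spans_by_role()
```

In this toy render, tool_mask is True only on tokens 4 and 5, and spans maps "user" to [(1, 2)] and "tool" to [(4, 6)].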

Module-level in renderers.base:

  • attribute_text_segments(tokenizer, segments) — single-BPE-pass attribution via offset_mapping. When the supplied tokenizer doesn't track offsets (fastokens patch), lazy-loads a vanilla offset-capable tokenizer for the same model and caches it process-globally.
  • build_training_sample(..., content_sft_roles=...) — opt-in body-only supervision for roles the model never samples. Falls back to the role_to_mask + sampled_mask behaviour when is_content is empty.
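The combination that content_sft_roles describes can be sketched as a per-token OR of the two signals. The function below is a hypothetical simplification, not build_training_sample itself: it ignores role_to_mask, takes a per-token token_roles list (an assumed input), and falls back to sampled_mask alone when is_content is empty.

```python
def toy_loss_mask(sampled_mask, is_content, token_roles, content_sft_roles=frozenset()):
    """Hypothetical sketch: RL on sampled tokens plus SFT on selected role bodies."""
    if len(is_content) != len(sampled_mask):
        # Renderer opted out of is_content: keep the sampled-only behaviour.
        return list(sampled_mask)
    return [
        sampled or (content and role in content_sft_roles)
        for sampled, content, role in zip(sampled_mask, is_content, token_roles)
    ]


token_roles = ["user", "user", "assistant", "assistant", "tool", "tool", "tool"]
sampled_mask = [False, False, True, True, False, False, False]
is_content = [False, True, True, True, False, True, False]

loss_mask = toy_loss_mask(sampled_mask, is_content, token_roles, {"tool"})
fallback_mask = toy_loss_mask(sampled_mask, [], token_roles, {"tool"})
```

Here loss_mask trains the two assistant tokens and the single tool body token (index 5), while the user body and the tool wrap tokens stay unsupervised.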

On renderers.client.generate():

  • Result field prompt_attribution: RenderedTokens | None — the per-token attribution for the prompt, either the one this call computes via render() internally or the one the caller threaded in alongside prompt_ids. Downstream consumers call attr.content_mask_for_roles({"tool"}) on it to build selective loss masks without re-rendering.
  • Parameter prompt_attribution: RenderedTokens | None = None — callers that pre-build prompt_ids (the multi-turn bridge path in verifiers) hand in the RenderedTokens that bridge_to_next_turn returned, and it surfaces on the result unchanged.

The new field on generate() mirrors the existing multi_modal_data sidecar — same shape, same None-default-when-unknown semantics.
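The pass-through semantics described above can be sketched as follows. resolve_prompt and render_fn are hypothetical names for illustration only, and a plain dict stands in for RenderedTokens; the real generate() signature is not reproduced here.

```python
def resolve_prompt(render_fn, messages, prompt_ids=None, prompt_attribution=None):
    """Hypothetical sketch of which attribution ends up on the result."""
    if prompt_ids is None:
        # No pre-built prompt: render internally and surface the full render.
        rendered = render_fn(messages)
        return rendered["token_ids"], rendered
    # Pre-built prompt: surface whatever the caller threaded in (may be None).
    return prompt_ids, prompt_attribution


def fake_render(messages):
    # Stand-in renderer producing one token per message.
    return {"token_ids": list(range(len(messages))), "is_content": [True] * len(messages)}


ids_a, attr_a = resolve_prompt(fake_render, ["hi", "there"])
bridge_attr = {"token_ids": [7, 8, 9], "is_content": [False, True, False]}
ids_b, attr_b = resolve_prompt(fake_render, [], prompt_ids=[7, 8, 9], prompt_attribution=bridge_attr)
ids_c, attr_c = resolve_prompt(fake_render, [], prompt_ids=[7, 8, 9])
```

The third call shows the gap callers are meant to detect: a pre-built prompt with no attribution yields None instead of a guessed mask.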

How it works

Every renderer has emit sites like emit_text("user\n" + content, ...) that join wrap text and body text into one BPE pass to preserve token merges at the boundary. The emit_text_segments(...) helper (defined locally in each renderer) does the same join with per-token attribution:

  1. Concatenate segment texts and run a single BPE encode.
  2. Use the fast tokenizer's offset_mapping to recover each token's character span.
  3. Attribute each token to whichever segment contains its first source character.
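The three steps above can be sketched in isolation. This is an illustration under stated assumptions, not the body of attribute_text_segments: attribute_by_offsets is a hypothetical name, and toy_offsets is a regex stand-in for a real tokenizer. With a Hugging Face fast tokenizer, the offsets would instead come from tokenizer(text, return_offsets_mapping=True)["offset_mapping"].

```python
import re
from bisect import bisect_right


def attribute_by_offsets(segments, offsets):
    """Attribute each token to the segment containing its first character.

    segments: list of (text, is_content) pairs, concatenated in order.
    offsets:  list of (start, end) character spans, one per token.
    """
    starts, flags, pos = [], [], 0
    for text, is_content in segments:
        if not text:
            continue  # empty segments own no characters
        starts.append(pos)
        flags.append(is_content)
        pos += len(text)
    # bisect_right finds the last segment whose start is <= the token start.
    return [flags[bisect_right(starts, tok_start) - 1] for tok_start, _ in offsets]


def toy_offsets(text):
    # Toy tokenizer: newline runs merge into one token, like a BPE merge might.
    return [m.span() for m in re.finditer(r"\n+|\w+|[^\w\n]", text)]


segments = [("<tag>\n", False), ("\nbody", True)]
text = "".join(t for t, _ in segments)
mask = attribute_by_offsets(segments, toy_offsets(text))
```

The toy input is chosen so that one token ("\n\n") straddles the wrap/body boundary. Because its first character belongs to the wrap segment, it is attributed to scaffold, and only the final "body" token is marked as content.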

fastokens (the Rust BPE patched in by default for ~10x faster encode) doesn't track offsets. attribute_text_segments transparently loads a vanilla offset-capable tokenizer for the same model and caches it process-globally per model name. Most models in MODEL_RENDERER_MAP produce byte-identical token IDs between fastokens and vanilla, so the mix is safe; models in FASTOKENS_INCOMPATIBLE already use vanilla everywhere.

A few renderers use tokenizers that can't provide offset mapping at all and rely on per-renderer alternatives:

  • The Kimi K2 family uses TikTokenTokenizer. Its renderer avoids concatenated wrap+body emits to begin with: Kimi's structure splits wrap and body at special-token boundaries, so threading is_content through the split emits suffices.
  • gpt-oss (Harmony) has an opaque prefix block. Its renderer diffs the rendered prefix against an empty-instructions render to recover the developer-instructions body span inside it.
  • MiniMax M2 has a BPE-merge between <response> and the body's first letter under certain tokenizer load orders. A local emit_token_overlap_body helper picks the overlap rule so the body's leading byte stays recoverable from its body run.
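The diff-based recovery mentioned for gpt-oss can be illustrated with a common-prefix / common-suffix comparison of two token lists. This is a hypothetical sketch: inserted_span is an invented name, and whether the real renderer diffs token IDs or text is not stated here. It also assumes the two renders differ only by one contiguous insertion.

```python
def inserted_span(with_body, without_body):
    """Return the half-open (start, end) span in with_body that is absent from without_body."""
    n_min = min(len(with_body), len(without_body))
    prefix = 0
    while prefix < n_min and with_body[prefix] == without_body[prefix]:
        prefix += 1
    suffix = 0
    # Cap the suffix so it never overlaps the already-matched prefix.
    while suffix < n_min - prefix and with_body[-1 - suffix] == without_body[-1 - suffix]:
        suffix += 1
    return prefix, len(with_body) - suffix


empty_render = [1, 2, 9, 3]        # prefix rendered with empty instructions
full_render = [1, 2, 7, 8, 9, 3]   # same prefix with a two-token body inserted
start, end = inserted_span(full_render, empty_render)
body_tokens = full_render[start:end]
```

One caveat of this approach: if the body's edge tokens happen to equal their scaffold neighbours, the recovered span can shift by a token, so a real implementation would need a tie-breaking rule.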

Per-renderer coverage

Renderer               Notes
---------------------  ------------------------------------------------------------
qwen3                  Reference implementation.
qwen3.5                XML-style tool calls. Auto-detected enable_thinking polarity preserved.
qwen3.6                Inherits from Qwen35Renderer; only overrides a pure string serializer, so it picks up is_content through the parent class.
qwen3-vl               <|image_pad|> placeholders are body (is_content=True); the surrounding <|vision_start|> / <|vision_end|> are scaffold.
glm5 / glm5.1          GLM5Renderer covers both via subclass. Also covers zai-org/GLM-4.7-Flash.
glm4.5                 <|observation|> / <tool_response> wraps are scaffold; body is content.
kimi-k2                TikTokenTokenizer: uses existing split-emit boundaries (no attribute_text_segments).
kimi-k2.5 / kimi-k2.6  Multimodal: <|media_pad|> is body; <|media_begin|>...<|media_end|> wrap is scaffold.
minimax-m2             XML-style tool calls. FASTOKENS_INCOMPATIBLE (vanilla everywhere). Local overlap helper for <response> BPE merge.
deepseek-v3            FASTOKENS_INCOMPATIBLE (Metaspace pretokenizer). Standard wrap/body split.
nemotron-3             Tool body uses emit_text_segments for \n boundaries.
laguna-xs.2            Default-system header text is scaffold; caller-supplied system content is body.
gpt-oss                Harmony format. functions.{name} text on tool result messages is scaffold (comes from prior assistant tool_calls, not this tool's content).

DefaultRenderer leaves is_content empty.

Tests

tests/test_is_content.py runs 10 invariants × a 17-model matrix, covering:

  • Length matches token_ids or is empty (opt-out).
  • is_content == sampled_mask over assistant tokens.
  • Generation-prompt tokens are is_content=False.
  • User / tool / system bodies are recoverable from the decoded is_content=True run.
  • First role-tag token is is_content=False.
  • content_token_spans_by_role() isolates tool body cleanly.
  • content_mask_for_roles({"tool"}) excludes assistant.
  • build_training_sample(..., content_sft_roles={"tool"}) trains tool body + assistant, never user.
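A few of the invariants listed above can be expressed as a small checker. This is a hypothetical sketch, not code from tests/test_is_content.py: find_violations is an invented name, and it assumes a per-token token_roles list in which None marks tokens outside any message (for example the generation prompt).

```python
def find_violations(token_ids, sampled_mask, is_content, token_roles):
    """Return human-readable descriptions of broken invariants (empty list if none)."""
    if not is_content:
        return []  # renderer opted out; nothing to check
    if len(is_content) != len(token_ids):
        return ["is_content length does not match token_ids"]
    problems = []
    for k, role in enumerate(token_roles):
        if role == "assistant" and is_content[k] != sampled_mask[k]:
            problems.append(f"token {k}: assistant is_content != sampled_mask")
        if role is None and is_content[k]:
            problems.append(f"token {k}: token outside any message marked as content")
    return problems


token_ids = [10, 11, 12, 13, 14]
token_roles = ["user", "user", "assistant", "assistant", None]
sampled_mask = [False, False, True, True, False]

good = find_violations(token_ids, sampled_mask, [False, True, True, True, False], token_roles)
bad = find_violations(token_ids, sampled_mask, [False, True, False, True, True], token_roles)
```

Returning descriptions instead of raising keeps every failure for one render visible at once, which is convenient when the same check runs across a model matrix.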

tests/test_client.py covers the prompt_attribution surface on generate():

  • The parse-and-build test asserts prompt_attribution carries every populated RenderedTokens field through verbatim.
  • The bridge shape (caller passes both prompt_ids and prompt_attribution) passes attribution through unchanged.
  • The pre-built-prompt-without-attribution path returns None so callers can detect the gap.

Full suite collects 1557 tests, all of which pass (modulo pre-existing gpt-oss HF-parity skips and one unrelated xfail). test_render_ids byte-identity vs apply_chat_template is green on every renderer.

Additional fixes

  • nemotron3: message_roles was sourced from the normalised message list, which is off by one whenever a default system message is auto-injected at the front. It now indexes the caller-provided message list.
  • kimi_k2: the same off-by-one is fixed via a caller_messages snapshot.

Notes for the maintainer

  • bridge_to_next_turn populates is_content on the bridge-emitted portion only; the prior portion (previous_prompt_ids + previous_completion_ids) gets [False] * len(previous_ids) per the same convention sampled_mask follows on bridge output. Consumers walk the trajectory and read each step's own is_content for full-conversation body masks.
  • The vanilla tokenizer loaded for offset_mapping is cached process-globally per model name (not per pool), so a 32-slot pool of the same model adds exactly one extra tokenizer to memory, not 32.
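The bridge convention in the first note means a consumer has to stitch per-step masks together to get a full-conversation body mask. A minimal sketch of that walk follows; stitch_content_masks is a hypothetical helper, and it assumes each step supplies the number of carried-over tokens (prev_len) and a mask covering that step's whole prompt. Positions this sketch learns nothing about, such as prior completion tokens, are left False here; a real consumer would presumably fill them from each step's completion data.

```python
def stitch_content_masks(steps):
    """Combine per-step is_content masks into one full-conversation mask.

    steps: list of (prev_len, step_mask) in trajectory order, where the first
    prev_len entries of step_mask are False by the bridge convention.
    """
    full = []
    for prev_len, step_mask in steps:
        # Pad over tokens no earlier step described (e.g. the prior completion).
        full.extend([False] * (prev_len - len(full)))
        # Only the bridge-emitted tail of this step carries new information.
        full[prev_len:] = step_mask[prev_len:]
    return full


step0 = (0, [False, True, False])                       # initial render, 3 tokens
step1 = (5, [False] * 5 + [False, True, True, False])   # 3 prompt + 2 completion carried over
full_mask = stitch_content_masks([step0, step1])
```

In this example the user body from the first step (index 1) survives into the stitched mask even though the second step reports it as False.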

Note

Add per-token is_content body/scaffold attribution mask to all renderers

  • Adds an is_content: list[bool] field to RenderedTokens in renderers/base.py that marks each token as caller/model body (True) or template scaffolding (False).
  • Introduces attribute_text_segments in renderers/base.py to tokenize concatenated (text, is_content) segments in a single BPE pass using offset mapping, preserving merge boundaries while attributing each token to the correct segment.
  • Implements is_content population across all renderers (qwen3, qwen35, qwen3_vl, deepseek_v3, gpt_oss, kimi_k2, kimi_k25, laguna_xs2, minimax_m2, nemotron3, glm45, glm5), including render, bridge/bridge_to_next_turn, and assistant/tool helpers.
  • Extends build_training_sample with a content_sft_roles parameter that restricts loss to body-only tokens for specified roles using is_content, leaving behavior unchanged when the field is absent or empty.
  • Adds content_token_spans_by_role and content_mask_for_roles helpers to RenderedTokens for downstream span extraction.
  • Behavioral Change: assistant tokens enforce is_content == sampled_mask; message_roles in some renderers now reflects the original caller message list rather than the post-normalized list.

Macroscope summarized 281d89b.

snimu and others added 2 commits May 19, 2026 13:36
Generalizes sampled_mask across all roles. is_content[k] is True iff
token k came from message-body bytes — caller-provided content /
tool_calls / reasoning_content, or the model's sampled emission for
assistant — and False iff template scaffolding (role tags, closers
when not sampled, inter-turn separators, tool-response wraps,
tools-header block, generation prompt). By construction is_content ==
sampled_mask over every assistant-attributed token; carries new
information on every other role where sampled_mask is uniformly False.

Enables SFT on tool response bodies while applying RL only to
assistant tokens — build_training_sample(..., content_sft_roles={"tool"})
trains the model to anticipate tool outputs without learning to emit
the surrounding <|tool_response>/role-tag scaffold (which would
interrupt a real rollout).

New on RenderedTokens:
- is_content: list[bool] field (empty when the renderer opts out, same
  policy as sampled_mask)
- content_token_spans_by_role()
- content_mask_for_roles(roles)

New module-level helpers in base.py:
- attribute_text_segments(tokenizer, segments) — single-BPE-pass
  attribution via offset_mapping; auto-loads a vanilla offset-capable
  tokenizer when the supplied one doesn't track offsets (fastokens
  patch), cached process-globally per model name.
- build_training_sample(..., content_sft_roles=...) — opt-in body-only
  supervision for roles the model never samples. Falls back to the
  prior role_to_mask + sampled_mask behaviour when is_content is
  empty.

Wired through every hand-coded renderer: qwen3, qwen3.5, qwen3.6
(inherits), qwen3-vl, glm5, glm5.1, glm4.5, kimi-k2, kimi-k2.5/2.6,
minimax-m2, deepseek-v3, nemotron-3, laguna-xs.2, gpt-oss. Concatenated
wrap+body emits go through emit_text_segments (or per-renderer
equivalents) so BPE merges at the boundary stay byte-identical with the
prior single-emit path. Renderers whose tokenizer doesn't support
offset_mapping (Kimi, MiniMax with its known fastokens edge case) use
boundary-aware emit patterns or a per-renderer overlap rule to keep
body bytes recoverable.

Multimodal placeholders (<|image_pad|>, <|media_pad|>) are body
(is_content=True) — they represent caller-provided image data in
token form. The surrounding vision/media wrap specials are scaffold.

Fixed along the way:
- nemotron3: off-by-one in message_roles when a default system was
  auto-injected. Now indexes the caller-provided message list.
- kimi_k2: same off-by-one fixed via a caller_messages snapshot.

Tests: 10 invariants × 17-model matrix in tests/test_is_content.py.
Token IDs stay byte-identical vs apply_chat_template across every
renderer; existing test_render_ids / test_sampled_mask / test_bridge /
test_build_helpers / test_tokens_per_message / test_roundtrip /
test_multimodal stay green. DefaultRenderer leaves is_content empty
(Jinja is opaque), same policy as sampled_mask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Output of `uv run ruff format` after wiring is_content through the
renderers in the previous commit. No semantic changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@snimu snimu requested a review from hallerite May 19, 2026 12:11

macroscopeapp Bot commented May 19, 2026

Approvability

Verdict: Needs human review

New feature adding per-token body/scaffold attribution (is_content mask) across 15+ renderer files, with changes to build_training_sample that affect how training loss masks are computed. The scope, complexity, and training-behavior impact warrant human review.



snimu commented May 19, 2026

@macroscope-app


macroscopeapp Bot commented May 19, 2026

Hi! How can I help? I can:

  • Run a full code review on this PR
  • Answer questions about the code or architecture
  • Help fix issues (if you identify any)

Just let me know what you need!


snimu commented May 19, 2026

please run a correctness check


snimu commented May 19, 2026

@macroscope-app please run a correctness check


macroscopeapp Bot commented May 19, 2026

Correctness check is running. Results will be posted when complete.

Previously completed check runs:

Comment thread renderers/base.py Outdated
snimu and others added 2 commits May 19, 2026 14:25
…y_role

content_token_spans_by_role only checked that is_content matched
token_ids length, but it then walks spans returned by
message_token_spans() — whose values are indices into message_indices.
When len(message_indices) > len(token_ids), the span end can exceed
len(is_content), and the inner loop dereferences self.is_content[k]
past the end, raising IndexError.

The sister method content_mask_for_roles already checks both lengths
(introduced in the same commit). Match the precedent.

Not reachable through the renderer pipeline today — every hand-coded
renderer populates the four parallel lists in lock-step — but the
dataclass doesn't enforce the invariant, so a manually-constructed
RenderedTokens (e.g. a test fixture) could trip it. The guard is
cheap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

generate() already calls renderer.render() internally when the caller
doesn't pre-supply prompt_ids, producing a RenderedTokens that carries
token_ids, message_indices, sampled_mask, is_content, message_roles,
and multi_modal_data. Previously we surfaced only token_ids and
multi_modal_data and dropped the rest at the function boundary.
Callers that wanted per-token attribution downstream (verifiers'
RendererClient → prime-rl, for SFT-on-tool-body / selective loss
masking) had no way to recover it without a second render pass.

Two surfaces:

- New return field ``prompt_attribution``: the full RenderedTokens for
  the prompt — either the one this call computed via render() or the
  one the caller threaded in alongside prompt_ids. Downstream consumers
  call e.g. ``content_mask_for_roles({"tool"})`` on it to build the
  SFT-on-tool-body mask without re-rendering.
- New optional parameter ``prompt_attribution``: callers that pre-built
  prompt_ids (the multi-turn bridge path in verifiers) can hand in the
  RenderedTokens that bridge_to_next_turn returned, and it surfaces on
  the result unchanged.

Mirrors the existing multi_modal_data plumbing — same shape, same
None-default-when-unknown semantics. No behavioural change for callers
that don't read the new field.

Tests:

- Existing parse-and-build test now asserts prompt_attribution carries
  every populated RenderedTokens field through verbatim.
- New test for the pre-built-prompt-with-attribution path (the bridge
  shape verifiers will use).
- New assertion that pre-built-prompt-without-attribution returns
  ``None`` so callers can detect the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

snimu commented May 19, 2026

@macroscope-app please run a correctness check again


macroscopeapp Bot commented May 19, 2026

Correctness check complete. No issues found.
