Motivation
A renderer's output for render(messages[:k]) vs render(messages[:k+1]) can differ in non-obvious ways depending on the template's history-handling semantics. Today this knowledge is implicit — each renderer's emit_* helpers encode it, but consumers (SFT dataloaders, RL trajectory builders, inference incremental rendering) have to either hard-code per-template assumptions or assume the worst case.
Two concrete consumer problems this causes:
- SFT loses intermediate reasoning on non-prefix-stable templates.
build_training_sample does a single full-conversation render. For Qwen3/GLM5 defaults, that single render strips reasoning from every assistant turn except the last (history-stripping). The model is trained to produce reasoning only on the final turn. Workarounds today: set preserve_all_thinking=True (changes inference semantics too) or use build_incremental_token_mask (errors out on non-prefix-stable templates — IncrementalTokenizationError). Neither is great.
- Inference bridges can't short-circuit.
RendererClient._get_incremental_prompt_ids always calls bridge_to_next_turn, even when the renderer is fully prefix-stable and a naive concat would be correct. The bridge does real work it doesn't need to.
The renderer is the source of truth for these semantics — it has to be, because it implements the emit logic. Today it doesn't expose that truth.
Proposed API
A declarative property describing which append-boundaries preserve the rendered prefix:
# renderers/base.py
from dataclasses import dataclass
from typing import Literal

Boundary = Literal["tool", "user", "system", "developer"]


@dataclass(frozen=True)
class RenderStability:
    """Which append-boundaries preserve the rendered prefix.

    ``boundary in preserves_through`` means: for any messages M and a single
    appended message m where ``m["role"] == boundary``, ``render(M).token_ids``
    is a prefix of ``render(M + [m]).token_ids``. The bridge for that boundary
    is then a trivial "append new tokens", with no history transformation.

    Boundaries *not* in the set may still be appendable (via ``bridge_to_next_turn``),
    but the renderer transforms earlier tokens (e.g. strips reasoning from a prior
    assistant when a new user turn arrives).
    """

    preserves_through: frozenset[Boundary]

    @property
    def fully_stable(self) -> bool:
        return self.preserves_through >= {"tool", "user", "system", "developer"}


class Renderer:
    @property
    def stability(self) -> RenderStability: ...
Per-renderer values
| Renderer | preserves_through | Notes |
| --- | --- | --- |
| Qwen3 default | {"tool"} | Stable within a tool cycle; user boundary strips assistant reasoning. |
| Qwen3 + preserve_thinking_between_tool_calls=True | {"tool"} | Same: that flag only preserves reasoning within the current cycle, which is the cycle's default for emit. |
| Qwen3 + preserve_all_thinking=True | {"tool", "user", "system", "developer"} | Reasoning preserved across every boundary. |
| GLM5 / variants | analogous to Qwen3 | |
| Kimi K2, K2.5 | TBD by template emit logic | |
| DeepSeek V3, Nemotron 3, Laguna XS.2 | TBD | |
| GPT-OSS | TBD (Harmony format) | |
| DefaultRenderer (Jinja) | frozenset() | Opaque template, assume nothing. |
Stability depends on construction-time flags (preserve_all_thinking, preserve_thinking_between_tool_calls): the same renderer class can advertise a different stability depending on its init args. The cache key for shared pools already includes these flags (verifiers/clients/renderer_client.py:486-494).
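As a rough illustration of the flag dependence, here is a minimal sketch of a flag-aware stability property. ToyQwen3Renderer is a stand-in written for this example, not the real renderer, and the private attribute names are assumptions based on the _preserve_*_thinking pattern mentioned in the implementation sketch. The constants FULLY_STABLE and STABLE_IN_TOOL_CYCLE are the helper names proposed there.

```python
from dataclasses import dataclass
from typing import Literal

Boundary = Literal["tool", "user", "system", "developer"]


@dataclass(frozen=True)
class RenderStability:
    preserves_through: frozenset[Boundary]

    @property
    def fully_stable(self) -> bool:
        return self.preserves_through >= {"tool", "user", "system", "developer"}


# Helper constants for the two common cases in the table above.
FULLY_STABLE = RenderStability(frozenset({"tool", "user", "system", "developer"}))
STABLE_IN_TOOL_CYCLE = RenderStability(frozenset({"tool"}))


class ToyQwen3Renderer:
    """Stand-in for a flag-aware renderer (hypothetical, for illustration only)."""

    def __init__(
        self,
        preserve_all_thinking: bool = False,
        preserve_thinking_between_tool_calls: bool = False,
    ) -> None:
        # Attribute names are assumed, following the _preserve_*_thinking pattern.
        self._preserve_all_thinking = preserve_all_thinking
        self._preserve_thinking_between_tool_calls = preserve_thinking_between_tool_calls

    @property
    def stability(self) -> RenderStability:
        # Only preserve_all_thinking widens the set. Per the table, the
        # between-tool-calls flag leaves the declared stability at {"tool"}.
        if self._preserve_all_thinking:
            return FULLY_STABLE
        return STABLE_IN_TOOL_CYCLE
```

Because RenderStability is a frozen dataclass over a frozenset, instances compare by value and are hashable, so they could be folded into a cache key if that ever becomes useful.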
Consumer impact
SFT (in prime-rl)
build_training_sample queries renderer.stability:
- fully_stable → current single-render path is correct.
- "tool" in preserves_through but not "user" (Qwen3/GLM5 default) → split conversation at user boundaries; for each assistant-terminated segment, produce a separate training sample. Each captures that turn's reasoning under inference-correct history-stripping. Dataloader fans out 1 conversation → N samples.
- empty (DefaultRenderer) → fall back to per-assistant render. Slowest, always correct.
The fan-out shifts batch sizing semantics: "batch of 32 conversations" becomes "batch of sum(stages_per_conversation) samples." Worth deciding upfront whether to expand at the dataloader level or pack N segments into one sample with attention isolation.
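One possible reading of the user-boundary split, sketched below under stated assumptions: each training sample is the conversation prefix up to the end of an assistant-terminated segment, paired with the index where that segment starts so the loss can be limited to the new turn. The function name fan_out_prefixes and the (prefix, train_from) return shape are hypothetical and not part of build_training_sample.

```python
from typing import Any

Message = dict[str, Any]


def fan_out_prefixes(messages: list[Message]) -> list[tuple[list[Message], int]]:
    """Return (prefix, train_from) pairs, one per assistant-terminated segment.

    A segment starts at a user message (or at index 0) and runs until just
    before the next user message. `prefix` is the conversation up to the end
    of that segment; `train_from` is the index where the segment starts, so a
    caller could restrict the loss to assistant messages at or after it.
    """
    # Segment starts: index 0 plus the index of every user message.
    starts = sorted({0} | {i for i, m in enumerate(messages) if m["role"] == "user"})
    ends = starts[1:] + [len(messages)]
    samples: list[tuple[list[Message], int]] = []
    for start, end in zip(starts, ends):
        # Keep only non-empty segments whose last message is an assistant turn.
        if end > start and messages[end - 1]["role"] == "assistant":
            samples.append((messages[:end], start))
    return samples


conversation = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "weather?"},
    {"role": "assistant", "content": "calling tool"},
    {"role": "tool", "content": "sunny"},
    {"role": "assistant", "content": "It is sunny."},
    {"role": "user", "content": "thanks"},
    {"role": "assistant", "content": "Welcome."},
]
samples = fan_out_prefixes(conversation)
```

Here the seven-message conversation yields two samples, which is the 1 conversation → N samples fan-out that changes batch sizing.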
RL (verifiers / prime-rl orchestrator)
bridge_to_next_turn stays the canonical correctness path. When renderer.stability.fully_stable, the caller can skip the bridge and do a pure concat — an optimization, not a correctness change. Existing code keeps working.
Inference
Same — RendererClient._get_incremental_prompt_ids can branch on stability and skip the bridge dispatch for fully-stable renderers.
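The branch could look roughly like the sketch below. Everything here is a toy: the real bridge_to_next_turn signature is not shown in this proposal, so the two-argument form is an assumption, and render_suffix is a hypothetical helper standing in for "render only the new messages". The toy renderer counts bridge calls so the short-circuit is observable.

```python
from dataclasses import dataclass
from typing import Any

Message = dict[str, Any]


@dataclass(frozen=True)
class RenderStability:
    preserves_through: frozenset[str]

    @property
    def fully_stable(self) -> bool:
        return self.preserves_through >= {"tool", "user", "system", "developer"}


class ToyRenderer:
    """Toy renderer: one token per character, messages rendered independently."""

    def __init__(self, stability: RenderStability) -> None:
        self.stability = stability
        self.bridge_calls = 0

    def render_suffix(self, new_messages: list[Message]) -> list[int]:
        # Hypothetical helper: tokens for the appended messages only.
        return [ord(c) for m in new_messages for c in f"<{m['role']}>{m['content']}"]

    def bridge_to_next_turn(self, prev_ids: list[int], new_messages: list[Message]) -> list[int]:
        # Assumed signature. A real bridge may rewrite earlier tokens; the toy does not.
        self.bridge_calls += 1
        return prev_ids + self.render_suffix(new_messages)


def incremental_prompt_ids(
    renderer: ToyRenderer, prev_ids: list[int], new_messages: list[Message]
) -> list[int]:
    # Fully stable: a plain concat is correct, so skip the bridge dispatch.
    if renderer.stability.fully_stable:
        return prev_ids + renderer.render_suffix(new_messages)
    # Otherwise the bridge stays the canonical correctness path.
    return renderer.bridge_to_next_turn(prev_ids, new_messages)


stable = ToyRenderer(RenderStability(frozenset({"tool", "user", "system", "developer"})))
unstable = ToyRenderer(RenderStability(frozenset({"tool"})))
new = [{"role": "user", "content": "hi"}]
ids_stable = incremental_prompt_ids(stable, [1, 2], new)
ids_unstable = incremental_prompt_ids(unstable, [1, 2], new)
```

A finer-grained variant could check each appended message's role against preserves_through instead of requiring fully_stable, which would also cover tool-only appends on Qwen3-style defaults.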
Open questions
- Include "developer"? GPT-OSS already handles a developer role (gpt_oss.py:442) for Harmony / Responses-API messages. Including it keeps GPT-OSS first-class without consumer-side special-casing. Templates that don't use developer messages just never see them, so the flag is moot for them.
- Per-boundary bridge declarations? Could extend to RenderStability.bridge_required: frozenset[Boundary] so consumers know which bridges do non-trivial work vs. trivial-append. Today every renderer has one bridge_to_next_turn; this would be a separate signal.
- Tools change as a boundary? Today tools is passed alongside messages. If the tool list changes between renders (e.g. tool added mid-trajectory), the system-section content changes and prefix-stability breaks regardless of message role. Should the API model that too? (Probably yes, as a separate stable_under_tools_change: bool.)
Implementation sketch
- Renderers side: add stability property to each renderer class. Most are one-liners returning a fixed RenderStability. Flag-aware ones (Qwen3, GLM5) compute it from their _preserve_*_thinking attributes.
- New module: renderers.stability exporting Boundary, RenderStability, helper constants for common cases (e.g. FULLY_STABLE, STABLE_IN_TOOL_CYCLE).
- Tests: parametrize the existing fixtures from tests/test_sampled_mask.py to also assert that render(M + [m]) extends render(M) exactly when m["role"] in stability.preserves_through. Catches drift between declared and actual behavior.
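The drift check from the Tests item can be sketched as follows. The toy_render function is a hypothetical stand-in that mimics history-stripping (reasoning is kept only for assistant messages in the current tool cycle) and returns a flat list of string tokens rather than an object with token_ids. A real test would substitute each renderer and the existing fixtures, neither of which is reproduced here.

```python
from typing import Any, Callable

Message = dict[str, Any]
BOUNDARIES = ("tool", "user", "system", "developer")


def toy_render(messages: list[Message]) -> list[str]:
    """Toy history-stripping render: keep reasoning only in the current tool cycle."""
    # The current cycle starts right after the last non-assistant, non-tool message.
    cycle_start = max(
        (i + 1 for i, m in enumerate(messages) if m["role"] not in ("assistant", "tool")),
        default=0,
    )
    tokens: list[str] = []
    for i, m in enumerate(messages):
        tokens.append(f"<{m['role']}>")
        if m["role"] == "assistant" and i >= cycle_start and m.get("reasoning"):
            tokens.append("think:" + m["reasoning"])
        tokens.append(m["content"])
    return tokens


def extends(prefix: list[str], full: list[str]) -> bool:
    return full[: len(prefix)] == prefix


def declared_matches_actual(
    render: Callable[[list[Message]], list[str]],
    preserves_through: frozenset[str],
    messages: list[Message],
) -> bool:
    """True if appending each role preserves the prefix exactly when declared."""
    base = render(messages)
    for role in BOUNDARIES:
        appended = messages + [{"role": role, "content": "x"}]
        if extends(base, render(appended)) != (role in preserves_through):
            return False
    return True


fixture = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "reasoning": "plan", "content": "call tool"},
]
ok = declared_matches_actual(toy_render, frozenset({"tool"}), fixture)
drifted = declared_matches_actual(toy_render, frozenset({"tool", "user"}), fixture)
```

Note that the "exactly when" direction only bites on fixtures that actually contain strippable reasoning; a conversation with no reasoning would extend cleanly under every role, so the fixture set needs at least one reasoning-bearing assistant turn.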
Related
- […] sampled_mask AND in build_training_sample (closes the narrower "trains on <|im_start|>assistant\n scaffolding" issue)
- […] message_indices / sampled_mask / message_roles on bridge output
- […] preserve_*_thinking flags through prime-rl's RendererConfig
- […] <|im_start|>assistant\n scaffolding (use renderers sampled_mask) prime-rl#2492: narrower SFT scaffolding bug, fixed by "feat(base): add RenderedTokens.sampled_mask for SFT/RL parity" #33