Skip to content

feat: renderer self-describes prefix-stability for SFT / RL / inference consumers #41

@hallerite

Description

@hallerite

Motivation

A renderer's output for render(messages[:k]) vs render(messages[:k+1]) can differ in non-obvious ways depending on the template's history-handling semantics. Today this knowledge is implicit — each renderer's emit_* helpers encode it, but consumers (SFT dataloaders, RL trajectory builders, inference incremental rendering) have to either hard-code per-template assumptions or assume the worst case.

Two concrete consumer problems this causes:

  1. SFT loses intermediate reasoning on non-prefix-stable templates. build_training_sample does a single full-conversation render. For Qwen3/GLM5 defaults, that single render strips reasoning from every assistant turn except the last (history-stripping). The model is trained to produce reasoning only on the final turn. Workarounds today: set preserve_all_thinking=True (changes inference semantics too) or use build_incremental_token_mask (errors out on non-prefix-stable templates — IncrementalTokenizationError). Neither is great.
  2. Inference bridges can't short-circuit. RendererClient._get_incremental_prompt_ids always calls bridge_to_next_turn, even when the renderer is fully prefix-stable and a naive concat would be correct. The bridge does real work it doesn't need to.

The renderer is the source of truth for these semantics — it has to be, because it implements the emit logic. Today it doesn't expose that truth.

Proposed API

A declarative property describing which append-boundaries preserve the rendered prefix:

# renderers/base.py
from dataclasses import dataclass
from typing import Literal

Boundary = Literal["tool", "user", "system", "developer"]

@dataclass(frozen=True)
class RenderStability:
    """Which append-boundaries preserve the rendered prefix.

    ``boundary in preserves_through`` means: for any messages M and a single
    appended message m where ``m["role"] == boundary``, ``render(M).token_ids``
    is a prefix of ``render(M + [m]).token_ids``. The bridge for that boundary
    is then a trivial "append new tokens" — no history transformation.

    Boundaries *not* in the set may still be appendable (via ``bridge_to_next_turn``),
    but the renderer transforms earlier tokens (e.g. strips reasoning from a prior
    assistant when a new user turn arrives).
    """
    preserves_through: frozenset[Boundary]

    @property
    def fully_stable(self) -> bool:
        return self.preserves_through >= {"tool", "user", "system", "developer"}


class Renderer:
    @property
    def stability(self) -> RenderStability: ...

Per-renderer values

Renderer preserves_through Notes
Qwen3 default {"tool"} Stable within a tool cycle; user boundary strips assistant reasoning.
Qwen3 + preserve_thinking_between_tool_calls=True {"tool"} Same — that flag only preserves reasoning within the current cycle, which is the cycle's default for emit.
Qwen3 + preserve_all_thinking=True {"tool", "user", "system", "developer"} Reasoning preserved across every boundary.
GLM5 / variants analogous to Qwen3
Kimi K2, K2.5 TBD by template emit logic
DeepSeek V3, Nemotron 3, Laguna XS.2 TBD
GPT-OSS TBD (Harmony format)
DefaultRenderer (Jinja) frozenset() Opaque template, assume nothing.

Stability is dynamic w.r.t. construction-time flags (preserve_all_thinking, preserve_thinking_between_tool_calls): the same renderer class can advertise different stability depending on init args. Cache key into shared pools already includes these flags (verifiers/clients/renderer_client.py:486-494).

Consumer impact

SFT (in prime-rl)

build_training_sample queries renderer.stability:

  • fully_stable → current single-render path is correct.
  • "tool" in preserves_through but not "user" (Qwen3/GLM5 default) → split conversation at user boundaries; for each assistant-terminated segment, produce a separate training sample. Each captures that turn's reasoning under inference-correct history-stripping. Dataloader fans out 1 conversation → N samples.
  • empty (DefaultRenderer) → fall back to per-assistant render. Slowest, always correct.

The fan-out shifts batch sizing semantics: "batch of 32 conversations" becomes "batch of sum(stages_per_conversation) samples." Worth deciding upfront whether to expand at the dataloader level or pack N segments into one sample with attention isolation.

RL (verifiers / prime-rl orchestrator)

bridge_to_next_turn stays the canonical correctness path. When renderer.stability.fully_stable, the caller can skip the bridge and do a pure concat — an optimization, not a correctness change. Existing code keeps working.

Inference

Same — RendererClient._get_incremental_prompt_ids can branch on stability and skip the bridge dispatch for fully-stable renderers.

Open questions

  1. Include "developer"? GPT-OSS already handles a developer role (gpt_oss.py:442) for Harmony / Responses-API messages. Including it keeps GPT-OSS first-class without consumer-side special-casing. Templates that don't use developer messages just never see them — the flag is moot for them.
  2. Per-boundary bridge declarations? Could extend to RenderStability.bridge_required: frozenset[Boundary] so consumers know which bridges do non-trivial work vs. trivial-append. Today every renderer has one bridge_to_next_turn; this would be a separate signal.
  3. Tools change as a boundary? Today tools is passed alongside messages. If the tool list changes between renders (e.g. tool added mid-trajectory), the system-section content changes and prefix-stability breaks regardless of message role. Should the API model that too? (Probably yes, as a separate stable_under_tools_change: bool.)

Implementation sketch

  • Renderers side: add stability property to each renderer class. Most are one-liners returning a fixed RenderStability. Flag-aware ones (Qwen3, GLM5) compute it from their _preserve_*_thinking attributes.
  • New module: renderers.stability exporting Boundary, RenderStability, helper constants for common cases (e.g. FULLY_STABLE, STABLE_IN_TOOL_CYCLE).
  • Tests: parametrize the existing fixtures from tests/test_sampled_mask.py to also assert that render(M + [m]) extends render(M) exactly when m["role"] in stability.preserves_through. Catches drift between declared and actual behavior.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions