Make renderers.client backend-agnostic (vLLM, SGLang, ...) #49

@hallerite

Problem

renderers.client.generate(...) is the online-HTTP path users reach for when they want a renderer-driven rollout against a running inference server. Today it is hard-wired to vLLM:

  • Endpoint. Posts to /inference/v1/generate — a vLLM 0.20 extension. SGLang serves its analog at /generate with a different body shape.
  • Body schema. { "model", "token_ids", "sampling_params", "features" } matches vLLM's /inference/v1/generate.
  • Response shape. data.choices[0].token_ids, routed_experts, the vLLM logprobs envelope — these are vLLM-specific extensions.
  • MM payload. _build_mm_features / _build_qwen_vl_features import vllm.entrypoints.serve.disagg.mm_serde and vllm.multimodal.inputs.MultiModalKwargsItems directly; the encoded base64 items are vLLM's MultiModalKwargsItem wire format.
  • Context-length error handling (added in #48, "feat(client): pre-flight overflow check with auto-discovery"). The context-length phrase list and the ModelCard.max_model_len lookup on the verifier side both assume vLLM-shaped responses.
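
To make the body-schema mismatch concrete, here is a minimal sketch of the two request bodies built from the same inputs. The vLLM keys are the ones listed above. The SGLang keys (`input_ids`, `sampling_params`, and `max_new_tokens` in place of `max_tokens`) are my reading of SGLang's native `/generate` API and should be checked against the SGLang version being targeted. Both function names are hypothetical.

```python
from __future__ import annotations

from typing import Any


def vllm_generate_body(
    model: str,
    token_ids: list[int],
    sampling_params: dict[str, Any],
    features: Any | None = None,
) -> dict[str, Any]:
    """Body for vLLM's /inference/v1/generate, using the keys listed above."""
    return {
        "model": model,
        "token_ids": token_ids,
        "sampling_params": sampling_params,
        "features": features,
    }


def sglang_generate_body(
    token_ids: list[int],
    sampling_params: dict[str, Any],
) -> dict[str, Any]:
    """Body for SGLang's native /generate (key names are assumptions)."""
    params = dict(sampling_params)  # copy so the caller's dict is not mutated
    # Assumption: SGLang calls the generation budget max_new_tokens.
    if "max_tokens" in params:
        params["max_new_tokens"] = params.pop("max_tokens")
    return {"input_ids": token_ids, "sampling_params": params}
```

The point is only that the translation is small and mechanical, which is what makes a per-backend adapter attractive.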

Not all downstream consumers use vLLM, so it is reasonable to offer a stock SGLang client as well. Doing that requires making the current client extensible.

Proposal

Factor out an explicit RendererBackend protocol in renderers/client.py and let users pick the backend at call time. Sketch:

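from typing import Protocol

from openai import AsyncOpenAI

class RendererBackend(Protocol):
    async def generate(
        self,
        *,
        client: AsyncOpenAI,
        token_ids: list[int],
        sampling_params: dict,
        multi_modal_data: MultiModalData | None,
        # ... further keyword arguments elided in this sketch
    ) -> EngineResponse: ...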

    async def resolve_max_prompt_len(
        self, *, client: AsyncOpenAI, model: str
    ) -> int | None: ...

    def is_overflow_error(self, exc: BaseException) -> bool: ...
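
A self-contained toy version of the protocol, to show that a backend only needs to match the method shapes and does not have to inherit from anything. `EngineResponse` and `EchoBackend` here are hypothetical stand-ins, and `client` is typed as `Any` so the snippet runs without the `openai` package installed.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@dataclass
class EngineResponse:
    """Hypothetical stand-in for the engine-neutral response type."""
    token_ids: list[int]
    logprobs: list[float] = field(default_factory=list)


@runtime_checkable
class RendererBackend(Protocol):
    async def generate(
        self, *, client: Any, token_ids: list[int],
        sampling_params: dict, multi_modal_data: Any | None,
    ) -> EngineResponse: ...

    async def resolve_max_prompt_len(self, *, client: Any, model: str) -> int | None: ...

    def is_overflow_error(self, exc: BaseException) -> bool: ...


class EchoBackend:
    """Toy backend: no HTTP, returns the prompt reversed. No inheritance needed."""

    async def generate(self, *, client, token_ids, sampling_params, multi_modal_data):
        return EngineResponse(token_ids=list(reversed(token_ids)))

    async def resolve_max_prompt_len(self, *, client, model):
        return 4096

    def is_overflow_error(self, exc):
        return False


# Structural typing: a type checker accepts this assignment without subclassing.
backend: RendererBackend = EchoBackend()

if __name__ == "__main__":
    resp = asyncio.run(
        backend.generate(client=None, token_ids=[1, 2, 3],
                         sampling_params={}, multi_modal_data=None)
    )
    print(resp.token_ids)
```

Note that `runtime_checkable` only checks that the methods exist, not their signatures, so it is a sanity check rather than a guarantee.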

Concrete implementations:

  • VLLMBackend — owns /inference/v1/generate, the existing body/response shape, the MM features serde (moves _build_qwen_vl_features here so vllm.* imports are scoped to this file), and the ModelCard max_model_len lookup.
  • SGLangBackend — owns /generate (or the OpenAI-compat path SGLang exposes), the SGLang body/response shape, its analogue of max_model_len (e.g. get_server_info), and the SGLang error shape. Mirrors the examples/sglang/ pattern.
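
Since each backend owns its own error shape, `is_overflow_error` could be as simple as a per-engine phrase list. A rough sketch, assuming substring matching on the exception message is good enough. The phrases below are illustrative placeholders, not the list that #48 ships and not verified engine output, and `make_overflow_check` is a hypothetical helper.

```python
from __future__ import annotations

from typing import Callable, Iterable

# Illustrative phrases only; real lists would be taken from each engine's error text.
VLLM_OVERFLOW_PHRASES = ("maximum context length", "maximum model length")
SGLANG_OVERFLOW_PHRASES = ("longer than the model's context length",)


def make_overflow_check(phrases: Iterable[str]) -> Callable[[BaseException], bool]:
    """Build an is_overflow_error predicate from a list of message fragments."""
    lowered = tuple(p.lower() for p in phrases)

    def is_overflow_error(exc: BaseException) -> bool:
        # Case-insensitive substring match against the exception message.
        message = str(exc).lower()
        return any(p in message for p in lowered)

    return is_overflow_error


is_vllm_overflow = make_overflow_check(VLLM_OVERFLOW_PHRASES)
is_sglang_overflow = make_overflow_check(SGLANG_OVERFLOW_PHRASES)
```

Keeping the lists separate per backend means an SGLang wording change never affects vLLM classification, and vice versa.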

generate(...) then becomes thin glue:

async def generate(*, client, renderer, messages, model, backend: RendererBackend, ...):
    prompt_ids, stop_token_ids, mm_data = await _maybe_offload(renderer, _prepare)
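    # max_prompt_len is not defined in this sketch; presumably caller-supplied or
    # obtained from backend.resolve_max_prompt_len(client=client, model=model).
    if max_prompt_len is not None and len(prompt_ids) > max_prompt_len: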
        raise PromptTooLongError(...)
    try:
        engine_resp = await backend.generate(client=client, token_ids=prompt_ids, ...)
    except Exception as exc:
        if backend.is_overflow_error(exc):
            raise PromptTooLongError(...) from exc
        raise
    return await _maybe_offload(renderer, lambda: renderer.parse_response(...))
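
A runnable reduction of the glue above, with rendering and parsing stripped out so only the two overflow paths remain: the pre-flight length check and the translation of an engine-side error into `PromptTooLongError`. `FakeBackend`, its 4-token limit, and the simplified signatures are all hypothetical. How `max_prompt_len` gets resolved (here: from the backend when the caller passes none) is an assumption, not something the sketch above specifies.

```python
from __future__ import annotations

import asyncio


class PromptTooLongError(ValueError):
    """Raised when the prompt does not fit the engine's context window."""


class FakeBackend:
    """Hypothetical in-memory backend with a 4-token context window."""

    limit = 4

    async def generate(self, *, token_ids: list[int]) -> list[int]:
        if len(token_ids) > self.limit:
            raise RuntimeError("prompt exceeds maximum context length")
        return [t + 1 for t in token_ids]

    async def resolve_max_prompt_len(self) -> int | None:
        # Pretend discovery failed so the server-side error path gets exercised.
        return None

    def is_overflow_error(self, exc: BaseException) -> bool:
        return "maximum context length" in str(exc)


async def generate(*, prompt_ids: list[int], backend, max_prompt_len: int | None = None):
    if max_prompt_len is None:
        max_prompt_len = await backend.resolve_max_prompt_len()
    # Pre-flight check: fail before any request is sent.
    if max_prompt_len is not None and len(prompt_ids) > max_prompt_len:
        raise PromptTooLongError(f"{len(prompt_ids)} > {max_prompt_len} tokens")
    try:
        return await backend.generate(token_ids=prompt_ids)
    except Exception as exc:
        # Only the backend knows what its overflow errors look like.
        if backend.is_overflow_error(exc):
            raise PromptTooLongError("engine rejected the prompt as too long") from exc
        raise


if __name__ == "__main__":
    print(asyncio.run(generate(prompt_ids=[1, 2], backend=FakeBackend())))
```

Callers see a single exception type either way, and `__cause__` still carries the original engine error when the failure came from the server.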

Why composition over subclassing: the axes are orthogonal (engine × MITO/TITO × text/MM) and a subclass tree blows up combinatorially.

Compatibility

renderers.client.generate(...) should keep its current call signature, with backend=VLLMBackend() as the default. Existing callers (prime-rl orchestrator, verifiers' RendererClient) keep working with no source change; opting into another engine is a one-kwarg flip. After the migration window, the vllm.* imports can live only inside VLLMBackend, and the rest of the package becomes engine-free.
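
One way the default could be wired, as a sketch: a `None` sentinel that resolves to `VLLMBackend()` inside the function. This behaves the same for callers as `backend=VLLMBackend()` in the signature, but avoids constructing one shared instance at import time, which would matter if backends ever hold per-server state (an assumption on my part). The stub classes and the string return value are placeholders.

```python
from __future__ import annotations


class VLLMBackend:
    """Placeholder for the real vLLM backend."""
    name = "vllm"


class SGLangBackend:
    """Placeholder for the real SGLang backend."""
    name = "sglang"


def generate(*, model: str, backend=None) -> str:
    # Existing callers pass no backend and keep getting vLLM behaviour.
    backend = backend if backend is not None else VLLMBackend()
    return f"{model} via {backend.name}"


if __name__ == "__main__":
    print(generate(model="my-model"))
    print(generate(model="my-model", backend=SGLangBackend()))
```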
