Make renderers.client backend-agnostic (vLLM, SGLang, ...)

## Problem

`renderers.client.generate(...)` is the online-HTTP path users reach for when they want a renderer-driven rollout against a running inference server. Today it is hard-wired to vLLM:

- **Endpoint.** Posts to `/inference/v1/generate` — a vLLM 0.20 extension. SGLang serves its analog at `/generate` with a different body shape.
- **Body schema.** `{ "model", "token_ids", "sampling_params", "features" }` matches vLLM's `/inference/v1/generate`.
- **Response shape.** `data.choices[0].token_ids`, `routed_experts`, the vLLM logprobs envelope — these are vLLM-specific extensions.
- **MM payload.** `_build_mm_features` / `_build_qwen_vl_features` import `vllm.entrypoints.serve.disagg.mm_serde` and `vllm.multimodal.inputs.MultiModalKwargsItems` directly; the encoded base64 items are vLLM's `MultiModalKwargsItem` wire format.
- **Context-length error handling** (added in #48): the context-length phrase list and `ModelCard.max_model_len` lookup on the verifier side both assume vLLM-shaped responses.

Not all downstream consumers use vLLM, so it is reasonable to also offer a stock SGLang client. To do that, the current client needs to become more extensible.

## Proposal

Factor out an explicit `RendererBackend` protocol in `renderers/client.py` and let users pick the backend at call time. Sketch:

```python
class RendererBackend(Protocol):
    async def generate(
        self,
        *,
        client: AsyncOpenAI,
        token_ids: list[int],
        sampling_params: dict,
        multi_modal_data: MultiModalData | None,
        ...
    ) -> EngineResponse: ...

    async def resolve_max_prompt_len(
        self, *, client: AsyncOpenAI, model: str
    ) -> int | None: ...

    def is_overflow_error(self, exc: BaseException) -> bool: ...
```

Concrete implementations:

- `VLLMBackend` — owns `/inference/v1/generate`, the existing body/response shape, the MM features serde (moves `_build_qwen_vl_features` here so `vllm.*` imports are scoped to this file), and the ModelCard `max_model_len` lookup.
- `SGLangBackend` — owns `/generate` (or the OpenAI-compat path SGLang exposes), the SGLang body/response shape, its analogue of `max_model_len` (e.g. `get_server_info`), and the SGLang error shape. Mirrors the `examples/sglang/` pattern.
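
To make the split concrete, here is a minimal sketch of how each backend could own its endpoint path and body shape. The vLLM path and keys are the ones listed under Problem. The `EngineRequest` container, the `*RequestBuilder` names, and the SGLang `input_ids` key are assumptions for illustration and should be checked against the SGLang server docs before anything ships.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class EngineRequest:
    """Hypothetical container: URL path plus JSON body for one generate call."""

    path: str
    body: dict[str, Any]


class VLLMRequestBuilder:
    """Builds the body shape the client sends today (see Problem)."""

    path = "/inference/v1/generate"

    def build(self, *, model: str, token_ids: list[int],
              sampling_params: dict[str, Any],
              features: Any = None) -> EngineRequest:
        return EngineRequest(self.path, {
            "model": model,
            "token_ids": token_ids,
            "sampling_params": sampling_params,
            "features": features,
        })


class SGLangRequestBuilder:
    """Assumed SGLang shape: pre-tokenized prompt under "input_ids"."""

    path = "/generate"

    def build(self, *, model: str, token_ids: list[int],
              sampling_params: dict[str, Any],
              features: Any = None) -> EngineRequest:
        # Sampling keys are passed through unchanged in this sketch; a real
        # backend would likely need to translate some of them.
        return EngineRequest(self.path, {
            "input_ids": token_ids,
            "sampling_params": sampling_params,
        })
```

Both builders accept the same keyword arguments, so the caller never has to branch on the engine.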

`generate(...)` then becomes thin glue:

```python
async def generate(*, client, renderer, messages, model, backend: RendererBackend, ...):
    prompt_ids, stop_token_ids, mm_data = await _maybe_offload(renderer, _prepare)
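    # Assumed step: the sketch never binds max_prompt_len, so resolve it via the backend.
    max_prompt_len = await backend.resolve_max_prompt_len(client=client, model=model)
    if max_prompt_len is not None and len(prompt_ids) > max_prompt_len: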
        raise PromptTooLongError(...)
    try:
        engine_resp = await backend.generate(client=client, token_ids=prompt_ids, ...)
    except Exception as exc:
        if backend.is_overflow_error(exc):
            raise PromptTooLongError(...) from exc
        raise
    return await _maybe_offload(renderer, lambda: renderer.parse_response(...))
```
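
A runnable, stdlib-only version of that control flow may help when reviewing the error handling. It is a sketch under simplifying assumptions: the protocol is reduced to token lists (no `client`, renderer, or multimodal data), and `FakeBackend` with its error string is a hypothetical in-memory stand-in, not a real engine.

```python
from __future__ import annotations

import asyncio
from typing import Protocol


class PromptTooLongError(Exception):
    """Raised when the prompt exceeds the engine's context window."""


class RendererBackend(Protocol):
    async def generate(self, *, token_ids: list[int], sampling_params: dict) -> list[int]: ...
    async def resolve_max_prompt_len(self, *, model: str) -> int | None: ...
    def is_overflow_error(self, exc: BaseException) -> bool: ...


class FakeBackend:
    """Hypothetical backend for tests; no HTTP involved."""

    def __init__(self, max_len: int | None, server_limit: int) -> None:
        self.max_len = max_len            # limit the backend can report up front
        self.server_limit = server_limit  # limit the "server" actually enforces

    async def generate(self, *, token_ids, sampling_params):
        if len(token_ids) > self.server_limit:
            raise RuntimeError("maximum context length exceeded")
        return [t + 1 for t in token_ids]

    async def resolve_max_prompt_len(self, *, model):
        return self.max_len

    def is_overflow_error(self, exc):
        return "maximum context length" in str(exc)


async def generate(*, prompt_ids: list[int], model: str,
                   backend: RendererBackend) -> list[int]:
    # Client-side check: fail fast when the backend knows its limit.
    max_prompt_len = await backend.resolve_max_prompt_len(model=model)
    if max_prompt_len is not None and len(prompt_ids) > max_prompt_len:
        raise PromptTooLongError(f"{len(prompt_ids)} > {max_prompt_len}")
    try:
        return await backend.generate(token_ids=prompt_ids, sampling_params={})
    except Exception as exc:
        # Server-side check: translate engine-specific overflow errors.
        if backend.is_overflow_error(exc):
            raise PromptTooLongError(str(exc)) from exc
        raise
```

With this shape, both the pre-flight length check and the server-side overflow surface as the same `PromptTooLongError`, and the only engine-specific knowledge is inside the backend.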

Why composition over subclassing: the axes are orthogonal (engine × MITO/TITO × text/MM) and a subclass tree blows up combinatorially.

## Compatibility

`renderers.client.generate(...)` should keep its current call signature, with `backend=VLLMBackend()` as the default. Existing callers (prime-rl orchestrator, verifiers' `RendererClient`) keep working with no source change; opting into another engine is a one-kwarg change. After the migration window, the `vllm.*` imports live only inside `VLLMBackend` and the rest of the package becomes engine-free.
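
The compatibility contract can be sketched in a few lines. The stub classes and the returned path string below are placeholders for illustration only. The sketch also uses a `None` sentinel instead of a literal `backend=VLLMBackend()` default, which is one possible way to avoid constructing a backend at import time; either form keeps existing call sites unchanged.

```python
from __future__ import annotations

import asyncio


class VLLMBackend:
    """Stub: stands in for the real vLLM backend."""

    path = "/inference/v1/generate"


class SGLangBackend:
    """Stub: stands in for the real SGLang backend."""

    path = "/generate"


async def generate(*, messages: list[dict], model: str, backend=None) -> str:
    # Callers that never pass `backend` keep getting vLLM behaviour.
    backend = backend if backend is not None else VLLMBackend()
    # Placeholder return value: the endpoint this call would hit.
    return backend.path


# Existing call site: no source change needed.
legacy = asyncio.run(generate(messages=[], model="m"))
# Opting into another engine: one extra keyword argument.
opted_in = asyncio.run(generate(messages=[], model="m", backend=SGLangBackend()))
```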
