Problem
renderers.client.generate(...) is the online-HTTP path users reach for when they want a renderer-driven rollout against a running inference server. Today it is hard-wired to vLLM:
Endpoint. Posts to /inference/v1/generate, a vLLM 0.20 extension. SGLang serves its analog at /generate with a different body shape.
Body schema. { "model", "token_ids", "sampling_params", "features" } matches vLLM's /inference/v1/generate.
Response shape. data.choices[0].token_ids, routed_experts, and the vLLM logprobs envelope are all vLLM-specific extensions.
MM payload. _build_mm_features / _build_qwen_vl_features import vllm.entrypoints.serve.disagg.mm_serde and vllm.multimodal.inputs.MultiModalKwargsItems directly; the encoded base64 items are vLLM's MultiModalKwargsItem wire format.
The ModelCard.max_model_len lookup on the verifier side likewise assumes vLLM-shaped responses.
Not all downstream consumers use vLLM, so it is reasonable to also offer a stock SGLang client. Doing that cleanly means making the current client more extensible.
Proposal
Factor out an explicit RendererBackend protocol in renderers/client.py and let users pick the backend at call time.
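One possible shape for such a protocol is sketched below. The member names (path, build_body, parse_response) and the EchoBackend toy class are assumptions made for illustration, not existing code in renderers/client.py. A real protocol would probably also cover the max_model_len lookup and error parsing.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class RendererBackend(Protocol):
    """Engine-specific pieces that generate(...) would delegate to.

    Member names are illustrative assumptions, not the package's real API.
    """

    path: str  # endpoint path, relative to the server base URL

    def build_body(
        self, model: str, token_ids: list[int], sampling_params: dict[str, Any]
    ) -> dict[str, Any]:
        """Translate an engine-neutral request into the engine's JSON body."""
        ...

    def parse_response(self, data: dict[str, Any]) -> list[int]:
        """Extract the generated token ids from the engine's JSON response."""
        ...


class EchoBackend:
    """Toy backend: satisfies the protocol structurally, with no inheritance."""

    path = "/echo"

    def build_body(self, model, token_ids, sampling_params):
        return {"model": model, "ids": token_ids, "params": sampling_params}

    def parse_response(self, data):
        return list(data["ids"])
```

Because the protocol is structural, third-party backends would not need to import or subclass anything from the package to be accepted.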
Concrete implementations:
VLLMBackend: owns /inference/v1/generate, the existing body/response shape, the MM features serde (moves _build_qwen_vl_features here so vllm.* imports are scoped to this file), and the ModelCard max_model_len lookup.
SGLangBackend: owns /generate (or the OpenAI-compat path SGLang exposes), the SGLang body/response shape, its analog of max_model_len (e.g. get_server_info), and the SGLang error shape. Mirrors the examples/sglang/ pattern.
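With backends like these, generate(...) could reduce to thin glue. The sketch below is a simplified stand-in rather than the real function: its signature is hypothetical, the HTTP call is replaced by an injected post callable so the example runs without a server, and the SGLang field names (input_ids, output_ids) are placeholders that would need checking against the SGLang documentation. The vLLM body keys are the ones listed in this issue.

```python
from typing import Any, Callable, Dict, List, Optional

# Injected transport: (path, json_body) -> json_response.
# A real client would POST with an HTTP library instead.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


class VLLMBackend:
    path = "/inference/v1/generate"

    def build_body(self, model, token_ids, sampling_params):
        # Key names follow the body schema described in this issue.
        return {
            "model": model,
            "token_ids": token_ids,
            "sampling_params": sampling_params,
            "features": None,  # text-only request in this sketch
        }

    def parse_response(self, data):
        return data["choices"][0]["token_ids"]


class SGLangBackend:
    path = "/generate"

    def build_body(self, model, token_ids, sampling_params):
        # ASSUMPTION: placeholder field names, not a verified SGLang schema.
        return {"input_ids": token_ids, "sampling_params": sampling_params}

    def parse_response(self, data):
        # ASSUMPTION: placeholder response key.
        return data["output_ids"]


def generate(
    post: Post,
    model: str,
    token_ids: List[int],
    sampling_params: Dict[str, Any],
    backend: Optional[Any] = None,
) -> List[int]:
    """Thin glue: every engine-specific decision lives in the backend."""
    backend = backend if backend is not None else VLLMBackend()
    body = backend.build_body(model, token_ids, sampling_params)
    data = post(backend.path, body)
    return backend.parse_response(data)


def fake_post(path: str, body: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in server returning canned responses per endpoint."""
    if path == "/inference/v1/generate":
        return {"choices": [{"token_ids": [7, 8]}]}
    return {"output_ids": [9]}
```

Calling generate(fake_post, "m", [1, 2], {}) takes the vLLM route by default, and passing backend=SGLangBackend() switches the engine without touching the glue.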
Why composition over subclassing: the axes are orthogonal (engine × MITO/TITO × text/MM) and a subclass tree blows up combinatorially.
Compatibility
renderers.client.generate(...) should keep its current call signature, with backend=VLLMBackend() as the default. Existing callers (prime-rl orchestrator, verifiers' RendererClient) keep working with no source change; opting into another engine is a one-kwarg flip. After the migration window, the vllm.* import can stay inside VLLMBackend and the rest of the package becomes engine-free.
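One way to keep the vllm.* import confined to VLLMBackend is to defer it until a multimodal request actually needs it, so text-only callers and other backends never import vllm. The following is a minimal sketch under that assumption: the helper name _load_mm_inputs and the images parameter are hypothetical, and the actual serde is intentionally left out.

```python
import importlib


class VLLMBackend:
    """Sketch: the only class that would ever import vllm."""

    path = "/inference/v1/generate"

    @staticmethod
    def _load_mm_inputs():
        # Deferred import; the module path is the one quoted in this issue.
        try:
            return importlib.import_module("vllm.multimodal.inputs")
        except ImportError as exc:
            raise RuntimeError(
                "multimodal requests through VLLMBackend require vllm to be installed"
            ) from exc

    def build_body(self, model, token_ids, sampling_params, images=None):
        if images:
            mm_inputs = self._load_mm_inputs()
            # The real features serde is elided in this sketch.
            raise NotImplementedError(
                f"MM encoding via {mm_inputs.__name__} is not shown here"
            )
        # Text-only path: no vllm import is triggered.
        return {
            "model": model,
            "token_ids": list(token_ids),
            "sampling_params": dict(sampling_params),
            "features": None,
        }
```

If the signature uses backend=VLLMBackend() literally as the default, the instance is created once at import time and shared across calls, which is fine as long as the backend stays stateless.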