Following up on #1947 (sampling defaults at the model-card tier, merged 2026-04-21). I've been carrying a longer resolution chain in a fork and would like to know whether any of it fits the direction you'd want to take this.
Current upstream chain
`request params → card defaults → hardcoded fallback`
What I'm carrying on top
`request → per-instance → card → cluster env → hardcoded`
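To make the precedence concrete, here's a minimal sketch of the five-tier resolution as "first non-None wins". The names (`resolve_sampling`, the `default_*` instance fields, the `EXO_DEFAULT_*` env vars) follow this post, not the upstream signatures, and the hardcoded fallbacks are placeholders; the env tier is also read lazily here for testability, whereas the fork reads it at module import:

```python
import os

# Placeholder hardcoded fallbacks — the actual upstream values may differ.
HARDCODED = {"temperature": 1.0, "top_p": 1.0, "top_k": 0, "min_p": 0.0}

def _env_default(name: str):
    # Fork reads these once at module import; read lazily here for the sketch.
    raw = os.environ.get(f"EXO_DEFAULT_{name.upper()}")
    return float(raw) if raw is not None else None

def resolve_sampling(request: dict, instance: dict, card: dict) -> dict:
    """First non-None wins: request → per-instance → card → cluster env → hardcoded."""
    resolved = {}
    for key, fallback in HARDCODED.items():
        tiers = (
            request.get(key),                 # explicit request param
            instance.get(f"default_{key}"),   # per-instance default
            card.get(key),                    # model-card default (#1947)
            _env_default(key),                # cluster-wide env override
        )
        resolved[key] = next((v for v in tiers if v is not None), fallback)
    return resolved
```

The point is just that each tier only fills gaps left by the tiers above it, so a card default never shadows a per-instance one, and env vars never shadow either.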
Two extra tiers, plus an extension to two more penalty types and a small fix:
- Per-instance defaults — `BaseInstance` gains `default_temperature` / `default_top_p` / `default_top_k` / `default_min_p`. Use case: deploying the same model as multiple instances with different sampling personalities (e.g. one "deterministic / coding" instance and one "creative / chat" instance of the same Qwen3.5 model). Plumbs through to `resolve_sampling()` ahead of card defaults.
- Cluster env tier — `EXO_DEFAULT_TEMPERATURE` / `TOP_P` / `TOP_K` / `MIN_P`, read at module import. Use case: ops-level cluster-wide overrides without editing per-card TOMLs.
- `presence_penalty` + `repetition_penalty` in the resolver — currently only temperature/top_p/top_k/min_p are resolved through the chain. This extends the chain to the two penalties symmetrically and feeds them into `make_logits_processors`.
- `context_size` coercion fix — `repetition_context_size` / `presence_context_size` need to fall back to mlx-lm's default when not specified, and the no-op `repetition_penalty=1.0` case should skip the processor entirely.
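The last two bullets amount to a small guard in front of the processor construction. A hedged sketch — `build_penalty_kwargs` is a hypothetical helper standing in for the real call into mlx-lm's `make_logits_processors`, and the default context size of 20 is an assumption about mlx-lm's default:

```python
MLX_LM_DEFAULT_CONTEXT = 20  # assumed mlx-lm default repetition_context_size

def build_penalty_kwargs(repetition_penalty=None, repetition_context_size=None):
    """Return kwargs for the repetition-penalty processor, or None to skip it."""
    if repetition_penalty is None or repetition_penalty == 1.0:
        # A penalty of 1.0 scales logits by 1 — a pure no-op, so don't
        # build a processor at all rather than paying for it per token.
        return None
    return {
        "repetition_penalty": repetition_penalty,
        # Coerce a missing context size to the library default instead of
        # passing None through.
        "repetition_context_size": (
            repetition_context_size
            if repetition_context_size is not None
            else MLX_LM_DEFAULT_CONTEXT
        ),
    }
```

The same shape would apply symmetrically to `presence_penalty` / `presence_context_size`.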
What I'd like to know
- Per-instance tier: would you accept this, or is your design intent that model-card is the right layer and per-instance overrides should stay out of the resolver? If you'd prefer instance overrides happen via deploying with a different card, I'll drop this from any PR.
- Cluster env tier: same question, lower stakes — happy either way.
- Penalty extensions + context_size fix: I'm guessing these are uncontroversial extensions of your work. If so, I'll PR these on their own immediately regardless of the answer to the tier questions.
Happy to PR per your direction. I've been carrying these in production for ~2 weeks; they work cleanly with #1947's card tier in the middle.