Following up on #1947 (sampling defaults at the model-card tier, merged 2026-04-21). I've been carrying a longer resolution chain in a fork and would like to know whether any of it fits the direction you'd want to take this.
Current upstream chain
`request params → card defaults → hardcoded fallback`
What I'm carrying on top
`request → per-instance → card → cluster env → hardcoded`
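To make the precedence concrete, here's a minimal sketch of the five-tier resolution as "first non-None wins". The names (`resolve_sampling`, the `default_*` instance fields, the `EXO_DEFAULT_*` env vars) follow this post, not the upstream signatures, and the hardcoded fallbacks are placeholders; the env tier is also read lazily here for testability, whereas the fork reads it at module import:

```python
import os

# Placeholder hardcoded fallbacks — the actual upstream values may differ.
HARDCODED = {"temperature": 1.0, "top_p": 1.0, "top_k": 0, "min_p": 0.0}

def _env_default(name: str):
    # Fork reads these once at module import; read lazily here for the sketch.
    raw = os.environ.get(f"EXO_DEFAULT_{name.upper()}")
    return float(raw) if raw is not None else None

def resolve_sampling(request: dict, instance: dict, card: dict) -> dict:
    """First non-None wins: request → per-instance → card → cluster env → hardcoded."""
    resolved = {}
    for key, fallback in HARDCODED.items():
        tiers = (
            request.get(key),                 # explicit request param
            instance.get(f"default_{key}"),   # per-instance default
            card.get(key),                    # model-card default (#1947)
            _env_default(key),                # cluster-wide env override
        )
        resolved[key] = next((v for v in tiers if v is not None), fallback)
    return resolved
```

The point is just that each tier only fills gaps left by the tiers above it, so a card default never shadows a per-instance one, and env vars never shadow either.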
Two extra tiers, plus an extension to two more penalty types and a small fix:
- Per-instance defaults — `BaseInstance` gains `default_temperature` / `default_top_p` / `default_top_k` / `default_min_p`. Use case: deploying the same model as multiple instances with different sampling personalities (e.g. one "deterministic / coding" instance and one "creative / chat" instance of the same Qwen3.5 model). Plumbs through to `resolve_sampling()` ahead of card defaults.
- Cluster env tier — `EXO_DEFAULT_TEMPERATURE` / `TOP_P` / `TOP_K` / `MIN_P`, read at module import. Use case: ops-level cluster-wide overrides without editing per-card TOMLs.
- `presence_penalty` + `repetition_penalty` in the resolver — currently only temperature/top_p/top_k/min_p are resolved through the chain. This extends the chain to the two penalties symmetrically and feeds them into `make_logits_processors`.
- `context_size` coercion fix — `repetition_context_size` / `presence_context_size` need to fall back to mlx-lm's default when not specified, and the no-op `repetition_penalty=1.0` case should skip the processor entirely.
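The last two bullets amount to a small guard in front of the processor construction. A hedged sketch — `build_penalty_kwargs` is a hypothetical helper standing in for the real call into mlx-lm's `make_logits_processors`, and the default context size of 20 is an assumption about mlx-lm's default:

```python
MLX_LM_DEFAULT_CONTEXT = 20  # assumed mlx-lm default repetition_context_size

def build_penalty_kwargs(repetition_penalty=None, repetition_context_size=None):
    """Return kwargs for the repetition-penalty processor, or None to skip it."""
    if repetition_penalty is None or repetition_penalty == 1.0:
        # A penalty of 1.0 scales logits by 1 — a pure no-op, so don't
        # build a processor at all rather than paying for it per token.
        return None
    return {
        "repetition_penalty": repetition_penalty,
        # Coerce a missing context size to the library default instead of
        # passing None through.
        "repetition_context_size": (
            repetition_context_size
            if repetition_context_size is not None
            else MLX_LM_DEFAULT_CONTEXT
        ),
    }
```

The same shape would apply symmetrically to `presence_penalty` / `presence_context_size`.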
What I'd like to know
- Per-instance tier: would you accept this, or is your design intent that model-card is the right layer and per-instance overrides should stay out of the resolver? If you'd prefer instance overrides happen via deploying with a different card, I'll drop this from any PR.
- Cluster env tier: same question, lower stakes — happy either way.
- Penalty extensions + context_size fix: I'm guessing these are uncontroversial extensions of your work. If so, I'll PR these on their own immediately regardless of the answer to the tier questions.
Happy to PR per your direction. I've been carrying these in production for ~2 weeks; they work cleanly with #1947's card tier in the middle.