Sampling defaults: would you accept per-instance and cluster-env tiers on top of #1947? #1987

@adurham

Following up on #1947 (sampling defaults at the model-card tier, merged 2026-04-21). I've been carrying a longer resolution chain in a fork and would like to know whether any of it fits the direction you'd want to take this.

Current upstream chain

`request params → card defaults → hardcoded fallback`

What I'm carrying on top

`request → per-instance → card → cluster env → hardcoded`

Two extra tiers, plus an extension to two more penalty types and a small fix:

  1. Per-instance defaults: `BaseInstance` gains `default_temperature` / `default_top_p` / `default_top_k` / `default_min_p`. Use case: deploying the same model as multiple instances with different sampling personalities (e.g. one "deterministic / coding" instance and one "creative / chat" instance of the same Qwen3.5 model). These plumb through to `resolve_sampling()` ahead of card defaults.
  2. Cluster env tier: `EXO_DEFAULT_TEMPERATURE` / `TOP_P` / `TOP_K` / `MIN_P`, read at module import. Use case: ops-level cluster-wide overrides without editing per-card TOMLs.
  3. `presence_penalty` + `repetition_penalty` in the resolver: currently only `temperature` / `top_p` / `top_k` / `min_p` are resolved through the chain. This extends the chain to the two penalties symmetrically and feeds them into `make_logits_processors`.
  4. `context_size` coercion fix: `repetition_context_size` / `presence_context_size` need to fall back to mlx-lm's default when not specified, and the no-op `repetition_penalty=1.0` case should skip the processor entirely.
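
To make the proposal concrete, here is a minimal Python sketch of the resolution order and the no-op penalty handling described above. It is an illustration under assumptions, not the fork's code: `resolve_sampling_sketch`, `env_defaults`, `penalty_kwargs`, the `HARDCODED` values, and the context-size default of 20 are all hypothetical stand-ins, and the full env variable names (`EXO_DEFAULT_TOP_P` and so on) are my reading of the shorthand in item 2.

```python
from typing import Any, Mapping, Optional

# Illustrative fallback values only; the real hardcoded defaults may differ.
HARDCODED: dict[str, Any] = {"temperature": 0.7, "top_p": 1.0, "top_k": 0, "min_p": 0.0}


def env_defaults(environ: Mapping[str, str]) -> dict[str, Any]:
    """Parse the cluster env tier, e.g. EXO_DEFAULT_TOP_K=40 (names assumed)."""
    out: dict[str, Any] = {}
    for name in HARDCODED:
        raw = environ.get(f"EXO_DEFAULT_{name.upper()}")
        if raw is not None:
            # top_k is an integer count; the other parameters are floats.
            out[name] = int(raw) if name == "top_k" else float(raw)
    return out


def resolve_sampling_sketch(
    request: Mapping[str, Any],
    instance: Mapping[str, Any],
    card: Mapping[str, Any],
    cluster_env: Mapping[str, Any],
) -> dict[str, Any]:
    """request -> per-instance -> card -> cluster env -> hardcoded."""
    tiers = (request, instance, card, cluster_env, HARDCODED)
    resolved: dict[str, Any] = {}
    for name in HARDCODED:
        for tier in tiers:
            value = tier.get(name)
            # Compare against None so explicit values like 0.0 still win.
            if value is not None:
                resolved[name] = value
                break
    return resolved


def penalty_kwargs(
    repetition_penalty: Optional[float],
    repetition_context_size: Optional[int] = None,
    default_context_size: int = 20,  # assumed stand-in for mlx-lm's default
) -> dict[str, Any]:
    """Return kwargs for the logits-processor factory, or {} for a no-op."""
    if repetition_penalty is None or repetition_penalty == 1.0:
        return {}  # a penalty of 1.0 changes nothing, so skip the processor
    size = default_context_size if repetition_context_size is None else repetition_context_size
    return {"repetition_penalty": repetition_penalty, "repetition_context_size": size}


resolved = resolve_sampling_sketch(
    request={"top_p": 0.9},
    instance={"temperature": 0.0},
    card={"temperature": 0.6, "top_k": 20},
    cluster_env=env_defaults({"EXO_DEFAULT_TOP_K": "40", "EXO_DEFAULT_MIN_P": "0.05"}),
)
print(resolved)
```

In this example the instance's `temperature` of 0.0 beats the card's 0.6, the card's `top_k` beats the cluster env value, and `min_p` falls through to the env tier because no higher tier sets it.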

What I'd like to know

  • Per-instance tier: would you accept this, or is your design intent that model-card is the right layer and per-instance overrides should stay out of the resolver? If you'd prefer instance overrides happen via deploying with a different card, I'll drop this from any PR.
  • Cluster env tier: same question, lower stakes — happy either way.
  • Penalty extensions + context_size fix: I'm guessing these are uncontroversial extensions of your work. If so, I'll PR these on their own immediately regardless of the answer to the tier questions.

Happy to PR per your direction. I've been carrying these in production for ~2 weeks; they work cleanly with #1947's card tier in the middle.
