
Suggestion: Complementary Fixed-Data Leaderboard to Better Isolate Model Capacity #1151

@andrewmouldon

Description


Fixed-Data Leaderboard Suggestion

The Parameter Golf challenge is a really interesting problem setup, and it has already led to a lot of cool ideas. At the same time, the current constraints strongly favor throughput, making it hard for approaches that improve capacity per parameter, but run slightly slower, to be competitive.

I wanted to propose a complementary leaderboard that might better isolate parameter efficiency and broaden the space of approaches that can be explored.


Where L(N) shows up in practice

The challenge is framed as optimizing L(N) — best performance given a fixed parameter budget.

However, with the 10-minute training constraint on 8×H100s, the problem behaves much closer to an L(T) setting during training — best performance given a fixed time budget — where throughput becomes the dominant factor. As a result, much of the observable L(N) optimization tends to emerge post-training (e.g., through quantization, compression, and evaluation-time techniques), rather than during the training dynamics themselves.

One implication of this is that even strong architectural changes struggle to be competitive if they introduce any per-step overhead, however small.

A useful way to quantify this comes from @sseanliu’s post, “Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization” (#831): at current step times, each additional 1ms of overhead requires on the order of ~0.007 BPB improvement just to break even.
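To make the tradeoff concrete, here is a small illustrative sketch (with hypothetical numbers, not the measured figures from #831): under a fixed wallclock budget, per-step overhead directly reduces the number of training steps, and that lost progress is what a slower architecture must win back in quality.

```python
# Illustrative sketch of the throughput break-even tradeoff under a fixed
# wallclock budget. The step time below is hypothetical, not the measured
# value from issue #831.

WALLCLOCK_S = 600.0  # fixed 10-minute training budget


def steps_in_budget(step_time_ms: float) -> int:
    """Number of optimizer steps that fit in the wallclock budget."""
    return int(WALLCLOCK_S * 1000.0 / step_time_ms)


def fraction_of_steps_lost(base_step_ms: float, overhead_ms: float) -> float:
    """Fraction of training steps sacrificed to per-step overhead."""
    base = steps_in_budget(base_step_ms)
    slowed = steps_in_budget(base_step_ms + overhead_ms)
    return 1.0 - slowed / base


# With a hypothetical 100 ms baseline step, 1 ms of overhead costs about
# 1% of all training steps -- progress the architecture must win back.
print(fraction_of_steps_lost(100.0, 1.0))  # ≈ 0.01
```

The exact break-even BPB per millisecond depends on the real step time and the loss-vs-steps curve, which is what the measurement in #831 pins down.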

More broadly, if even well-optimized alternatives to standard transformer setups (for example, modern sequence models like Gated DeltaNet and Mamba3) are not competitive under these constraints, it suggests that the bottleneck is not necessarily model quality, but throughput.

In that setting, approaches that improve capacity per parameter but are slightly slower are effectively filtered out before they can demonstrate their benefit.

For a challenge that aims to encourage architectural innovation, this creates a difficult dynamic. So far, all top approaches remain quite close to the same underlying architectural patterns seen in the NanoGPT speedrun. Where we do see more divergence is often in post-training techniques, rather than in the training-time architecture itself. Similarly, we are already seeing increasing emphasis on systems-level optimizations (e.g., highly optimized attention implementations, kernel fusion), which are essential for time efficiency but largely orthogonal to parameter efficiency.


Why the unlimited compute track doesn’t fully solve this

The unlimited track is actually very valuable. In the limit, removing both time and data constraints allows the true capacity of an architecture or approach to emerge, making it the most direct setting for evaluating L(N).

However, in practice it introduces its own issues:

  • results become harder to compare
  • longer runs can dominate regardless of underlying model capacity
  • access to compute becomes the primary constraint

So while it is conceptually ideal, it is not equally accessible in practice, particularly for participants with limited resources.


Proposal: a fixed-data leaderboard

A complementary leaderboard could help isolate L(N) more directly:

  • Fix the dataset
  • Keep the parameter / artifact constraints
  • Relax strict wallclock limits

The key idea is to use a moderate, fixed dataset:

  • Large enough to avoid being overly data-limited, including for architectures that require slightly more data to show their gains
  • Small enough to keep total compute reasonable
  • Limited to a single pass to ensure simplicity and fairness

This acts as a practical middle ground:

  • It allows architectures to realize their gains without being dominated by throughput constraints
  • While still bounding total compute in a way that keeps the setting accessible and comparable

Concretely, this would:

  • Allow slower but more expressive architectures to be explored
  • Enable meaningful participation on personal GPUs (by trading extra wallclock time for less hardware, rather than requiring a specific 8×H100 setup)
  • Implicitly constrain total compute through dataset size, rather than wallclock limits
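One way to see how a fixed single-pass dataset bounds compute is the standard dense-transformer approximation FLOPs ≈ 6·N·D, where N is the parameter count and D the number of training tokens. A minimal sketch, with hypothetical numbers (the 4M-parameter figure assumes a roughly 16MB fp32 artifact; the dataset size is made up):

```python
# Sketch of how a fixed single-pass dataset implicitly bounds total compute,
# using the standard approximation FLOPs ≈ 6 * N * D for dense transformers
# (N = parameters, D = training tokens). All numbers below are hypothetical.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a single pass over the data."""
    return 6.0 * n_params * n_tokens


# e.g. a 4M-parameter model (roughly a 16MB fp32 artifact) trained for one
# pass over a hypothetical 1B-token dataset:
flops = training_flops(4e6, 1e9)
print(f"{flops:.2e}")  # 2.40e+16 -- fixed by N and D, not by wallclock
```

Because N is already capped by the artifact constraint and D by the dataset, total compute is bounded no matter how slowly a given architecture runs — which is exactly the property the wallclock limit currently provides, without penalizing per-step overhead.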

I’d be very interested to hear others’ thoughts on this.

And to the OpenAI team working on this challenge: thank you for your time and for creating this opportunity to participate!
