Optimize Evaluation Workflow for Better Batching and Model Reuse For benchmarks with n_repeat > 1 #125

Open
ihebchaa wants to merge 1 commit into main

Conversation

ihebchaa

The current Evalchemy evaluation workflow for benchmarks with n_repeat > 1 is highly inefficient:

# Current inefficient approach: each repeat is a full, independent evaluation run
for i in range(n_repeat):
    evaluate()  # Reloads the model and batches only this repeat's samples

Key Inefficiencies

  • Model reloading overhead: The model is loaded and unloaded n_repeat times, wasting significant time and churning GPU memory
  • Poor batching: Small benchmarks like AIME24/AIME25 (30 samples each) produce tiny batches that underutilize the GPU

Solution

Restructure the evaluation workflow to load the model once and batch across all repeats:
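A minimal sketch of the restructured flow, written here directly against vLLM; the model name, n_repeat value, and the load_benchmark_prompts helper are illustrative, not the actual Evalchemy code path:

from vllm import LLM, SamplingParams

n_repeat = 8
llm = LLM(model="my-7b-reasoning-model")        # load the model once
prompts = load_benchmark_prompts("AIME24")      # hypothetical helper returning the 30 prompts
params = SamplingParams(max_tokens=32_768)

# Duplicate the prompt list n_repeat times so every repeat lands in one large batch
outputs = llm.generate(prompts * n_repeat, params)

# Split the outputs back into per-repeat groups for scoring
per_repeat = [outputs[i * len(prompts):(i + 1) * len(prompts)] for i in range(n_repeat)]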

Key Improvements

  • Single model loading: Load model once for all evaluation repeats
  • Enhanced batching: Combine samples across repeats for larger, more efficient batches
  • Memory efficiency: Eliminate repeated model loading/unloading cycles
  • Better GPU utilization: Larger batches maximize hardware throughput

Speedup

Tests on a 7B reasoning model using AIME24 with n_repeat=8, max_new_tokens=32k, and batch_size set to n_repeat * num_samples (i.e., all 8 × 30 = 240 samples, so that vLLM receives every instance at once and handles batching internally) show nearly an 8× speedup.

@slimfrkha commented Jun 2, 2025

In the case of DP > 1: because each n_repeat uses a different seed, the batch gets split into chunks (see the collator in task.py), and each chunk is generated (after being split across the DP ranks) in a for loop.
The problem here is that:

for chunk in chunks:
    self.generate(chunk)  # -> loads the DP model replicas again for every chunk
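A rough, self-contained illustration of the effect (the real collator in task.py operates on request objects, so the names and shapes here are illustrative):

# Grouping requests by their per-repeat seed turns an n_repeat=8 run over 30 samples
# into 8 chunks of 30 instead of a single batch of 240
requests = [{"prompt": p, "seed": s} for s in range(8) for p in range(30)]
chunks = {}
for r in requests:
    chunks.setdefault(r["seed"], []).append(r)
print(len(chunks), {s: len(c) for s, c in chunks.items()})  # 8 chunks, 30 requests each
# Each chunk then goes through self.generate(chunk) separately, and with DP > 1
# the DP model replicas are brought up again for every one of these chunks.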

@neginraoof (Collaborator)

Thanks a lot @ihebchaa for the PR! Overall looks good to me, I'll look into testing and merging this.
@slimfrkha for DP > 1, do you think using a custom collator that doesn't chunk by seed would help?

@slimfrkha

> Thanks a lot @ihebchaa for the PR! Overall looks good to me, I'll look into testing and merging this. @slimfrkha for DP > 1, do you think using a custom collator that doesn't chunk by seed would help?

Yes, I think that is the way to go. Planning to open a PR to change vllm_causallms.py.
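One possible shape for that change, sketched only: group requests by their generation kwargs with the seed excluded so all repeats stay in one batch, and carry each request's seed through a per-request SamplingParams. The group_key/collate helpers below are hypothetical, not the existing vllm_causallms.py code:

from vllm import SamplingParams

def group_key(gen_kwargs: dict) -> tuple:
    # Drop the per-repeat seed from the grouping key so repeats are not split apart
    return tuple(sorted((k, str(v)) for k, v in gen_kwargs.items() if k != "seed"))

def collate(requests):
    # requests: iterable of (prompt, gen_kwargs) pairs
    groups = {}
    for prompt, gen_kwargs in requests:
        groups.setdefault(group_key(gen_kwargs), []).append((prompt, gen_kwargs))
    return groups

# Within each group, a single generate call can still honor per-request seeds:
#   prompts = [p for p, _ in group]
#   params  = [SamplingParams(seed=kw.get("seed")) for _, kw in group]
#   outputs = llm.generate(prompts, params)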
