MERA Code — the first comprehensive open benchmark for evaluating large language models (LLMs) in applied programming tasks in Russian.

MERA Code


MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks.

🚀 About

MERA Code brings together a rich collection of code-focused evaluation tasks—both private and public—under one roof. Built on top of the Language Model Evaluation Harness (v0.4.9), it enables researchers and practitioners to:

  • Compare models on identical tasks and metrics
  • Reproduce results with fixed prompts and few-shot settings
  • Submit standardized ZIP archives for leaderboard integration

🔍 Datasets Overview

| Set | Task Name | Language | Metrics | Size | Prompts | Skills |
|---|---|---|---|---|---|---|
| Private | ruCodeEval | Python | pass@k | 164 | 10 | Instruction Following, Code Perception, Completion, Algorithms & Data Structures |
| Private | RuCodeReviewer | Java, Scala, Go, Python | Judge@k, BLEU, chrF | 689 | 10 | Instruction Following, Code Perception, Review, Simulation, Explanation, Design Patterns, Style Guides |
| Private | CodeLinterEval | Python | pass@k | 110 | 10 | Instruction Following, Code Perception, Style Guides, Review, Editing |
| Public | ruHumanEval | Python | pass@k | 164 | 10 | Instruction Following, Code Perception, Completion |
| Public | StRuCom | Python, Java, Go, C#, JavaScript | chrF | 500 | 10 | Instruction Following, Code Perception, Simulation, Documentation |
| Public | UnitTests | Python, Java, Go, C#, JavaScript | CodeBLEU | 2500 | 20 | Instruction Following, Code Perception, Synthesis, Testing, Long Context Comprehension |
| Public | CodeCorrectness | Python, Java, Go | EM | 1361 | 11 | Instruction Following, Code Perception, Simulation, Error Classification |
| Public | RealCode | Python | pass@k | 802 | 10 | Instruction Following, Code Perception, Completion |
| Public | RealCodeJava | Java | pass@k | 298 | 10 | Instruction Following, Code Perception, Completion |
| Public | JavaTestGen | Java | pass@k, compile@k | 227 | 10 | Instruction Following, Code Perception, Completion, Testing |
| Public | YABLoCo | C, C++ | pass@k, EM | 208 | 11 | Instruction Following, Code Perception, Completion, Long Context Comprehension |
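Several of the tasks above report pass@k. For reference, here is a minimal sketch of the standard unbiased pass@k estimator popularized by HumanEval; it is illustrative and not necessarily the exact implementation the harness uses:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of which pass the tests) is correct."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 generations, 2 correct: pass@1 = 1 - C(3,1)/C(5,1) = 0.4
```

The `if` branch avoids `comb` underflow when fewer than k samples are incorrect, in which case every draw is guaranteed to include a passing sample.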

🛠 Getting Started

First, clone the MERA_CODE repository and initialize its submodule:

### Go to the folder where the repository will be cloned ###
mkdir mera_code
cd mera_code

### Clone & install core libs ###
git clone --recurse-submodules https://github.com/MERA-Evaluation/MERA_CODE.git
cd MERA_CODE
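If the clone succeeded, the `lm-evaluation-harness` submodule directory should exist and be non-empty. A quick sanity check (the helper name and `repo_root` parameter are ours, not part of the repo):

```python
from pathlib import Path

def submodule_ready(repo_root: str = ".") -> bool:
    """Return True if the lm-evaluation-harness submodule looks
    checked out: the directory exists and contains files."""
    sub = Path(repo_root) / "lm-evaluation-harness"
    return sub.is_dir() and any(sub.iterdir())
```

If this returns `False`, re-run `git submodule update --init --recursive` inside `MERA_CODE`.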

Now choose one of two evaluation regimes, depending on whether you want to compute metrics for the public tasks locally or use our remote scoring via the website.

Remote Scoring

Remote Scoring (default): quick setup for cloud-based scoring — install only core dependencies, run the evaluation, and submit the resulting ZIP archive to our website to get the score.

In this regime, the terminal will not show metrics even for the public datasets: each dataset reports a "bypass" placeholder instead of actual values.

Details on Remote Scoring

Install only the libraries required to obtain the model's generations (its answers to each task's queries).

bash scripts/install_dependencies.sh
How it works inside
### Install lm-eval ###
cd lm-evaluation-harness
pip install -e .

### Go to MERA_CODE folder ###
cd ../

You may also need additional libraries for model inference or evaluation. Use lm-eval-compatible libraries and versions:

### Install additional libs for models evaluation [Optional] ###
# vLLM engine
pip install -e ".[vllm]"
# API scoring
pip install -e ".[api]"

### Run evaluation and pack logs ###
bash scripts/run_evaluation.sh \
    --model vllm \
    --model_args "pretrained=Qwen/Qwen2.5-0.5B-Instruct,tensor_parallel_size=1" \
    --output_path "./results/Qwen2.5-0.5B-Instruct"

Local Scoring

Local Scoring (optional): full setup for on-premise evaluation — install extra dependencies with metrics and run Docker containers. Available only for Public sets.

Ensure you have a stable internet connection, sufficient disk space, and adequate CPU resources.

Details on Local Scoring

Evaluating RealCode, RealCodeJava, and JavaTestGen involves running hundreds of Docker containers; YABLoCo also requires a significant amount of resources and time.

Running the evaluation from inside a Docker container is not recommended: in a Docker-in-Docker setup, the integrity of local scoring is not guaranteed.

Even without the Docker-in-Docker issue, running short on resources means that while you would still get metrics, they would be lower than those computed in an environment with adequate resources.
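Given these warnings, a rough pre-flight check of disk and CPU before starting local scoring can save a wasted run. The thresholds below are illustrative assumptions, not official requirements:

```python
import os
import shutil

def preflight(path: str = ".", min_free_gb: float = 50, min_cpus: int = 8) -> dict:
    """Report free disk space and CPU count; thresholds are illustrative."""
    free_gb = shutil.disk_usage(path).free / 1e9
    cpus = os.cpu_count() or 1
    return {"free_gb": round(free_gb, 1), "cpus": cpus,
            "ok": free_gb >= min_free_gb and cpus >= min_cpus}
```

Run it from the `MERA_CODE` folder and pick thresholds that match the tasks you intend to score.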

bash scripts/install_dependencies.sh --local_scoring
How it works inside
# Install code_bleu metric for UnitTests
git clone https://github.com/Pstva/code_bleu.git
cd code_bleu
pip install -e .

# Install metrics for YABLoCo
cd ..
mkdir workspace
cd workspace
git clone -b mera_code https://github.com/yabloco-codegen/yabloco-benchmark

Now proceed to the evaluation, but add the --compute_metrics flag, which enables local metric computation.

### Run evaluation and pack logs ###
bash scripts/run_evaluation.sh \
    --model hf \
    --model_args "pretrained=Qwen/Qwen2.5-0.5B-Instruct,dtype=bfloat16" \
    --compute_metrics \
    --output_path "./results/Qwen2.5-0.5B-Instruct"

For more details on run_evaluation.sh usage, run:

bash scripts/run_evaluation.sh --help

📁 Repository Structure

MERA_CODE/
├── code_tasks/                     # Code for each task
├── datasets/                       # Task descriptions, metadata, readme
├── docs/                           # Additional documentation and design notes
│   ├── templates/                  # Templates for task readmes
│   ├── dataset_contribution.md     # How to add a new dataset to MERA Code
│   ├── dataset_criteria.md         # Criteria for adding a new dataset to MERA Code
│   ├── dataset_formatting.md       # Dataset formatting requirements
│   ├── dataset_hf.md               # How to add new datasets to the MERA HuggingFace page
│   ├── dataset_review.md           # General dataset requirements
│   ├── model_scoring.md            # How to use lm-eval to evaluate the LMs
│   ├── task_codebase.md            # How to add a new task to the codebase
│   └── MERA_code_tax.png           # Taxonomy of coding skills
├── lm-evaluation-harness/          # Submodule (codebase)
└── scripts/                        # Helpers: add tasks, run evaluations, and scoring

💪 How to Join the Leaderboard

Follow these steps to see your model on the Leaderboard:

  1. Run Remote Scoring   Evaluate the benchmark in the Remote Scoring regime (see 🛠 Getting Started above). You may run Local Scoring instead, but then you will wait for scoring twice: once locally and again when the submission is scored on the server.

You'll end up with a logs folder and a ready-to-submit ZIP archive like Qwen2.5-0.5B-Instruct_submission.zip.

  2. Submit on the website   Head over to Create Submission, upload the archive, and move on to the form.

  3. Fill in Model Details   Provide accurate information about the model and evaluation. These details are crucial for reproducibility: if something is missing, administrators may ping you (or your Submission might be rejected).

  4. Wait for Scoring ⏳   Scoring usually wraps up in ~2 hours, and a progress bar lets you track the process.

Keep in mind that if you submit more than one archive, they are scored sequentially, not in parallel.

  5. Publish your result   Once scoring finishes, click "Submit for moderation". After approval, your model goes Public and appears on the Leaderboard.
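Before uploading the archive, you can sanity-check that it is a readable, uncorrupted ZIP. This sketch uses only the Python standard library; the expected internal layout of the archive is not assumed here:

```python
import zipfile

def check_submission(path: str) -> list[str]:
    """Open the archive, verify every entry's CRC, and return its file list."""
    with zipfile.ZipFile(path) as zf:
        corrupt = zf.testzip()  # None when all CRC checks pass
        if corrupt is not None:
            raise ValueError(f"corrupt entry: {corrupt}")
        return sorted(zf.namelist())
```

For example, `check_submission("results/Qwen2.5-0.5B-Instruct_submission.zip")` should list the packed log files without raising.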

Good luck, and happy benchmarking! 🎉

🤝 Contributing

We are interested in improving MERA Code and invite the community to contribute new, challenging tasks as well as improvements to the project's codebase.

Steps to Add a New Task:  

  1. Develop a dataset (on the contributor's side; see task requirements)  
  2. Convert the dataset to MERA format (guide)  
  3. Upload the dataset to 🤗HF Hub (guide)  
  4. Submit the dataset for MERA organizer review (guide)  
  5. Write evaluation code using lm-harness (guide)  
  6. Benchmark state-of-the-art baseline models on the dataset
  7. Final moderation, and your dataset is officially added!    

📝 License

Distributed under the MIT License. See LICENSE for details.

📑 Cite as

@misc{chervyakov2025meracodeunifiedframework,
      title={MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks}, 
      author={Artem Chervyakov and 
        Alexander Kharitonov and 
        Pavel Zadorozhny and 
        Adamenko Pavel and 
        Rodion Levichev and 
        Dmitrii Vorobev and 
        Dmitrii Salikhov and 
        Aidar Valeev and 
        Alena Pestova and 
        Maria Dziuba and 
        Ilseyar Alimova and 
        Artem Zavgorodnev and 
        Aleksandr Medvedev and 
        Stanislav Moiseev and 
        Elena Bruches and 
        Daniil Grebenkin and 
        Roman Derunets and 
        Vikulov Vladimir and 
        Anton Emelyanov and 
        Dmitrii Babaev and 
        Vladimir V. Ivanov and 
        Valentin Malykh and 
        Alena Fenogenova},
      year={2025},
      eprint={2507.12284},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2507.12284}, 
}

Read the paper on arXiv
