APTBench is a benchmark tailored for evaluating the agent-related capabilities of base LLMs. It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios: software engineering and deep research. Compared with existing general-purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent, while remaining significantly more lightweight and cost-effective than full-scale, end-to-end agent evaluations run after post-training.
Install the requirements with pip:

```bash
pip install -r requirements.txt
```
To run model evaluation, first add your model path to `config/model2path.json`, then follow these steps for a running example:
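`config/model2path.json` maps a model name (the `model_name` used by the test scripts) to its local checkpoint path. A minimal sketch of the expected shape — the key and path below are hypothetical and must be adjusted to your setup:

```json
{
  "Qwen2.5-7B-Base": "/data/models/Qwen2.5-7B-Base"
}
```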
1. Testing models using vLLM offline batch mode (all models included except the huge MoEs: DSv3(3.1)/GLM-4.5/Kimi-K2):

```bash
cd code
bash test_all_vllm_local.sh
```

To test a single model with vLLM:

```bash
cd code
bash test_tasks_vllm.sh [model_name]
```

To test a single model with HuggingFace transformers:

```bash
cd code
bash test_tasks_hf.sh [model_name]
```

`model_name` here is the key in `config/model2path.json`.
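To evaluate several models back to back, the per-model script can be driven from a small wrapper. The model names below are hypothetical keys from `config/model2path.json`, and the loop is shown in dry-run form (printing the commands instead of executing them):

```python
# Hypothetical model_name keys; replace with the keys in your config/model2path.json.
models = ["Qwen2.5-7B-Base", "Llama-3.1-8B-Base"]

# Build one test command per model; in a real run, execute each from the
# code/ directory, e.g. via subprocess.run(cmd, shell=True, check=True).
cmds = [f"bash test_tasks_vllm.sh {m}" for m in models]
for c in cmds:
    print(c)
```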
2. Testing the huge MoEs via an SGLang API service: for DSv3(3.1)/GLM-4.5/Kimi-K2, we use SGLang to deploy an API service. The SGLang API scripts are in the `sglang_start_scripts` folder. Take `start_0.sh` as an example: `--model-path` is the local path of the model, `--served-model-name` corresponds to the names in `config/model2path.json`, and `--dist-init-addr` is the IP of the node that runs the script.
On node 0/1, run the corresponding script as:

```bash
bash start_0/1.sh [path] [name] [ip]
```
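The two-node launch above can be sketched as simple command assembly. The concrete path, name, and IP are placeholders: they must match your checkpoint location, the key in `config/model2path.json`, and the rendezvous node's address, respectively:

```python
# Hypothetical values for illustration only.
path = "/data/models/DeepSeek-V3-Base"   # --model-path
name = "DeepSeek-V3-Base"                # --served-model-name
ip = "10.0.0.1"                          # --dist-init-addr

# One launch command per node; node i runs start_i.sh with the same arguments.
cmds = [f"bash start_{node}.sh {path} {name} {ip}" for node in (0, 1)]
for c in cmds:
    print(c)
```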
For more details, refer to the SGLang documentation for DeepSeek-V3.
After the API service is up, test DSv3(3.1)/GLM-4.5/Kimi-K2 with `test_tasks_sglang_api.sh`:

```bash
bash test_tasks_sglang_api.sh [model_name]
```

`model_name` is selected from {"DeepSeek-V3-Base", "DeepSeek-V3.1-Base", "GLM-4.5-Base", "Kimi-K2-Base"}.
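SGLang exposes an OpenAI-compatible API, so a quick smoke test of the deployed service can be sketched as a plain completion request. The host, port, and prompt below are assumptions (SGLang's default port is 30000), and base models use the `/v1/completions` route rather than the chat route:

```python
import json

# Hypothetical service address; adjust to the node and port you launched on.
base_url = "http://10.0.0.1:30000/v1/completions"

# A plain-text completion request suits base (non-chat) models.
payload = {
    "model": "DeepSeek-V3-Base",   # must match --served-model-name
    "prompt": "def fibonacci(n):",
    "max_tokens": 64,
    "temperature": 0.0,
}

# Serialize the request body for use with e.g. curl or requests.post.
body = json.dumps(payload)
print(body)
```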
The evaluation results of the open-sourced base LLMs are shown in the following figures.
Figure 1: APTBench-SWE
Figure 2: APTBench-DR
This project is licensed under an open source license. See the LICENSE file for details.
- Thanks to the SWE-Smith, InfoDeepSeek, Agentless, DeepResearchBench, and ResearchyQuestions projects for part of the seed data.
```bibtex
@misc{qin2025aptbench,
      title={APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training},
      author={Jiarui Qin and Yunjia Xi and Junjie Huang and Renting Rui and Di Yin and Weiwen Liu and Yong Yu and Weinan Zhang and Xing Sun},
      year={2025},
      eprint={2510.24397},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.24397},
}
```
