APTBench is a benchmark tailored for evaluating the agent-related capabilities of base LLMs. It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios: software engineering and deep research. Compared with existing general-purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent, while remaining significantly more lightweight and cost-effective than full-scale, end-to-end agent evaluations run after post-training.
Install the requirements with pip:

```bash
pip install -r requirements.txt
```
To run model evaluation, first add your model path to `config/model2path.json`, then follow these steps for a running example:
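`config/model2path.json` maps a model name (the `model_name` used by the test scripts) to its local checkpoint path. A minimal sketch of the expected shape — the key and path below are hypothetical and must be adjusted to your setup:

```json
{
  "Qwen2.5-7B-Base": "/data/models/Qwen2.5-7B-Base"
}
```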
1. Testing models using vLLM offline batch mode (all models included except the huge MoEs: DSv3(3.1)/GLM-4.5/Kimi-K2):

```bash
cd code
bash test_all_vllm_local.sh
```

To test a single model with vLLM:

```bash
cd code
bash test_tasks_vllm.sh [model_name]
```

To test a single model with HuggingFace transformers:

```bash
cd code
bash test_tasks_hf.sh [model_name]
```

`model_name` here is the key in `config/model2path.json`.
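To evaluate several models back to back, the per-model script can be driven from a small wrapper. The model names below are hypothetical keys from `config/model2path.json`, and the loop is shown in dry-run form (printing the commands instead of executing them):

```python
# Hypothetical model_name keys; replace with the keys in your config/model2path.json.
models = ["Qwen2.5-7B-Base", "Llama-3.1-8B-Base"]

# Build one test command per model; in a real run, execute each from the
# code/ directory, e.g. via subprocess.run(cmd, shell=True, check=True).
cmds = [f"bash test_tasks_vllm.sh {m}" for m in models]
for c in cmds:
    print(c)
```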
2. Testing the huge MoEs via an SGLang API service: for DSv3(3.1)/GLM-4.5/Kimi-K2, we use SGLang to deploy an API service. The SGLang API scripts are in the `sglang_start_scripts` folder. Take `start_0.sh` as an example: `--model-path` is the local path of the model, `--served-model-name` corresponds to the names in `config/model2path.json`, and `--dist-init-addr` is the IP of the node that runs the script.
On node 0/1, run the corresponding script as:

```bash
bash start_0/1.sh [path] [name] [ip]
```
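The two-node launch above can be sketched as simple command assembly. The concrete path, name, and IP are placeholders: they must match your checkpoint location, the key in `config/model2path.json`, and the rendezvous node's address, respectively:

```python
# Hypothetical values for illustration only.
path = "/data/models/DeepSeek-V3-Base"   # --model-path
name = "DeepSeek-V3-Base"                # --served-model-name
ip = "10.0.0.1"                          # --dist-init-addr

# One launch command per node; node i runs start_i.sh with the same arguments.
cmds = [f"bash start_{node}.sh {path} {name} {ip}" for node in (0, 1)]
for c in cmds:
    print(c)
```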
For more details, refer to the SGLang documentation for DeepSeek-V3.
After the API service is up, test DSv3(3.1)/GLM-4.5/Kimi-K2 with `test_tasks_sglang_api.sh`:

```bash
bash test_tasks_sglang_api.sh [model_name]
```

`model_name` is selected from {"DeepSeek-V3-Base", "DeepSeek-V3.1-Base", "GLM-4.5-Base", "Kimi-K2-Base"}.
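SGLang exposes an OpenAI-compatible API, so a quick smoke test of the deployed service can be sketched as a plain completion request. The host, port, and prompt below are assumptions (SGLang's default port is 30000), and base models use the `/v1/completions` route rather than the chat route:

```python
import json

# Hypothetical service address; adjust to the node and port you launched on.
base_url = "http://10.0.0.1:30000/v1/completions"

# A plain-text completion request suits base (non-chat) models.
payload = {
    "model": "DeepSeek-V3-Base",   # must match --served-model-name
    "prompt": "def fibonacci(n):",
    "max_tokens": 64,
    "temperature": 0.0,
}

# Serialize the request body for use with e.g. curl or requests.post.
body = json.dumps(payload)
print(body)
```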
The evaluation results of the open-sourced base LLMs are shown in the following figures.
Figure 1: APTBench-SWE
Figure 2: APTBench-DR
This project is licensed under an open source license. See the LICENSE file for details.
- Thanks to the SWE-Smith, InfoDeepSeek, Agentless, DeepResearchBench, and ResearchyQuestions projects for part of the seed data.
```bibtex
@misc{qin2025aptbench,
      title={APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training},
      author={Jiarui Qin and Yunjia Xi and Junjie Huang and Renting Rui and Di Yin and Weiwen Liu and Yong Yu and Weinan Zhang and Xing Sun},
      year={2025},
      eprint={2510.24397},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.24397},
}
```
