\*Equal contribution | ¹Carnegie Mellon University | ²Meta AI
In this paper, we first point out the insufficiencies of existing long-context LLM evaluations, highlighting:
- Lack of reasoning complexity: most tasks rely on text retrieval, summarization, or QA.
- Lack of context length: some tasks are inherently short-context but are inflated to long-context by injecting semantically irrelevant noise.
- Lack of scalability: tasks with high reasoning complexity and high information density do exist, but they require substantial human effort to gather, deduplicate, and verify. The resulting shortage in quantity makes them hard to adopt widely in the community.
Problem Statement: How can we develop a benchmark that contains sufficient problems at every fine-grained level of reasoning difficulty, from easy retrieval tasks to infinitely hard challenges, while providing infinitely customizable context length with high information density?
GSM-Infinite is a completely synthetic reasoning benchmark that generates problems with infinitely scalable context length and reasoning complexity. Unlike existing benchmarks that rely on text retrieval or summarization, GSM-Infinite creates high information density tasks that can only be solved by long-context LLMs, not by RAG systems.
- 🔄 Infinitely Scalable: Generate problems of any context length and reasoning complexity
- 🧮 High Information Density: Every token matters - RAG systems cannot solve these problems
- 🎯 Three Difficulty Levels: Symbolic, Medium, and Hard subsets
- 📊 Comprehensive Evaluation: Built-in evaluation scripts and leaderboards
- 🔬 Synthetic Generation: No LLMs in the loop, ensuring unbiased benchmarks
Traditional long-context benchmarks can often be solved by RAG systems, making them insufficient for evaluating true long-context reasoning. GSM-Infinite addresses this by:
- High Information Density: Every part of the context is essential
- Reasoning Complexity: Requires multi-step mathematical reasoning
- Infinite Scalability: Generate unlimited test cases at any difficulty
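As a toy illustration of these properties (our own sketch, not the repository's actual generator), the snippet below builds a chain of arithmetic assignments in which every line feeds the final query. Reasoning depth and context length grow together with `num_ops`, and no line can be dropped without changing the answer, which is what "high information density" means in practice:

```python
import random

def make_problem(num_ops: int, seed: int = 0):
    """Toy generator: a chain of assignments where each variable
    depends on the previous one, so every line is load-bearing."""
    rng = random.Random(seed)
    lines = [f"v0 = {rng.randint(1, 9)}"]
    value = int(lines[0].split("=")[1])
    for i in range(1, num_ops + 1):
        delta = rng.randint(1, 9)
        op = rng.choice(["+", "-"])
        lines.append(f"v{i} = v{i-1} {op} {delta}")
        value = value + delta if op == "+" else value - delta
    # Return the problem text, the queried variable, and the ground truth.
    return "\n".join(lines), f"v{num_ops}", value

problem, query, answer = make_problem(num_ops=5)
print(problem)
print(f"What is {query}?")
```

Because every assignment lies on the dependency chain of the queried variable, retrieving a few "relevant" lines (as a RAG system would) cannot recover the answer; the whole context must be processed.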
```bash
# Clone the repository
git clone https://github.com/Infini-AI-Lab/gsm_infinite.git
cd gsm_infinite

# Install dependencies
pip install -r requirements.txt
# or
pip install -e .
```
- Configure your setup by editing `gsm-infinite/config.sh`:

  ```bash
  # Set your API configuration
  backend_type='openai' # or 'gemini', 'anthropic'
  SAMPLER_OPENAI_BASE_URL='your_api_url'
  SAMPLER_OPENAI_API_KEY='your_api_key'

  # Configure model and dataset
  model_name='your_model_name'
  save_name='your_save_name'
  ```
- Run the evaluation:

  ```bash
  cd gsm-infinite
  bash run.sh
  ```

  Results are stored in `gsm-infinite/results`.
- View results with the interactive dashboard:

  ```bash
  streamlit run app.py
  ```
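The evaluation scripts in `pred/` compare each model response against the ground-truth answer. As a minimal sketch of the typical final-number extraction step used when scoring math benchmarks (a hypothetical simplification, not the repo's actual parser):

```python
import re

def extract_final_answer(response: str):
    """Return the last number mentioned in a model response,
    a common heuristic for scoring math benchmarks."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(matches[-1]) if matches else None

print(extract_final_answer("Step 1: 12 + 30 = 42. The answer is 42."))  # -> 42.0
```

Taking the last number is a deliberately simple convention; real evaluators often also strip units and handle answer tags, but the idea is the same.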
```
gsm_infinite/
├── gsm-infinite/          # Main package
│   ├── app.py             # Streamlit results viewer
│   ├── config.sh          # Configuration file
│   ├── run.sh             # Main execution script
│   ├── preprocess.py      # Data preprocessing
│   ├── data/              # Data generation modules
│   │   ├── symbolic/      # Symbolic dataset generation
│   │   └── realistic/     # Medium/Hard dataset generation
│   └── pred/              # Prediction and evaluation scripts
├── docs/                  # Detailed documentation
├── static/                # Web assets and images
├── requirements.txt       # Python dependencies
└── pyproject.toml         # Package configuration
```
GSM-Infinite provides three types of datasets:
| Dataset | Description | Context Length |
|---|---|---|
| Symbolic | Abstract mathematical operations | 0-32K+ tokens |
| Medium | Realistic problems with implicit relationships involving at most 2 entities | 0-32K+ tokens |
| Hard | Realistic problems with implicit relationships involving at most 3 entities | 0-32K+ tokens |
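To make the Symbolic vs. Medium/Hard distinction concrete, the toy sketch below (our own illustration, with a made-up bakery template, not the repo's generator) renders the same underlying operation in abstract symbolic form and in a "realistic" templated form that wraps the arithmetic in natural-language entities:

```python
def render_symbolic(a: int, b: int) -> str:
    """Abstract form: bare variable assignments, as in the Symbolic subset."""
    return f"x = {a}\ny = x + {b}\nWhat is y?"

def render_realistic(a: int, b: int) -> str:
    """Templated natural-language form, in the spirit of Medium/Hard."""
    return (f"The bakery had {a} loaves in the morning. "
            f"It baked {b} more before noon. "
            "How many loaves does the bakery have now?")

print(render_symbolic(12, 30))
print(render_realistic(12, 30))  # both resolve to the same sum
```

The Medium and Hard subsets go further by chaining many such templated operations and hiding relationships between entities implicitly, which is what raises their reasoning difficulty.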
For detailed information, please refer to our comprehensive documentation:
- 📖 Installation Guide - Detailed setup instructions
- 🚀 Usage Guide - Complete usage examples
- 🏆 Leaderboards - Current model rankings
Our benchmark reveals significant differences in long-context reasoning capabilities across models. See our leaderboards for the latest results.
For complete results and analysis, visit our paper and leaderboard.
If you use GSM-Infinite in your research, please cite our paper:
```bibtex
@misc{zhou2025gsminfinitellmsbehaveinfinitely,
  title={GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?},
  author={Yang Zhou and Hongyi Liu and Zhuoming Chen and Yuandong Tian and Beidi Chen},
  year={2025},
  eprint={2502.05252},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.05252},
}
```
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📧 Contact: [email protected]