
LLM Chess: Benchmarking Reasoning and Instruction-Following in LLMs

Leaderboard · Paper

LLM Chess is a benchmark that evaluates Large Language Models (LLMs) on their reasoning and instruction-following abilities in an agentic setting. LLMs engage in multi-turn dialogs to play chess against opponents like a Random Player or the Komodo Dragon chess engine. This setup tests both strategic reasoning (chess skill) and protocol adherence (sustained interaction without errors).

Key insights from the benchmark:

  • Early models (2024) struggled with basic instruction following, often hallucinating illegal moves or failing to sustain the dialog.
  • Advanced reasoning models (e.g., o1, o3, o4-mini) in 2025 saturated random-based evaluations, prompting the addition of Dragon as a stronger opponent for Elo anchoring.
  • Metrics separate chess skill (Win/Loss, Elo) from durability (Game Duration), revealing trade-offs in model capabilities.

See the live leaderboard for rankings and the NeurIPS FoRLM 2025 paper for full details.


Installation and Setup

  1. Clone the repository:

    git clone https://github.com/maxim-saplin/llm_chess.git
    cd llm_chess
    
  2. Create a virtual environment (recommended):

    # Using uv (fast alternative to pip)
    uv venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
    # Or using pip/venv
    python -m venv .venv
    source .venv/bin/activate
    
  3. Install dependencies:

    uv pip install -r requirements.txt  # Or pip install -r requirements.txt
    
  4. Configure LLMs:

    • Copy .env.sample to .env and add your API keys.
    • Suffixes like _W (white) and _B (black) distinguish configs for multi-LLM setups.
    • Supports Azure OpenAI, OpenAI, Anthropic, Google, Groq, and local models via Autogen.
    • For local models, ensure Ollama or LM Studio is running.
  5. Chess Engines (optional, for stronger opponents):

    • Komodo Dragon: Download binaries from komodochess.com and place in dragon/. Set llm_chess.dragon_path.
    • Stockfish: Install via brew install stockfish (macOS) or equivalent. Set llm_chess.stockfish_path (default: /opt/homebrew/bin/stockfish).
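
If the engines live elsewhere, both paths can be set in code before starting a game. A minimal sketch (the attribute names come from step 5 above; the Dragon binary filename is an assumption):

    # Attribute names from the setup steps above; "dragon/dragon" is a hypothetical
    # binary name -- use whatever komodochess.com ships for your platform.
    import llm_chess

    llm_chess.dragon_path = "dragon/dragon"
    llm_chess.stockfish_path = "/opt/homebrew/bin/stockfish"  # macOS/Homebrew default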

Running Games

Single Game

Run a single chess simulation:

python llm_chess.py
  • Default: Random Player (white) vs. LLM (black).
  • Logs saved to _logs/ with JSON details and optional video recordings.

Multiple Games

For benchmarking, run multiple simulations:

python run_multiple_games.py
  • Default: 42 games.
  • Customize in the script:
    • NUM_REPETITIONS: Number of games (e.g., 30+ for reliable stats).
    • LOG_FOLDER: Output directory (e.g., _logs/random_vs_llm/).
    • STORE_INDIVIDUAL_LOGS: Set to False for aggregate JSON only.
  • Aggregates results in aggregate_results.json and individual logs in {timestamp}.json.
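
A benchmarking run might set the constants listed above like this (names from this README; values illustrative):

    # In run_multiple_games.py
    NUM_REPETITIONS = 30                  # 30+ games for reliable stats
    LOG_FOLDER = "_logs/random_vs_llm"    # where the run's output lands
    STORE_INDIVIDUAL_LOGS = False         # keep only aggregate_results.json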

Game Rules

  • Players: Random (white) vs. LLM (black) by default. Also supports LLM vs. LLM and engine vs. LLM.
  • Constraints:
    • Max 200 moves (100 per player).
    • Max 10 turns per LLM move (user/assistant pairs).
    • Max 3 mistakes per dialog (illegal moves or invalid actions); exceeding the limit is scored as an LLM loss.
  • Outcomes:
    • Win/Loss: Checkmate or opponent errors/timeouts.
    • Draw: Max moves reached, stalemate, insufficient material, repetition, or 75-move rule.
    • Errors: Programmatic issues default to a draw; games hit by API throttling or model failures are reviewed manually and either discarded or scored as an LLM loss.
  • Games use UCI notation for moves and Unicode boards for visualization.
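
The python-chess library (listed under References) provides both pieces of that last point; a quick illustration:

    import chess

    board = chess.Board()
    board.push_uci("e2e4")                 # moves are exchanged as UCI strings
    print(board.unicode())                 # Unicode board, like the one shown to the LLM
    print(sorted(m.uci() for m in board.legal_moves)[:5])  # legal replies in UCI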

Configurations

Edit the globals in llm_chess.py, or override them from run_multiple_games.py:

  • white_player_type / black_player_type: RANDOM_PLAYER, LLM, CHESS_ENGINE_DRAGON, CHESS_ENGINE_STOCKFISH.
  • enable_reflection: Enable "reflect" action for strategic thinking (extra tokens).
  • use_fen_board: Use FEN notation instead of Unicode board (default: False).
  • max_game_moves: Max moves (default: 200).
  • Per-move LLM limits:
    • max_llm_turns: Max dialog turns (default: 10).
    • max_failed_attempts: Max errors before loss (default: 3).
  • throttle_delay_moves: API delay (default: 1s) to avoid rate limits.
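
Put together, an override of the defaults above might look like this sketch (names from this README; whether the player constants are strings or enum members depends on llm_chess.py):

    white_player_type = "CHESS_ENGINE_DRAGON"  # or RANDOM_PLAYER / LLM / CHESS_ENGINE_STOCKFISH
    black_player_type = "LLM"
    enable_reflection = False   # extra "reflect" action, costs tokens
    use_fen_board = False       # FEN string instead of the Unicode board
    max_game_moves = 200        # hard cap; 100 per player
    max_llm_turns = 10          # dialog turns per move
    max_failed_attempts = 3     # mistakes before the LLM forfeits
    throttle_delay_moves = 1    # seconds between moves, to avoid rate limits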

Agents

  • LLM Agent: Autogen ConversableAgent for dialog-based moves. Prompts guide actions: get_current_board, get_legal_moves, make_move <UCI>.
  • Random Agent: Custom; requests the legal moves and picks one at random. Plays white by default.
  • Proxy Agent: Custom AutoReplyAgent; orchestrates dialogs, provides board/moves.
  • Chess Engines:
    • Dragon: Elo-rated. Binaries in dragon/.
      • Level 1: 250 Elo
      • Level 2: 375 Elo
      • Level 3: 500 Elo
      • Level 4: 625 Elo
      • Level 5: 750 Elo
      • Level 6: 875 Elo
      • Level 7: 1000 Elo
      • Formula: Elo = 125 × (level + 1) (see the sketch after this list)
    • Stockfish: Strong engine; install separately.
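
The level-to-Elo mapping is linear, so it is easy to sanity-check:

    def dragon_elo(level: int) -> int:
        """Elo anchor for a Komodo Dragon skill level (formula from this README)."""
        return 125 * (level + 1)

    assert dragon_elo(1) == 250
    assert dragon_elo(7) == 1000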

Processing Logs

Each game writes a JSON log to _logs/. Aggregate and refine them with:

python data_processing/get_refined_elo.py
  • Handles multiple directories (Random vs. LLM, Dragon vs. LLM).
  • Computes Elo (anchored to Dragon levels: Elo ≈ 125 × (level + 1)), Win/Loss %, Game Duration %.
  • Filters low-sample models; supports overrides/aliases.
  • Output: CSV with player stats, usage (tokens/cost), interruptions.

Manual review: Check logs for API errors (discard) vs. model failures (LLM loss).

Metrics

From refined CSV/leaderboard:

  • Elo: Estimated rating (±95% CI), anchored to Dragon/chess.com. Combines Random/Dragon data.
  • Win/Loss: (Wins - Losses) / Total, rescaled to 0-100% (50% = equal wins and losses; see the sketch after this list). Blends chess skill with instruction following.
  • Game Duration: % of max moves completed (0-100%). Measures dialog stability (100% = no interruptions).
  • Tokens: Completion tokens per move. Indicates verbosity/efficiency.
  • Other: Mistakes/1000 moves, cost/game, material diff, interruptions.
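
One reading of the Win/Loss definition that matches its stated 0-100% range and 50% midpoint is a rescaled win-loss margin; a sketch (the leaderboard's exact normalization is not spelled out here):

    def win_loss_pct(wins: int, losses: int, total: int) -> float:
        """Rescale (wins - losses) / total from [-1, 1] to [0, 100]; 50.0 = balanced."""
        return 50.0 + 50.0 * (wins - losses) / total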

Primary sort: Elo (DESC), then Win/Loss (DESC), Duration (DESC), Tokens (ASC). Dragon-tested models marked with *.

Matrix View (in the leaderboard): Win Rate (skill) vs. Game Duration (instruction following) for 2D clustering.

Sample Dialog (Single Move)

Example with GPT-4o Mini (Black):

Proxy (to Player_Black):
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.

Player_Black (to Proxy):
get_current_board

Proxy (to Player_Black):
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖


Player_Black (to Proxy):
get_legal_moves

Proxy (to Player_Black):
a7a6, b7b6, c7c6, d7d6, e7e6, f7f6, g7g6, h7h6, a7a5, b7b5, c7c5, d7d5, e7e5, f7f5, g7g5, h7h5

Player_Black (to Proxy):
make_move e7e5

Proxy (to Player_Black):
Move made, switching player

References

  • Libraries: chess (board rules), Autogen (agents/dialogs), Stockfish/Dragon (engines).
  • Data: Logs in _logs/; analysis in data_processing/ and analysis_logs/.
  • Notes/Changelog: docs/notes.md for updates, model tiers, and insights.
  • License: MIT (see LICENSE).
  • Contribute: Fork, PR improvements to setup, agents, or analysis.

For issues or questions, open a GitHub issue.
