Benchmark Run: 12 Models on wp-core-v1

## WP-Bench Test Run

Quick test of the benchmark harness against 12 models using the `wp-core-v1` dataset.

### Results

| Model | Knowledge | Correctness | Overall |
|-------|-----------|-------------|---------|
| claude-sonnet-4-5-20250929 | 88.1% | 47.9% | **45.6%** |
| gpt-5.2 | 90.5% | 44.4% | **44.9%** |
| deepseek/deepseek-reasoner | 83.3% | 48.6% | **44.4%** |
| gpt-5-mini | 83.3% | 43.8% | 42.5% |
| xai/grok-4-1-fast-reasoning | 85.7% | 41.7% | 42.4% |
| claude-opus-4-5-20251101 | 71.4% | 50.0% | 41.4% |
| gemini/gemini-3-flash-preview | 71.4% | 47.9% | 40.6% |
| deepseek/deepseek-chat | 71.4% | 46.5% | 40.0% |
| xai/grok-4-1-fast-non-reasoning | 76.2% | 41.7% | 39.5% |
| groq/llama-3.3-70b-versatile | 81.0% | 35.4% | 38.5% |
| gpt-3.5-turbo | 73.8% | 27.1% | 33.0% |
| groq/llama-3.1-8b-instant | 76.2% | 20.8% | 31.2% |

**Dataset:** wp-core-v1 (42 knowledge + 24 execution tests)
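
As a sanity check on the table, the Knowledge percentages can be converted back into whole-number counts out of the 42 knowledge tests. The sketch below assumes each knowledge test is scored pass/fail, which this issue does not state; the names `knowledge_pct` and `implied_correct` are illustrative and are not part of the harness.

```python
# Knowledge scores (percent) copied from the results table.
KNOWLEDGE_TESTS = 42  # from the dataset line: 42 knowledge tests

knowledge_pct = {
    "claude-sonnet-4-5-20250929": 88.1,
    "gpt-5.2": 90.5,
    "deepseek/deepseek-reasoner": 83.3,
    "gpt-5-mini": 83.3,
    "xai/grok-4-1-fast-reasoning": 85.7,
    "claude-opus-4-5-20251101": 71.4,
    "gemini/gemini-3-flash-preview": 71.4,
    "deepseek/deepseek-chat": 71.4,
    "xai/grok-4-1-fast-non-reasoning": 76.2,
    "groq/llama-3.3-70b-versatile": 81.0,
    "gpt-3.5-turbo": 73.8,
    "groq/llama-3.1-8b-instant": 76.2,
}


def implied_correct(pct: float, total: int = KNOWLEDGE_TESTS) -> int:
    """Nearest whole number of passed tests for a reported percentage.

    Assumption: tests are pass/fail, so the count must be an integer.
    """
    return round(pct / 100 * total)


implied = {model: implied_correct(pct) for model, pct in knowledge_pct.items()}

for model, passed in implied.items():
    # Recompute the percentage from the count to confirm it matches the table.
    recomputed = round(100 * passed / KNOWLEDGE_TESTS, 1)
    print(f"{model}: {passed}/{KNOWLEDGE_TESTS} ({recomputed}%)")
```

Under that assumption every reported Knowledge score maps cleanly onto an integer count. The Correctness scores do not map onto whole counts out of 24 in the same way (47.9% of 24 is 11.5), which suggests the execution tests allow partial credit.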

### Takeaways
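- Frontier models cluster at roughly 40-46% overall
- Knowledge scores are generally strong (71.4-90.5%); correctness varies far more (20.8-50.0%) and is the differentiator
- There is a clear tier gap between frontier and smaller models: gpt-3.5-turbo and groq/llama-3.1-8b-instant trail at 33.0% and 31.2% overall, both with correctness below 30%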
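
The claim that correctness separates models more than knowledge can be checked directly from the results table. This is a minimal sketch that only re-reads the published numbers; the `results` list and `column_spread` helper are illustrative names, not part of the benchmark harness.

```python
# (model, knowledge %, correctness %, overall %) copied from the results table.
results = [
    ("claude-sonnet-4-5-20250929", 88.1, 47.9, 45.6),
    ("gpt-5.2", 90.5, 44.4, 44.9),
    ("deepseek/deepseek-reasoner", 83.3, 48.6, 44.4),
    ("gpt-5-mini", 83.3, 43.8, 42.5),
    ("xai/grok-4-1-fast-reasoning", 85.7, 41.7, 42.4),
    ("claude-opus-4-5-20251101", 71.4, 50.0, 41.4),
    ("gemini/gemini-3-flash-preview", 71.4, 47.9, 40.6),
    ("deepseek/deepseek-chat", 71.4, 46.5, 40.0),
    ("xai/grok-4-1-fast-non-reasoning", 76.2, 41.7, 39.5),
    ("groq/llama-3.3-70b-versatile", 81.0, 35.4, 38.5),
    ("gpt-3.5-turbo", 73.8, 27.1, 33.0),
    ("groq/llama-3.1-8b-instant", 76.2, 20.8, 31.2),
]

COLUMNS = {"knowledge": 1, "correctness": 2, "overall": 3}


def column_spread(rows, index):
    """Return (min, max, max - min) for one score column, rounded to 0.1."""
    values = [row[index] for row in rows]
    low, high = min(values), max(values)
    return low, high, round(high - low, 1)


spreads = {name: column_spread(results, idx) for name, idx in COLUMNS.items()}

for name, (low, high, spread) in spreads.items():
    print(f"{name}: min={low}% max={high}% spread={spread} points")

# The model with the highest correctness is not the one with the highest overall.
best_correctness = max(results, key=lambda row: row[2])[0]
best_overall = max(results, key=lambda row: row[3])[0]
print(best_correctness, best_overall)
```

Correctness spans 29.2 points across the 12 models against 19.1 points for knowledge, which is consistent with the takeaway above.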
