WP-Bench Test Run
Quick test of the benchmark harness against 12 models using the wp-core-v1 dataset.
Results
| Model | Knowledge | Correctness | Overall |
| --- | --- | --- | --- |
| claude-sonnet-4-5-20250929 | 88.1% | 47.9% | 45.6% |
| gpt-5.2 | 90.5% | 44.4% | 44.9% |
| deepseek/deepseek-reasoner | 83.3% | 48.6% | 44.4% |
| gpt-5-mini | 83.3% | 43.8% | 42.5% |
| xai/grok-4-1-fast-reasoning | 85.7% | 41.7% | 42.4% |
| claude-opus-4-5-20251101 | 71.4% | 50.0% | 41.4% |
| gemini/gemini-3-flash-preview | 71.4% | 47.9% | 40.6% |
| deepseek/deepseek-chat | 71.4% | 46.5% | 40.0% |
| xai/grok-4-1-fast-non-reasoning | 76.2% | 41.7% | 39.5% |
| groq/llama-3.3-70b-versatile | 81.0% | 35.4% | 38.5% |
| gpt-3.5-turbo | 73.8% | 27.1% | 33.0% |
| groq/llama-3.1-8b-instant | 76.2% | 20.8% | 31.2% |
Dataset: wp-core-v1 (42 knowledge + 24 execution tests)
Takeaways
- Frontier models cluster around 40-46% overall
- Knowledge scores are generally strong (roughly 70-90%); correctness, which ranges from 20.8% to 50.0%, is the differentiator
- Clear tier gap between frontier and smaller models, most visible in correctness: the bottom three entries score 20.8-35.4%, the rest 41.7-50.0%
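
The takeaways above can be checked against the Results table with a short script. This is only a sketch: the scores are copied by hand from the table, and splitting the ranking at the bottom three entries is an assumption made for illustration, since the run itself does not define which models count as "frontier" or "smaller".

```python
# Scores copied from the Results table: (knowledge %, correctness %, overall %).
RESULTS = {
    "claude-sonnet-4-5-20250929": (88.1, 47.9, 45.6),
    "gpt-5.2": (90.5, 44.4, 44.9),
    "deepseek/deepseek-reasoner": (83.3, 48.6, 44.4),
    "gpt-5-mini": (83.3, 43.8, 42.5),
    "xai/grok-4-1-fast-reasoning": (85.7, 41.7, 42.4),
    "claude-opus-4-5-20251101": (71.4, 50.0, 41.4),
    "gemini/gemini-3-flash-preview": (71.4, 47.9, 40.6),
    "deepseek/deepseek-chat": (71.4, 46.5, 40.0),
    "xai/grok-4-1-fast-non-reasoning": (76.2, 41.7, 39.5),
    "groq/llama-3.3-70b-versatile": (81.0, 35.4, 38.5),
    "gpt-3.5-turbo": (73.8, 27.1, 33.0),
    "groq/llama-3.1-8b-instant": (76.2, 20.8, 31.2),
}

KNOWLEDGE, CORRECTNESS, OVERALL = 0, 1, 2


def spread(models, column):
    """Return (min, max) of one score column over the given models."""
    values = [RESULTS[m][column] for m in models]
    return min(values), max(values)


# Rank by overall score, highest first.
ranked = sorted(RESULTS, key=lambda m: RESULTS[m][OVERALL], reverse=True)

# Assumption for illustration: treat the bottom three entries as the lower tier.
upper, lower = ranked[:-3], ranked[-3:]

knowledge_range = spread(ranked, KNOWLEDGE)
correctness_range = spread(ranked, CORRECTNESS)
upper_overall = spread(upper, OVERALL)
upper_correctness = spread(upper, CORRECTNESS)
lower_correctness = spread(lower, CORRECTNESS)

print("knowledge range:", knowledge_range)
print("correctness range:", correctness_range)
print("upper tier overall:", upper_overall)
print("upper tier correctness:", upper_correctness)
print("lower tier correctness:", lower_correctness)
```

Under that split, the correctness ranges of the two groups do not overlap, while the overall scores of adjacent entries at the boundary differ by only one point, so the tier gap shows up mainly in correctness.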