Add comprehensive execution tests by JasonTheAdams · Pull Request #22 · WordPress/wp-bench

JasonTheAdams · 2026-06-04T01:10:50Z

Summary

Expand the wp-core-v1 execution suite to 150 deterministic tests across core runtime, content/media/platform, Gutenberg, and AI/platform APIs.
Remove unused AI judge/quality scaffolding and make expected_behavior reviewer-facing documentation.
Add REST/output verifier support, dataset validation, and WordPress 7.0 wp-env configuration.

Verification

.venv/bin/python -m pytest python
.venv/bin/python -m ruff check python
.venv/bin/python -m compileall -q python/wp_bench
.venv/bin/wp-bench dry-run --config wp-bench.yaml --test-type execution
.venv/bin/python datasets/export_dataset.py
git diff --check
python3 /private/tmp/verify_wpbench_execution_refs.py --mode reference
python3 /private/tmp/verify_wpbench_execution_refs.py --mode wrong

github-actions · 2026-06-04T01:11:00Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: JasonTheAdams <jason_the_adams@git.wordpress.org>
Co-authored-by: lezama <migueluy@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

lezama · 2026-06-04T15:59:15Z

nice, this is a great expansion — the suite feels solid and the judge/quality cleanup is nice and tidy 👍

couple of things i'd tweak before merging, the rest can iterate:

correctness now averages static + runtime 50/50, and an empty static check scores 1.0 — so a runtime-only test gets a free 0.5. all 150 tests have both today so it's fine in practice, but i'd guard it (only fold static in when there actually are static checks, and probably weight runtime a bit higher).
assert_rest_response passes anything when the assertion has no expected_status/expected_data/body_* (falls through to passed = true). i'd default status to 200 or treat an empty assertion as a fail.
heads up: dockerfile is still on WP 6.9 while .wp-env.json moved to 7.0 — the docker grader would run the wrong core against the 7.0 tests.

smaller stuff, totally fine to defer: loose == in expected_data ({"id":42} also accepts "42"), the match block is over-indented, and output_not_contains isn't used by any test yet.

would also love a CI smoke that runs the reference solutions through wp bench verify at some point — 147/150 are custom assertions so it'd catch a lot. not blocking 🙂

great work overall!

_score_assertions averaged the static and runtime scores 50/50 whenever both keys were present. Empty static checks report a score of 1.0 (and empty runtime checks 0.0), so a test exercising only one section had the other folded in as a free 1.0 or 0.0; e.g. a runtime-only test that failed every assertion still scored 0.5 correctness. Skip any section whose details.total_weight is 0 so only sections that actually defined checks contribute. The legacy assertions fallback is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
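A minimal sketch of the guard described in this commit message, not the actual `_score_assertions` code. The function name `combine_section_scores` and the exact shape of each section (`score` plus `details.total_weight`) are assumptions made for illustration; the equal weighting of the remaining sections is kept as the message describes.

```python
from typing import Any, Mapping, Optional


def combine_section_scores(sections: Mapping[str, Mapping[str, Any]]) -> Optional[float]:
    """Average the scores of sections that actually defined checks.

    Each section is assumed (hypothetically) to look like
    {"score": float, "details": {"total_weight": float}}.
    """
    active = [
        float(section["score"])
        for section in sections.values()
        # A total weight of 0 means the section defined no checks, so skip it.
        if section.get("details", {}).get("total_weight", 0) > 0
    ]
    if not active:
        return None  # no section defined any checks; the caller decides what to do
    return sum(active) / len(active)


# A runtime-only test that failed every assertion no longer gets a free 0.5.
runtime_only_failure = {
    "static": {"score": 1.0, "details": {"total_weight": 0}},
    "runtime": {"score": 0.0, "details": {"total_weight": 3}},
}
print(combine_section_scores(runtime_only_failure))  # 0.0
```

Returning `None` when nothing contributed is a choice made for this sketch; the real harness may fall back to the legacy assertions path instead.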

assert_rest_response fell through to passed=true when an assertion defined none of expected_status/expected_data/body_contains/body_not_contains, so a malformed assertion accepted any response (including a 500). A non-numeric expected_status was also silently skipped. Treat a present-but-non-numeric expected_status as an error, and require at least one expectation before marking the assertion passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
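The two rules in this commit message can be sketched as follows. This is an illustration in Python under stated assumptions, not the verifier's own code: `check_rest_assertion`, its parameters, and the strict-equality treatment of `expected_data` are all hypothetical; only the four expectation keys come from the message.

```python
from typing import Any, Mapping, Tuple

# Keys named in the commit message; at least one must be present.
EXPECTATION_KEYS = ("expected_status", "expected_data", "body_contains", "body_not_contains")


def check_rest_assertion(
    assertion: Mapping[str, Any], status: int, body: str, data: Any = None
) -> Tuple[bool, str]:
    """Return (passed, reason) for one REST assertion (hypothetical shape)."""
    # An assertion with no expectations is malformed, so it fails instead of passing.
    if not any(key in assertion for key in EXPECTATION_KEYS):
        return False, "assertion defines no expectations"

    if "expected_status" in assertion:
        try:
            expected = int(assertion["expected_status"])
        except (TypeError, ValueError):
            # Present but non-numeric: report an error rather than skipping the check.
            return False, "expected_status is not numeric"
        if status != expected:
            return False, f"expected status {expected}, got {status}"

    # Simplification for this sketch: expected_data is compared by strict equality.
    if "expected_data" in assertion and data != assertion["expected_data"]:
        return False, "response data does not match expected_data"
    if "body_contains" in assertion and str(assertion["body_contains"]) not in body:
        return False, "body is missing required text"
    if "body_not_contains" in assertion and str(assertion["body_not_contains"]) in body:
        return False, "body contains forbidden text"
    return True, "ok"


print(check_rest_assertion({}, 500, "Internal Server Error"))
# (False, 'assertion defines no expectations')
```

Failing closed on an empty assertion means a typo in a key name shows up as a test failure instead of a silent pass.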

.wp-env.json moved to WordPress 7.0 but the Docker grader was still pinned to 6.9, so the two execution paths ran different cores and the 7.0-only suites (AI Client, Connectors) would fail under Docker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
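One way to keep the two execution paths from drifting again is a small consistency check. The sketch below is hypothetical throughout: it assumes `.wp-env.json` pins core with a string such as `WordPress/WordPress#7.0` and that the Dockerfile pins WordPress through a `FROM wordpress:<version>` line, neither of which is confirmed by this PR. The helpers take file contents as strings so the example stays self-contained.

```python
import json
import re
from typing import Optional

_VERSION = re.compile(r"(\d+\.\d+(?:\.\d+)?)")


def wp_env_core_version(wp_env_json: str) -> Optional[str]:
    """Extract a version like '7.0' from the (assumed) 'core' field."""
    core = json.loads(wp_env_json).get("core") or ""
    match = _VERSION.search(core)
    return match.group(1) if match else None


def dockerfile_wp_version(dockerfile: str) -> Optional[str]:
    """Extract the version from an (assumed) 'FROM wordpress:<version>' line."""
    for line in dockerfile.splitlines():
        match = re.match(r"\s*FROM\s+wordpress:(\d+\.\d+(?:\.\d+)?)", line, re.IGNORECASE)
        if match:
            return match.group(1)
    return None


def cores_match(wp_env_json: str, dockerfile: str) -> bool:
    """True only when both files pin the same, detectable WordPress version."""
    env_version = wp_env_core_version(wp_env_json)
    docker_version = dockerfile_wp_version(dockerfile)
    return env_version is not None and env_version == docker_version


# Illustrative inputs mirroring the mismatch described above.
print(cores_match('{"core": "WordPress/WordPress#7.0"}', "FROM wordpress:6.9-php8.3\n"))  # False
```

A check like this could run in CI next to the existing lint and test commands, failing the build when the versions disagree or cannot be detected.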

JasonTheAdams · 2026-06-04T20:59:07Z

Thanks, @lezama! Something I'm considering is whether the test "difficulty" property is meaningful. It was added quite a while ago and made sense at the time. But looking at it now, and seeing how the difficulty was selected for various tests, it feels too subjective.

Do you see a reason to keep it? Otherwise, I'm inclined to remove it.

…ernal The public verify command let callers hand arbitrary verifier payloads directly to WP-CLI, which made it look like part of the benchmark interface. Keeping verification behind the internal runtime entrypoint preserves the harness behavior while making wp-bench run the supported path for executing tests.

The old dry-run command duplicated the run command's inputs and drifted as filtering options were added. Keeping dry-run as a flag on run makes single-test and multi-test selection use the same code path as real benchmark execution.

This captures the execution-test review workflow we developed so future updates have clearer guidance on prompts, runtime assertions, setup and teardown, difficulty, and validation.

lezama · 2026-06-05T01:40:50Z

yeah i'd drop it. poked at the data and a few things stood out:

it's not actually used anywhere — we parse, store, validate and export it, but nothing reads it for scoring, filtering, or in the report. pure dead metadata right now.
in the execution suite it barely varies anyway (123 intermediate / 27 hard, no easy tier), so even as a label it's not saying much.
and the taxonomies don't even match across suites (execution is intermediate/hard, knowledge is basic/intermediate/hard) — kind of proves your point about it being subjective.

so unless we want difficulty-sliced reporting soon, i'd remove it and re-add later with a real rubric if we ever need per-difficulty breakdowns. happy to take the removal on if it helps 🙂
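The distribution quoted in this comment is easy to reproduce with a tally over the test records. This is a sketch only: `difficulty_counts` is a hypothetical helper, and the sample list simply rebuilds the 123/27 split reported above instead of loading the real dataset.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def difficulty_counts(tests: Iterable[Mapping[str, Any]]) -> Counter:
    """Count tests per difficulty label; tests without a label count as 'unset'."""
    return Counter(test.get("difficulty", "unset") for test in tests)


# Stand-in records matching the execution-suite split reported in the comment.
tests = [{"difficulty": "intermediate"}] * 123 + [{"difficulty": "hard"}] * 27
counts = difficulty_counts(tests)
print(dict(counts))  # {'intermediate': 123, 'hard': 27}
```

Running the same tally per suite would also surface the mismatched label sets (intermediate/hard versus basic/intermediate/hard) mentioned above.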

JasonTheAdams and others added 4 commits June 4, 2026 13:24

feat: add comprehensive execution tests

e2d8d59

lezama force-pushed the update-execution-tests branch from 813c905 to a0a0676 Compare June 4, 2026 16:25

JasonTheAdams added 4 commits June 4, 2026 15:09

fix: remove redundant abilities registry initialization

fbe2393

fix: focus ability category execution test

08342f9

JasonTheAdams force-pushed the update-execution-tests branch from adae31a to 97496a4 Compare June 4, 2026 22:13

docs(agents): add wp-bench execution test skill

2f44ee6


JasonTheAdams wants to merge 9 commits into trunk from update-execution-tests


2 participants
