Add comprehensive execution tests #22

Open
JasonTheAdams wants to merge 9 commits into trunk from update-execution-tests
@JasonTheAdams
Member

Summary

  • Expand the wp-core-v1 execution suite to 150 deterministic tests across core runtime, content/media/platform, Gutenberg, and AI/platform APIs.
  • Remove unused AI judge/quality scaffolding and make expected_behavior reviewer-facing documentation.
  • Add REST/output verifier support, dataset validation, and WordPress 7.0 wp-env configuration.

Verification

  • .venv/bin/python -m pytest python
  • .venv/bin/python -m ruff check python
  • .venv/bin/python -m compileall -q python/wp_bench
  • .venv/bin/wp-bench dry-run --config wp-bench.yaml --test-type execution
  • .venv/bin/python datasets/export_dataset.py
  • git diff --check
  • python3 /private/tmp/verify_wpbench_execution_refs.py --mode reference
  • python3 /private/tmp/verify_wpbench_execution_refs.py --mode wrong

@github-actions

github-actions Bot commented Jun 4, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: JasonTheAdams <jason_the_adams@git.wordpress.org>
Co-authored-by: lezama <migueluy@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@lezama
Collaborator

lezama commented Jun 4, 2026

nice, this is a great expansion — the suite feels solid and the judge/quality cleanup is nice and tidy 👍

couple of things i'd tweak before merging, the rest can iterate:

  • correctness now averages static + runtime 50/50, and an empty static check scores 1.0 — so a runtime-only test gets a free 0.5. all 150 tests have both today so it's fine in practice, but i'd guard it (only fold static in when there actually are static checks, and probably weight runtime a bit higher).
  • assert_rest_response passes anything when the assertion has no expected_status/expected_data/body_* (falls through to passed = true). i'd default status to 200 or treat an empty assertion as a fail.
  • heads up: dockerfile is still on WP 6.9 while .wp-env.json moved to 7.0 — the docker grader would run the wrong core against the 7.0 tests.

smaller stuff, totally fine to defer: loose == in expected_data ({"id":42} also accepts "42"), the match block is over-indented, and output_not_contains isn't used by any test yet.

would also love a CI smoke that runs the reference solutions through wp bench verify at some point — 147/150 are custom assertions so it'd catch a lot. not blocking 🙂

great work overall!
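
The loose `==` note above (`{"id":42}` also accepting `"42"`) can be illustrated with a type-strict deep comparison. This is only a sketch of the idea in Python: the thread does not say which language the verifier is written in, and `strict_equal` is a hypothetical helper name, not something from the PR.

```python
from typing import Any


def strict_equal(expected: Any, actual: Any) -> bool:
    """Deep comparison that also requires matching types (hypothetical helper)."""
    # Reject cross-type matches up front, e.g. 42 vs "42" or 1 vs True.
    if type(expected) is not type(actual):
        return False
    if isinstance(expected, dict):
        # Same key set, and every value strictly equal.
        return expected.keys() == actual.keys() and all(
            strict_equal(value, actual[key]) for key, value in expected.items()
        )
    if isinstance(expected, list):
        # Same length, and every element strictly equal in order.
        return len(expected) == len(actual) and all(
            strict_equal(e, a) for e, a in zip(expected, actual)
        )
    return expected == actual
```

With this shape, `strict_equal({"id": 42}, {"id": "42"})` is rejected while `strict_equal({"id": 42}, {"id": 42})` passes.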

JasonTheAdams and others added 4 commits June 4, 2026 13:24
_score_assertions averaged the static and runtime scores 50/50 whenever
both keys were present. Empty static checks report a score of 1.0 (and
empty runtime checks 0.0), so a test exercising only one section had the
other folded in as a free 1.0 or 0.0 — e.g. a runtime-only test that
failed every assertion still scored 0.5 correctness.

Skip any section whose details.total_weight is 0 so only sections that
actually defined checks contribute. The legacy assertions fallback is
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
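
A minimal sketch of the guard this commit message describes, assuming each section is a mapping with a `score` and a `details.total_weight` field. The function name `score_sections`, the exact dictionary shape, and the 0.0 result when no section defined any checks are assumptions for illustration, not the PR's actual code. The legacy assertions fallback is not modeled here.

```python
from typing import Any, Mapping


def score_sections(sections: Mapping[str, Mapping[str, Any]]) -> float:
    """Average only the sections that actually defined checks (illustrative)."""
    scores = []
    for name in ("static", "runtime"):
        section = sections.get(name)
        if section is None:
            continue
        # A total_weight of 0 means the section defined no checks, so its
        # default score (1.0 for static, 0.0 for runtime) must not count.
        total_weight = section.get("details", {}).get("total_weight", 0)
        if total_weight == 0:
            continue
        scores.append(section["score"])
    if not scores:
        # Assumption: a test with no checks at all scores 0.0.
        return 0.0
    return sum(scores) / len(scores)
```

Under these assumptions, a runtime-only test that fails every assertion scores 0.0 instead of 0.5, because the empty static section no longer contributes its default 1.0.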
assert_rest_response fell through to passed=true when an assertion
defined none of expected_status/expected_data/body_contains/
body_not_contains, so a malformed assertion accepted any response
(including a 500). A non-numeric expected_status was also silently
skipped.

Treat a present-but-non-numeric expected_status as an error, and require
at least one expectation before marking the assertion passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
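
The two rules in this commit message can be sketched roughly as follows. This is a Python illustration of the logic only: the thread does not show the real `assert_rest_response`, so the function name `check_rest_response`, its parameters, the returned messages, and the choice to accept only integer status values are all assumptions.

```python
from typing import Any, Mapping, Tuple

# Expectation keys named in the commit message.
EXPECTATION_KEYS = (
    "expected_status",
    "expected_data",
    "body_contains",
    "body_not_contains",
)


def check_rest_response(
    assertion: Mapping[str, Any], status: int, data: Any, body: str
) -> Tuple[bool, str]:
    """Return (passed, message) for one REST assertion (illustrative only)."""
    # An assertion with no expectations must fail, not pass by default.
    if not any(key in assertion for key in EXPECTATION_KEYS):
        return False, "assertion defines no expectations"

    if "expected_status" in assertion:
        expected = assertion["expected_status"]
        # Assumption: only integers count as numeric here. bool is a
        # subclass of int in Python, so it is excluded explicitly.
        if isinstance(expected, bool) or not isinstance(expected, int):
            return False, "expected_status must be numeric"
        if status != expected:
            return False, f"expected status {expected}, got {status}"

    if "expected_data" in assertion and data != assertion["expected_data"]:
        return False, "response data does not match expected_data"

    if "body_contains" in assertion and assertion["body_contains"] not in body:
        return False, "body is missing the expected text"

    if "body_not_contains" in assertion and assertion["body_not_contains"] in body:
        return False, "body contains forbidden text"

    return True, "ok"
```

In this sketch an empty assertion checked against a 500 response fails with an explicit message, which is the behaviour the commit aims for.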
.wp-env.json moved to WordPress 7.0 but the Docker grader was still
pinned to 6.9, so the two execution paths ran different cores and the
7.0-only suites (AI Client, Connectors) would fail under Docker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lezama force-pushed the update-execution-tests branch from 813c905 to a0a0676 on June 4, 2026 16:25
@JasonTheAdams
Member Author

Thanks, @lezama! Something I'm considering is whether the test "difficulty" property is meaningful. It was added quite a while ago and made sense at the time. But looking at it now, and seeing how the difficulty was selected for various tests, it feels too subjective.

Do you see a reason to keep it? Otherwise, I'm inclined to remove it.

…ernal

The public verify command let callers hand arbitrary verifier payloads directly to WP-CLI, which made it look like part of the benchmark interface. Keeping verification behind the internal runtime entrypoint preserves the harness behavior while making wp-bench run the supported path for executing tests.
The old dry-run command duplicated the run command's inputs and drifted as filtering options were added. Keeping dry-run as a flag on run makes single-test and multi-test selection use the same code path as real benchmark execution.
@JasonTheAdams force-pushed the update-execution-tests branch from adae31a to 97496a4 on June 4, 2026 22:13
This captures the execution-test review workflow we developed so future updates have clearer guidance on prompts, runtime assertions, setup and teardown, difficulty, and validation.
@lezama
Collaborator

lezama commented Jun 5, 2026

yeah i'd drop it. poked at the data and a few things stood out:

  • it's not actually used anywhere — we parse, store, validate and export it, but nothing reads it for scoring, filtering, or in the report. pure dead metadata right now.
  • in the execution suite it barely varies anyway (123 intermediate / 27 hard, no easy tier), so even as a label it's not saying much.
  • and the taxonomies don't even match across suites (execution is intermediate/hard, knowledge is basic/intermediate/hard) — kind of proves your point about it being subjective.

so unless we want difficulty-sliced reporting soon, i'd remove it and re-add later with a real rubric if we ever need per-difficulty breakdowns. happy to take the removal on if it helps 🙂

2 participants