Add comprehensive execution tests #22
nice, this is a great expansion — the suite feels solid and the judge/quality cleanup is nice and tidy 👍 couple of things i'd tweak before merging, the rest can iterate:
smaller stuff, totally fine to defer: loose
would also love a CI smoke that runs the reference solutions through
great work overall! |
_score_assertions averaged the static and runtime scores 50/50 whenever both keys were present. Empty static checks report a score of 1.0 (and empty runtime checks 0.0), so a test exercising only one section had the other folded in as a free 1.0 or 0.0. For example, a runtime-only test that failed every assertion still scored 0.5 correctness.

Skip any section whose details.total_weight is 0 so only sections that actually defined checks contribute. The legacy assertions fallback is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
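The scoring fix described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the result shape (a `score` plus `details.total_weight` per section), the function name `score_assertions`, and the 0.0 returned when no section contributes are all assumptions, and the legacy assertions fallback is left out.

```python
from typing import Any, Mapping


def score_assertions(results: Mapping[str, Any]) -> float:
    """Average only the sections that actually defined checks.

    Assumed (hypothetical) shape of ``results``:
        {"static": {"score": 1.0, "details": {"total_weight": 0}},
         "runtime": {"score": 0.0, "details": {"total_weight": 3}}}
    """
    scores = []
    for section in ("static", "runtime"):
        data = results.get(section)
        if not data:
            continue  # section key missing entirely
        total_weight = data.get("details", {}).get("total_weight", 0)
        if total_weight == 0:
            continue  # section defined no checks, so it must not contribute
        scores.append(float(data["score"]))
    if not scores:
        return 0.0  # assumption: nothing to score counts as 0.0
    return sum(scores) / len(scores)


# A runtime-only test that failed every assertion no longer gets a free 0.5.
runtime_only_failure = {
    "static": {"score": 1.0, "details": {"total_weight": 0}},
    "runtime": {"score": 0.0, "details": {"total_weight": 3}},
}
```

With this shape, `score_assertions(runtime_only_failure)` evaluates to 0.0, whereas a plain 50/50 average of the two section scores would have produced 0.5.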
assert_rest_response fell through to passed=true when an assertion defined none of expected_status/expected_data/body_contains/body_not_contains, so a malformed assertion accepted any response (including a 500). A non-numeric expected_status was also silently skipped.

Treat a present-but-non-numeric expected_status as an error, and require at least one expectation before marking the assertion passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
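The validation order this commit describes might look something like the sketch below. It is written in Python purely for illustration: the real `assert_rest_response` may live in a different language and layer, and the helper names, the return shape, and the assumption that `body_contains` / `body_not_contains` are plain strings are all hypothetical.

```python
from typing import Any, Mapping, Optional

EXPECTATION_KEYS = (
    "expected_status",
    "expected_data",
    "body_contains",
    "body_not_contains",
)


def _parse_status(raw: Any) -> Optional[int]:
    """Return an int status, or None when the value is not numeric."""
    if isinstance(raw, bool):
        return None  # bool is an int subclass, but not a valid status
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str) and raw.strip().isdigit():
        return int(raw.strip())
    return None


def _fail(message: str) -> dict:
    return {"passed": False, "error": message}


def check_rest_assertion(
    assertion: Mapping[str, Any], status: int, body: str, data: Any = None
) -> dict:
    # Require at least one expectation, so a malformed assertion cannot pass.
    if not any(key in assertion for key in EXPECTATION_KEYS):
        return _fail("assertion defines no expectations")

    # A present-but-non-numeric expected_status is an error, not a skip.
    if "expected_status" in assertion:
        expected = _parse_status(assertion["expected_status"])
        if expected is None:
            return _fail("expected_status must be numeric")
        if status != expected:
            return _fail(f"status {status} != expected {expected}")

    if "expected_data" in assertion and data != assertion["expected_data"]:
        return _fail("response data mismatch")

    needle = assertion.get("body_contains")
    if needle is not None and needle not in body:
        return _fail(f"body does not contain {needle!r}")

    banned = assertion.get("body_not_contains")
    if banned is not None and banned in body:
        return _fail(f"body unexpectedly contains {banned!r}")

    return {"passed": True, "error": None}
```

In this sketch an empty assertion evaluated against a 500 response fails with "assertion defines no expectations" instead of falling through to a pass.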
.wp-env.json moved to WordPress 7.0 but the Docker grader was still pinned to 6.9, so the two execution paths ran different cores and the 7.0-only suites (AI Client, Connectors) would fail under Docker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
813c905 to a0a0676
|
Thanks, @lezama! Something I'm considering is whether the test "difficulty" property is meaningful. It was added quite a while ago and made sense at the time, but looking at it now and seeing how the difficulty was selected for various tests, it feels too subjective. Do you see a reason to keep it? Otherwise, I'm inclined to remove it. |
…ernal

The public verify command let callers hand arbitrary verifier payloads directly to WP-CLI, which made it look like part of the benchmark interface. Keeping verification behind the internal runtime entrypoint preserves the harness behavior while making wp-bench run the supported path for executing tests.
The old dry-run command duplicated the run command's inputs and drifted as filtering options were added. Keeping dry-run as a flag on run makes single-test and multi-test selection use the same code path as real benchmark execution.
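As a rough sketch of why a flag keeps the two paths aligned, the example below routes both modes through a single selection function. It uses `argparse` only for illustration: the actual wp-bench CLI framework, the `--dry-run` spelling, the sample test records, and the "knowledge" test type are assumptions; `--config` and `--test-type` are taken from the commands quoted in this PR.

```python
import argparse
from typing import List, Optional

# Hypothetical sample data; real tests come from the dataset files.
TESTS = [
    {"id": "exec-001", "type": "execution"},
    {"id": "know-001", "type": "knowledge"},
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="wp-bench")
    commands = parser.add_subparsers(dest="command", required=True)
    run = commands.add_parser("run", help="execute the selected tests")
    run.add_argument("--config", default="wp-bench.yaml")
    run.add_argument("--test-type", default=None)
    # Dry-run is a flag on `run`, not a separate command with its own options.
    run.add_argument("--dry-run", action="store_true")
    return parser


def select_tests(tests: List[dict], test_type: Optional[str]) -> List[dict]:
    """Single filtering path shared by real runs and dry runs."""
    return [t for t in tests if test_type is None or t["type"] == test_type]


def run_command(args: argparse.Namespace, tests: List[dict]) -> List[str]:
    selected = select_tests(tests, args.test_type)
    verb = "would run" if args.dry_run else "ran"
    return [f"{verb} {t['id']}" for t in selected]
```

Because `select_tests` is the only place filtering happens, any new filtering option added to `run` applies to dry runs automatically.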
adae31a to 97496a4
This captures the execution-test review workflow we developed so future updates have clearer guidance on prompts, runtime assertions, setup and teardown, difficulty, and validation.
|
yeah i'd drop it. poked at the data and a few things stood out:
so unless we want difficulty-sliced reporting soon, i'd remove it and re-add later with a real rubric if we ever need per-difficulty breakdowns. happy to take the removal on if it helps 🙂 |
Summary
wp-core-v1 execution suite to 150 deterministic tests across core runtime, content/media/platform, Gutenberg, and AI/platform APIs.
expected_behavior reviewer-facing documentation.
Verification
.venv/bin/python -m pytest python
.venv/bin/python -m ruff check python
.venv/bin/python -m compileall -q python/wp_bench
.venv/bin/wp-bench dry-run --config wp-bench.yaml --test-type execution
.venv/bin/python datasets/export_dataset.py
git diff --check
python3 /private/tmp/verify_wpbench_execution_refs.py --mode reference
python3 /private/tmp/verify_wpbench_execution_refs.py --mode wrong