125 changes: 125 additions & 0 deletions .agents/skills/wp-bench-execution-tests/SKILL.md
@@ -0,0 +1,125 @@
---
name: wp-bench-execution-tests
description: Add, revise, or review WP-Bench WordPress execution tests. Use when working on datasets/suites/*/execution JSON, runtime_checks, static_checks, reference_solution, expected_behavior, test ID filtering, WordPress API benchmark coverage, or PR review comments about execution test quality.
---

# WP-Bench Execution Tests

Use this skill when adding or reviewing execution tests for WP-Bench.

## Workflow

1. Inspect nearby execution and knowledge tests before editing. Match the suite's organization, naming style, and category balance.
2. Treat the WordPress source/runtime as the authority. For modern APIs, verify behavior against WordPress 7.0 source or official field-guide docs before writing assertions.
3. Define the observable WordPress behavior first. Prompts should be specific enough to identify the intended API area and outcome, but should not give away exact implementation details that the test is meant to measure, such as a particular argument key, metadata field, or helper call. Ask for a behavior or artifact, not an arbitrary wrapper function, unless the function itself is the contract.
4. Keep `requirements` concise and model-facing. They are appended to the prompt.
5. Keep `expected_behavior` reviewer-facing. It documents the contract and review focus; it is not used for scoring.
6. Use `reference_solution` as the canonical passing implementation. It is for verification and maintenance, not model input.
7. Make static checks robust for the contract: require expected functions, methods, classes, hooks, slugs, schema keys, and other identifiers when their use is essential to the task. Do not require incidental helpers or checker calls that the runtime assertion can perform itself.
8. Make runtime checks test the behavior inside WordPress. Use built-in assertion types when they directly express the check, such as output containment or REST response checks. Use `custom_assertion` when the verifier needs PHP to inspect the result, such as checking a registered category, returned value, database state, capability result, dispatched hook, or computed WordPress output.
9. Verify `reference_solution` with `wp-bench run --check-reference-solution` for every new or modified execution test.

## Field Semantics

- `prompt`: The task sent to the model.
- `requirements`: Additional model-facing constraints.
- `expected_behavior`: Reviewer documentation.
- `reference_solution`: Canonical passing code used for author verification.
- `static_checks`: Coarse guardrails for required or forbidden code patterns.
- `runtime_checks.setup`: Optional PHP fixture setup evaluated before the submitted code.
- `runtime_checks.assertions`: WordPress-executed behavioral assertions evaluated after the submitted code.
- `runtime_checks.teardown`: Optional PHP cleanup evaluated after assertions, even when setup, submitted code, or assertions fail. Use it for cleanup, not correctness.
- `metadata.source_refs`: Required source pointers.
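The split between model-facing and reviewer-facing fields can be sketched in Python. This is a hypothetical illustration, not the harness's actual code: the function name `build_model_input` and the exact text layout used to append requirements are assumptions.

```python
from typing import Any


def build_model_input(test: dict[str, Any]) -> str:
    """Assemble the text a model would see for one execution test.

    Hypothetical helper: only `prompt` and `requirements` are model-facing.
    `expected_behavior` and `reference_solution` are deliberately never read.
    """
    parts = [test["prompt"]]
    requirements = test.get("requirements", [])
    if requirements:
        # Assumed layout; the real harness may append requirements differently.
        parts.append("Requirements:")
        parts.extend(f"- {item}" for item in requirements)
    return "\n".join(parts)
```

Under these assumptions, reviewer notes in `expected_behavior` can be as explicit as needed without leaking implementation hints to the model.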

## Prompt And Assertion Shape

Write tests around the contract, not the harness mechanics.

Good:

```json
{
  "prompt": "Register an Abilities API category with the slug 'wpbp-tools' so it is discoverable by WordPress.",
  "static_checks": {
    "required_patterns": [
      { "pattern": "wp_register_ability_category", "description": "Uses the Abilities category API", "weight": 1 },
      { "pattern": "wpbp-tools", "description": "Registers the requested category slug", "weight": 1 },
      { "pattern": "wp_abilities_api_categories_init", "description": "Uses the category init hook", "weight": 1 }
    ]
  },
  "runtime_checks": {
    "assertions": [
      {
        "type": "custom_assertion",
        "code": "return wp_has_ability_category( 'wpbp-tools' );",
        "description": "The wpbp-tools category is discoverable",
        "weight": 1
      }
    ]
  }
}
```

Avoid:

- Requiring a wrapper function name unless implementing that function is the real task.
- Requiring the model to call the same checker API that the runtime assertion can call.
- Putting fixture cleanup inside assertions instead of `runtime_checks.teardown`.
- Adding cleanup by habit when the state is process-local.
- Making `prompt` and `expected_behavior` duplicates.

Static check patterns are regular expressions. Delimiterless patterns are wrapped by the runtime, so simple slugs like `wpbp/count-words` can be written without escaping. Use explicit regex delimiters only when flags are needed, such as `/pattern/i`.

Use the pattern list to enforce important API surface, not just one token from the prompt. If a task requires retrieving an ability and executing it, check for the function, method, and ability name, such as `wp_get_ability`, `execute`, and `wpbp/add-one`.
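The pattern handling described above can be approximated in Python to sanity-check a pattern list before running the verifier. This is a rough sketch under stated assumptions: the real runtime evaluates patterns in PHP, and the helper names `compile_static_pattern` and `missing_patterns`, as well as the supported flag set, are hypothetical.

```python
import re

# Assumed subset of regex flags; the real runtime may accept others.
_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def compile_static_pattern(pattern: str) -> re.Pattern[str]:
    """Compile a static-check pattern (hypothetical helper).

    `/body/flags` is treated as an explicitly delimited regex; anything else
    is treated as a delimiterless regex and compiled as written.
    """
    delimited = re.fullmatch(r"/(.+)/([a-zA-Z]*)", pattern, flags=re.DOTALL)
    if delimited is None:
        return re.compile(pattern)
    body, flag_chars = delimited.groups()
    flags = 0
    for char in flag_chars:
        if char not in _FLAGS:
            raise ValueError(f"unsupported regex flag: {char!r}")
        flags |= _FLAGS[char]
    return re.compile(body, flags)


def missing_patterns(code: str, required: list[str]) -> list[str]:
    """Return the required patterns that do not match anywhere in `code`."""
    return [p for p in required if compile_static_pattern(p).search(code) is None]
```

For the retrieve-and-execute example, `missing_patterns(code, ["wp_get_ability", "execute", "wpbp/add-one"])` would return an empty list only when all three identifiers appear in the submitted code.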

## Difficulty

Treat `difficulty` as an author-estimated measure of implementation complexity. It is not a scoring input.

- `basic`: One obvious API or behavior, minimal setup.
- `intermediate`: Combines multiple WordPress concepts or requires lifecycle timing, setup, teardown, or edge handling.
- `hard`: Requires newer/obscure APIs plus nontrivial interaction, permissions, schemas, REST exposure, block/editor internals, or runtime reasoning.

Do not mark a test `hard` only because the API is new.

## Setup, Teardown, And Isolation

Runtime order is `setup`, submitted code, assertions, then `teardown`.

Use `runtime_checks.setup` to create fixtures the submitted code or assertions need. Use `runtime_checks.teardown` to remove persistent fixtures and restore global state. Keep assertions focused on measuring behavior.
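The ordering and the always-run teardown can be illustrated with a small control-flow sketch. It assumes each phase is a plain Python callable, whereas the real verifier evaluates PHP inside WordPress; the function name `run_execution_test` and the result shape are hypothetical.

```python
from typing import Callable, Optional


def run_execution_test(
    setup: Optional[Callable[[], None]],
    submitted_code: Callable[[], None],
    assertions: list[Callable[[], bool]],
    teardown: Optional[Callable[[], None]],
) -> dict:
    """Run the phases in order; teardown runs even if an earlier phase raises."""
    order: list[str] = []
    result = {"passed": False, "error": None, "order": order}
    try:
        if setup is not None:
            order.append("setup")
            setup()
        order.append("submitted_code")
        submitted_code()
        order.append("assertions")
        result["passed"] = all(check() for check in assertions)
    except Exception as exc:  # a failing phase is recorded, not re-raised
        result["error"] = str(exc)
    finally:
        if teardown is not None:
            order.append("teardown")
            teardown()
    return result
```

Because teardown sits in the `finally` branch, it cannot influence `passed`, which mirrors the guidance to use it for cleanup rather than correctness.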

Clean up state in `teardown` when it persists beyond the PHP process or can affect later assertions:

- posts, users, terms, comments, options, metadata
- scheduled cron events and transients
- object cache values with reusable keys/groups
- files or uploads created during the test

Avoid cleanup for in-process-only registries when each verifier run starts a fresh WP-CLI process. Extra cleanup can make failing cases noisy and less diagnostic.

## Validation

For each changed test, run:

```bash
.venv/bin/python -m pytest python/tests/test_execution_dataset.py
.venv/bin/wp-bench run --config wp-bench.yaml --dry-run --test-type execution --test-id <test-id>
.venv/bin/wp-bench run --config wp-bench.yaml --check-reference-solution --test-type execution --test-id <test-id>
```

Require the dry run to select only the requested test ID or IDs. Require the reference-solution run to execute the selected tests through the real WordPress verifier, without model calls, and pass every selected test.
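The selection requirement for the dry run can be expressed as a small filter. This is a hypothetical sketch rather than the CLI's implementation: `select_tests`, the `type` key, and the choice to raise on unknown IDs are all assumptions.

```python
from typing import Any, Iterable, Optional


def select_tests(
    tests: list[dict[str, Any]],
    test_ids: Optional[Iterable[str]] = None,
    test_type: Optional[str] = None,
) -> list[dict[str, Any]]:
    """Filter tests by type and explicit IDs (an empty ID list means "all")."""
    wanted = set(test_ids or [])
    selected = [
        t
        for t in tests
        if (test_type is None or t.get("type") == test_type)
        and (not wanted or t["id"] in wanted)
    ]
    # Assumed policy: a requested ID that matches nothing is treated as an error.
    unknown = wanted - {t["id"] for t in selected}
    if unknown:
        raise ValueError(f"unknown test IDs: {sorted(unknown)}")
    return selected
```

A dry run that reports more tests than the IDs passed to it suggests the filter was not applied, which is what the check above is meant to catch.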

For broad suite changes, also run:

```bash
.venv/bin/wp-bench run --config wp-bench.yaml --dry-run --test-type execution
.venv/bin/wp-bench run --config wp-bench.yaml --check-reference-solution --test-type execution
.venv/bin/python datasets/export_dataset.py
git diff --check
```

## Determinism

- AI Client tests must not make live provider calls or require credentials.
- Avoid network access, uncontrolled time, random IDs without cleanup, and dependencies on unrelated global state.
- Prefer deterministic WordPress fixtures created by setup code and removed by teardown when persistent.
15 changes: 7 additions & 8 deletions README.md
@@ -7,7 +7,7 @@ The official WordPress AI benchmark. Evaluate how well language models understan
WP-Bench measures AI model capabilities across two dimensions:

- **Knowledge** — Multiple-choice and short-answer questions testing WordPress concepts, APIs, and best practices
-- **Execution** — Code generation tasks graded by a real WordPress runtime for correctness and quality
+- **Execution** — Code generation tasks graded by static checks and runtime assertions in a real WordPress environment

The benchmark uses WordPress itself as the grader, running generated code in a sandboxed environment with static analysis and runtime assertions.

@@ -82,6 +82,8 @@ grader:
run:
suite: wp-core-v1
limit: 10 # limit tests (null = all)
+test_ids: [] # optional explicit test IDs to run
+dry_run: false # load/filter tests without calling models
concurrency: 4

output:
@@ -97,7 +99,9 @@ wp-bench run --config wp-bench.yaml # run with config file
wp-bench run --model-name gpt-4o --limit 5 # quick single-model test
wp-bench run --test-type knowledge # run only knowledge tests (no WordPress env needed)
wp-bench run --test-type execution # run only execution tests
-wp-bench dry-run --config wp-bench.yaml # validate config without calling models
+wp-bench run --test-type execution --test-id e-abilities-api-001
+wp-bench run --test-id e-abilities-api-001 --test-id e-rest-api-001
+wp-bench run --config wp-bench.yaml --dry-run # validate config without calling models
```

## Repository Structure
@@ -146,16 +150,11 @@ The notebook generates:
## How Grading Works

1. The harness sends a prompt to the model requesting WordPress code
-2. Generated code is sent to the WordPress runtime via WP-CLI
+2. Generated code is sent to the WordPress runtime
3. The runtime performs static analysis (syntax, coding standards, security)
4. Code executes in a sandbox with test assertions
5. Results return as JSON with scores and detailed feedback

-```bash
-# Manual grading example (run from runtime/ directory)
-npm run wp-bench -- verify --payload=$(echo '{"code":"<?php echo 1;"}' | base64)
-```

## Development

```bash
1 change: 1 addition & 0 deletions datasets/README.md
@@ -65,6 +65,7 @@ dataset:
|-------|------|-------------|
| `id` | string | Unique test ID |
| `prompt` | string | Task description for the model |
+| `expected_behavior` | string | Reviewer-facing contract describing the behavior assertions should cover |
| `requirements` | array | List of requirements the solution must meet |
| `static_checks` | object | Regex patterns to check in generated code |
| `runtime_checks` | object | Assertions to run in WordPress environment |
4 changes: 2 additions & 2 deletions datasets/export_dataset.py
@@ -36,6 +36,7 @@ def load_suite(suite_name: str) -> list[dict]:
"test_kind": "execution",
"type": "execution",
"prompt": t["prompt"],
+"expected_behavior": t.get("expected_behavior", ""),
"category": t.get("category", "general"),
"difficulty": t.get("difficulty", "unknown"),
"choices": orjson.dumps(t.get("choices", [])).decode(),
@@ -44,7 +45,6 @@ def load_suite(suite_name: str) -> list[dict]:
"requirements": orjson.dumps(t.get("requirements", [])).decode(),
"static_checks": orjson.dumps(t.get("static_checks", {})).decode(),
"runtime_checks": orjson.dumps(t.get("runtime_checks", {})).decode(),
-"judge_config": orjson.dumps(t.get("judge_config", {})).decode(),
"reference_solution": t.get("reference_solution", ""),
})

@@ -60,6 +60,7 @@ def load_suite(suite_name: str) -> list[dict]:
"test_kind": "knowledge",
"type": t.get("type", "knowledge"),
"prompt": t["prompt"],
+"expected_behavior": "",
"category": t.get("category", "general"),
"difficulty": t.get("difficulty", "unknown"),
"choices": orjson.dumps(t.get("choices", [])).decode(),
@@ -68,7 +69,6 @@ def load_suite(suite_name: str) -> list[dict]:
"requirements": "[]",
"static_checks": "{}",
"runtime_checks": "{}",
-"judge_config": "{}",
"reference_solution": "",
})
