Executed numeric-oracle verification (blind, numbers-vs-numbers) by cmaloney111 · Pull Request #258 · psi-oss/get-physics-done

cmaloney111 · 2026-06-04T22:09:36Z

What & why

GPD's verifier is its moat, but a code audit confirmed the core today runs no executed physics check: run_check is keyword/regex triage that returns passes_physics: False, and the decisive verdict terminates in LLM self-report. This PR adds a real, deterministic numbers-vs-numbers check so a decisive quantitative claim is verified by comparing an independent blind re-derivation against the claim — not by trusting prose.

The engine stays deterministic (zero LLM/MCP calls): the agent calls the gpd-compute oracle to run both evaluators (the claim's and the blind re-derivation's), and the engine only diffs the returned numbers and content-addresses the verdict.

Engine (deterministic core — ships independently, no external package or model needed)

core/numeric_oracle.py — NumericOracleComparison + compare_numeric_oracle (tri-state GREEN / RED / INCONCLUSIVE, fail-closed) + a content-addressed kernel verdict, modeled on reproducibility.py. INCONCLUSIVE lives here; the shared kernel Result stays binary.
verification_checks.py — new check 5.25 contract.numeric_oracle_agreement.
contracts.py — additive, optional ContractObservable numeric-oracle fields (backward-compatible).
verification_server.py — run_contract_check branch: GREEN→pass, RED→fail, INCONCLUSIVE→insufficient_evidence. Anti-fabrication (every accepted number needs a backing executed-cell hash) + proposition-fidelity guard (agreement on the wrong quantity cannot pass GREEN).
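
The engine behaviour listed above can be sketched roughly as follows. This is a minimal illustration and not the code in core/numeric_oracle.py: the `Evaluated` record, its `cell_hash` and `quantity` fields, the `compare` and `verdict_digest` helpers, and the default tolerance are all hypothetical names and values chosen for the example. Returning INCONCLUSIVE when the two sides computed different quantities is also an assumption; the description only states that such agreement cannot pass GREEN.

```python
import hashlib
import json
import math
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(str, Enum):
    GREEN = "GREEN"
    RED = "RED"
    INCONCLUSIVE = "INCONCLUSIVE"


@dataclass(frozen=True)
class Evaluated:
    """One number returned by the oracle (all field names are hypothetical)."""
    value: Optional[float]
    cell_hash: Optional[str]  # hash of the executed cell that produced the value
    quantity: str             # label of the quantity that was computed


# Status mapping as stated in the PR description.
STATUS = {
    Verdict.GREEN: "pass",
    Verdict.RED: "fail",
    Verdict.INCONCLUSIVE: "insufficient_evidence",
}


def compare(claim: Evaluated, blind: Evaluated, rel_tol: float = 1e-9) -> Verdict:
    """Fail-closed comparison: anything doubtful is INCONCLUSIVE, never GREEN."""
    for side in (claim, blind):
        # Anti-fabrication: a number with no backing executed-cell hash is rejected.
        if side.value is None or not side.cell_hash:
            return Verdict.INCONCLUSIVE
        if not math.isfinite(side.value):
            return Verdict.INCONCLUSIVE
    # Proposition fidelity: agreement on a different quantity must not pass.
    if claim.quantity != blind.quantity:
        return Verdict.INCONCLUSIVE
    if math.isclose(claim.value, blind.value, rel_tol=rel_tol, abs_tol=0.0):
        return Verdict.GREEN
    return Verdict.RED


def verdict_digest(claim: Evaluated, blind: Evaluated, verdict: Verdict) -> str:
    """Content-address the verdict: same inputs always give the same SHA-256 digest."""
    payload = {
        "claim": [claim.value, claim.cell_hash, claim.quantity],
        "blind": [blind.value, blind.cell_hash, blind.quantity],
        "verdict": verdict.value,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The comparison makes no model or network call, so repeated runs on the same inputs yield the same verdict and the same digest, which is the property the determinism tests would exercise.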

Orchestration

New gpd-blind-deriver agent — blinded by tool starvation (only gpd-compute MCP tools; no file/shell/web), sees only the problem statement + ConventionLock, never the answer.
gpd-verifier — Level-3 blind oracle sub-protocol + Oracle Gate hook; full protocol in references/verification/core/blind-oracle-subprotocol.md (kept out of the verifier hotspot prompt).
verify-work — blind+oracle step in inventory-build; verdict surfaced in interactive-validation; INCONCLUSIVE routes to expert_needed (not gap-closure), only RED is a gaps_found issue.
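
The orchestration rules above can be pictured with a small sketch. It is an illustration under assumptions, not the workflow's implementation: the tool names, the `gpd-compute` prefix convention, and the `verified` label for GREEN are hypothetical; only the RED and INCONCLUSIVE routes come from the description.

```python
from enum import Enum
from typing import List


class Verdict(str, Enum):
    GREEN = "GREEN"
    RED = "RED"
    INCONCLUSIVE = "INCONCLUSIVE"


# Hypothetical naming convention for the compute tools exposed over MCP.
ALLOWED_TOOL_PREFIX = "gpd-compute"


def visible_tools(all_tools: List[str]) -> List[str]:
    """Tool starvation: the blind deriver is only offered compute tools."""
    return [tool for tool in all_tools if tool.startswith(ALLOWED_TOOL_PREFIX)]


def route(verdict: Verdict) -> str:
    """Route an oracle verdict to a verify-work outcome."""
    if verdict is Verdict.RED:
        return "gaps_found"      # a real disagreement: open a gap
    if verdict is Verdict.INCONCLUSIVE:
        return "expert_needed"   # no independent number: escalate, do not open a gap
    return "verified"            # hypothetical label for the GREEN case
```

Filtering the tool list, rather than instructing the agent not to look, means the blinding does not depend on the agent following a prompt.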

Registration

gpd-compute is a separate public package (psi-oss/gpd-compute) — a no-network sandboxed mpmath/numpy evaluator. Wired here as an optional MCP server via a thin in-repo bridge, gated by module_check so it is inert until installed. Infra descriptor + repo-graph contract regenerated. No compute pyproject extra yet (the package isn't on PyPI; adding it now would make uv.lock unsolvable) — a publish-time follow-up.
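
One way to picture the module_check gate is the sketch below. It assumes the gate is an importability probe; the descriptor shape, the `module_available` and `enabled_servers` helpers, and the module name `gpd_compute` are hypothetical and may differ from what builtin_servers.py does.

```python
import importlib.util
from typing import Dict, List


def module_available(name: str) -> bool:
    """Return True if the module can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for a missing parent package or a malformed module name.
        return False


def enabled_servers(descriptors: List[Dict[str, str]]) -> List[str]:
    """Keep servers with no gate, or whose gated module is installed."""
    enabled = []
    for descriptor in descriptors:
        gate = descriptor.get("module_check")
        if not gate or module_available(gate):
            enabled.append(descriptor["name"])
    return enabled


# Hypothetical descriptors: the bridge stays inert until its module is installed.
servers = [
    {"name": "verification"},
    {"name": "compute", "module_check": "gpd_compute"},
]
active = enabled_servers(servers)
```

Probing with `find_spec` avoids importing the optional package at registration time, so a missing install cannot break server startup.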

Honest scope

On flagship hep-th (interacting-QFT path integrals, GR-tensor canonicalization) an independent numeric evaluator often can't be constructed, so the steady-state verdict there is INCONCLUSIVE — intended, not failure. v1 reliably covers closed-form / integral / special-function / asymptotic / dimensionless quantities. GREEN-rate on real trace problems should be measured before any "revolutionary" framing.
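
As an illustration of the kind of quantity v1 is said to cover, here is a stdlib-only sketch of a numbers-vs-numbers check on a simple integral. It does not use gpd-compute; the example integral, the `simpson` and `agrees` helpers, the cutoff, and the tolerance are all assumptions made for the example. The claimed closed form is Γ(3) = 2 for the integral of x² e^(-x) from 0 to infinity, and the independent evaluation is a composite Simpson rule on a truncated range.

```python
import math
from typing import Callable


def simpson(f: Callable[[float], float], a: float, b: float, n: int) -> float:
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even integer")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def integrand(x: float) -> float:
    return x * x * math.exp(-x)


def agrees(claimed: float, rel_tol: float = 1e-8) -> bool:
    """Compare a claimed value with an independent numeric evaluation."""
    # The tail beyond x = 60 is negligible at this tolerance.
    independent = simpson(integrand, 0.0, 60.0, 6000)
    return math.isclose(claimed, independent, rel_tol=rel_tol)


correct_claim = agrees(math.gamma(3.0))  # closed form: Gamma(3) = 2
wrong_claim = agrees(math.gamma(4.0))    # a mistaken claim: Gamma(4) = 6
```

The correct closed form agrees with the quadrature and the mistaken one does not. No comparable independent evaluator exists for many of the hep-th quantities named above, which is why those cases end in INCONCLUSIVE.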

Tests

New: tests/core/test_numeric_oracle.py, tests/mcp/test_numeric_oracle_contract_check.py (GREEN/RED/every INCONCLUSIVE path, determinism).
gpd-compute repo: 10 tests pass incl. the real subprocess sandbox (network/import escapes blocked, values match, special functions run).
Parity baselines updated for the new check, the new agent, and the verifier's oracle hooks (counts, supported-key enumerations, prompt-surface budgets).

CI note

A compare-experiment codex prompt-projection budget was 8 chars over baseline (7708 > 7700) — pre-existing drift on base main, independent of this change (none of the files this PR touches are in that command's projection chain). Bumped the budget to 7750 in projection_budget_support.py so CI is green; no behavior change.

GPD's verifier today runs no executed physics check: run_check is keyword triage (passes_physics hardcoded False) and the verdict terminates in LLM self-report. This adds a real, deterministic numbers-vs-numbers check.

Engine (deterministic, zero LLM/oracle calls; it only diffs returned numbers and content-addresses the verdict):
- core/numeric_oracle.py: NumericOracleComparison + compare_numeric_oracle (tri-state GREEN/RED/INCONCLUSIVE, fail-closed) + a content-addressed kernel verdict, modeled on reproducibility.py. INCONCLUSIVE lives here; the kernel Result stays binary.
- verification_checks.py: new check 5.25 contract.numeric_oracle_agreement.
- contracts.py: additive, optional ContractObservable numeric-oracle fields.
- verification_server.py: run_contract_check branch mapping GREEN->pass, RED->fail, INCONCLUSIVE->insufficient_evidence (anti-fabrication: every accepted number needs a backing executed-cell hash; proposition-fidelity guard rejects agreement on the wrong quantity).

Orchestration:
- New gpd-blind-deriver agent: blinded by tool starvation (gpd-compute MCP tools only; no file/shell/web), sees only the problem + ConventionLock.
- gpd-verifier: Level-3 blind oracle sub-protocol + Oracle Gate hook, detailed in references/verification/core/blind-oracle-subprotocol.md.
- verify-work workflow: blind+oracle step in inventory-build; oracle verdict surfaced in interactive-validation; INCONCLUSIVE routes to expert_needed (not gap-closure), RED is a gaps_found issue.

Registration: gpd-compute (separate public package, psi-oss/gpd-compute) wired as an optional MCP server via a thin in-repo bridge, gated by module_check so it is inert until installed; infra descriptor + repo-graph contract regenerated.

Tests: numeric_oracle unit + contract-check MCP suites; parity baselines updated for the new check, agent, and verifier oracle hooks.

coderabbitai · 2026-06-04T22:09:45Z

Warning

Review limit reached

@cmaloney111, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 2 minutes and 10 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 29e30dca-381e-43ce-a6b5-bda5edd9858a

📥 Commits

Reviewing files that changed from the base of the PR and between 0f41769 and b08384c.

📒 Files selected for processing (28)

infra/gpd-compute.json
pyproject.toml
src/gpd/agents/gpd-blind-deriver.md
src/gpd/agents/gpd-verifier.md
src/gpd/contracts.py
src/gpd/core/config.py
src/gpd/core/numeric_oracle.py
src/gpd/core/verification_checks.py
src/gpd/mcp/builtin_servers.py
src/gpd/mcp/servers/compute_bridge.py
src/gpd/mcp/servers/verification_server.py
src/gpd/specs/references/verification/core/blind-oracle-subprotocol.md
src/gpd/specs/templates/plan-contract-schema.md
src/gpd/specs/workflows/set-profile.md
src/gpd/specs/workflows/verify-work/gap-repair.md
src/gpd/specs/workflows/verify-work/interactive-validation.md
src/gpd/specs/workflows/verify-work/inventory-build.md
tests/README.md
tests/adapters/projection_budget_support.py
tests/core/test_config.py
tests/core/test_numeric_oracle.py
tests/core/test_prompt_exactness_budget.py
tests/core/test_prompt_surface_diagnostics_budget.py
tests/core/test_verifier_prompt_budget.py
tests/mcp/test_numeric_oracle_contract_check.py
tests/mcp/test_servers.py
tests/mcp/test_verification_contract_request_regressions.py
tests/repo_graph_contract.json


The codex projection of compare-experiment measures 7708 chars against a 7700 baseline (drift on base main, unrelated to the numeric-oracle change — none of this branch's files are in that command's projection chain). Nudge the budget so CI is green.

cmaloney111 wants to merge 2 commits into main from cameron/numeric-oracle-verification
