
Executed numeric-oracle verification (blind, numbers-vs-numbers) #258

Open
cmaloney111 wants to merge 2 commits into main from cameron/numeric-oracle-verification


@cmaloney111 cmaloney111 commented Jun 4, 2026

Collaborator

What & why

GPD's verifier is its moat, but a code audit confirmed the core today runs no executed physics check: run_check is keyword/regex triage that returns passes_physics: False, and the decisive verdict terminates in LLM self-report. This PR adds a real, deterministic numbers-vs-numbers check so a decisive quantitative claim is verified by comparing an independent blind re-derivation against the claim — not by trusting prose.

The engine stays deterministic (zero LLM/MCP calls): the agent calls the gpd-compute oracle to run both evaluators (the claim's and the blind re-derivation's), and the engine only diffs the returned numbers and content-addresses the verdict.
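A minimal sketch of what "diff the returned numbers and content-address the verdict" could look like, using only the standard library. The function name `diff_and_address`, the record fields, and the default tolerance are assumptions for illustration, not the PR's actual API.

```python
import hashlib
import json
import math


def diff_and_address(claimed: float, rederived: float, rel_tol: float = 1e-9) -> dict:
    """Compare two already-computed numbers and hash the resulting record.

    Hypothetical helper: no LLM or oracle call happens here, so the same
    inputs always yield the same record and the same digest.
    """
    record = {
        # repr() keeps the full float precision in the hashed payload.
        "claimed": repr(claimed),
        "rederived": repr(rederived),
        "rel_tol": repr(rel_tol),
        "agree": math.isclose(claimed, rederived, rel_tol=rel_tol),
    }
    # Canonical JSON (sorted keys, fixed separators) makes the hash stable.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["digest"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record
```

Because the digest is computed over a canonical serialization of the inputs and the outcome, two runs on the same numbers can be checked for identity by comparing digests alone.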

Engine (deterministic core — ships independently, no external package or model needed)

  • core/numeric_oracle.py: NumericOracleComparison + compare_numeric_oracle (tri-state GREEN / RED / INCONCLUSIVE, fail-closed) + a content-addressed kernel verdict, modeled on reproducibility.py. INCONCLUSIVE lives here; the shared kernel Result stays binary.
  • verification_checks.py: new check 5.25 contract.numeric_oracle_agreement.
  • contracts.py: additive, optional ContractObservable numeric-oracle fields (backward-compatible).
  • verification_server.py: run_contract_check branch mapping GREEN→pass, RED→fail, INCONCLUSIVE→insufficient_evidence. Anti-fabrication (every accepted number needs a backing executed-cell hash) + proposition-fidelity guard (agreement on the wrong quantity cannot pass GREEN).
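The tri-state, fail-closed behavior and the two guards listed above can be sketched roughly as follows. The `Evidence` dataclass, the `classify` function, the tolerance, and the choice to return INCONCLUSIVE when the quantities differ are assumptions made for illustration; only the verdict names and the verdict-to-status mapping come from this description.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Mapping stated in the PR description.
STATUS_BY_VERDICT = {
    "GREEN": "pass",
    "RED": "fail",
    "INCONCLUSIVE": "insufficient_evidence",
}


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record for one side of the comparison."""
    value: Optional[float]
    cell_hash: Optional[str]  # hash of the executed cell that produced the value
    quantity: str             # identifier of the quantity that was evaluated


def classify(claim: Evidence, blind: Evidence, rel_tol: float = 1e-9) -> str:
    """Return GREEN, RED, or INCONCLUSIVE; anything unusable fails closed."""
    for side in (claim, blind):
        # Anti-fabrication: a number without an executed-cell hash is not accepted.
        if side.value is None or not side.cell_hash:
            return "INCONCLUSIVE"
        if not math.isfinite(side.value):
            return "INCONCLUSIVE"
    # Proposition fidelity: agreement on a different quantity must not pass.
    # (Returning INCONCLUSIVE here is an assumption of this sketch.)
    if claim.quantity != blind.quantity:
        return "INCONCLUSIVE"
    agree = math.isclose(claim.value, blind.value, rel_tol=rel_tol)
    return "GREEN" if agree else "RED"
```

In this sketch the only route to GREEN is two finite, hash-backed numbers for the same quantity that agree within tolerance; every other path yields RED or INCONCLUSIVE.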

Orchestration

  • New gpd-blind-deriver agent — blinded by tool starvation (only gpd-compute MCP tools; no file/shell/web), sees only the problem statement + ConventionLock, never the answer.
  • gpd-verifier — Level-3 blind oracle sub-protocol + Oracle Gate hook; full protocol in references/verification/core/blind-oracle-subprotocol.md (kept out of the verifier hotspot prompt).
  • verify-work — blind+oracle step in inventory-build; verdict surfaced in interactive-validation; INCONCLUSIVE routes to expert_needed (not gap-closure), only RED is a gaps_found issue.
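As a rough illustration of the orchestration rules above, the sketch below filters an agent's tool list down to one prefix ("tool starvation") and routes a verdict to its follow-up. The function names and the example tool names are hypothetical. The route for GREEN is not stated in this description, so the sketch returns `None` for it.

```python
from typing import Iterable, List, Optional


def starve_tools(available: Iterable[str], allowed_prefix: str = "gpd-compute") -> List[str]:
    """Keep only tools under the allowed prefix (hypothetical naming scheme)."""
    return [name for name in available if name.startswith(allowed_prefix)]


def route_oracle_verdict(verdict: str) -> Optional[str]:
    """Map an oracle verdict to the follow-up route named in the PR text."""
    if verdict == "RED":
        return "gaps_found"      # only RED is treated as a gap
    if verdict == "INCONCLUSIVE":
        return "expert_needed"   # not sent to gap-closure
    if verdict == "GREEN":
        return None              # no follow-up route is described for GREEN
    raise ValueError(f"unknown oracle verdict: {verdict!r}")
```

For example, `starve_tools(["gpd-compute.evaluate", "shell.exec", "file.read"])` keeps only the first entry, so an agent limited to that list has no way to read files or run shell commands.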

Registration

gpd-compute is a separate public package (psi-oss/gpd-compute) — a no-network sandboxed mpmath/numpy evaluator. Wired here as an optional MCP server via a thin in-repo bridge, gated by module_check so it is inert until installed. Infra descriptor + repo-graph contract regenerated. No compute pyproject extra yet (the package isn't on PyPI; adding it now would make uv.lock unsolvable) — a publish-time follow-up.
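One way an "inert until installed" gate could work is to probe for the module without importing it. The sketch below uses `importlib.util.find_spec` from the standard library. The descriptor shape and the helper names are assumptions, and the real `module_check` logic in this repo may differ.

```python
import importlib.util
from typing import Dict, Iterable, List


def module_available(name: str) -> bool:
    """Return True if the module can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for a missing parent package or a malformed module name.
        return False


def enabled_servers(descriptors: Iterable[Dict[str, str]]) -> List[str]:
    """Keep only servers whose backing module is installed.

    Each descriptor is a hypothetical dict with "name" and "module_check" keys.
    """
    return [d["name"] for d in descriptors if module_available(d["module_check"])]
```

With this shape, registering the optional server costs nothing when the package is absent: the descriptor is simply filtered out.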

Honest scope

On flagship hep-th (interacting-QFT path integrals, GR-tensor canonicalization) an independent numeric evaluator often can't be constructed, so the steady-state verdict there is INCONCLUSIVE — intended, not failure. v1 reliably covers closed-form / integral / special-function / asymptotic / dimensionless quantities. GREEN-rate on real trace problems should be measured before any "revolutionary" framing.

Tests

  • New: tests/core/test_numeric_oracle.py, tests/mcp/test_numeric_oracle_contract_check.py (GREEN/RED/every INCONCLUSIVE path, determinism).
  • gpd-compute repo: 10 tests pass incl. the real subprocess sandbox (network/import escapes blocked, values match, special functions run).
  • Parity baselines updated for the new check, the new agent, and the verifier's oracle hooks (counts, supported-key enumerations, prompt-surface budgets).

CI note

A compare-experiment codex prompt-projection budget was 8 chars over baseline (7708 > 7700) — pre-existing drift on base main, independent of this change (none of the files this PR touches are in that command's projection chain). Bumped the budget to 7750 in projection_budget_support.py so CI is green; no behavior change.

GPD's verifier today runs no executed physics check: run_check is keyword
triage (passes_physics hardcoded False) and the verdict terminates in LLM
self-report. This adds a real, deterministic numbers-vs-numbers check.

Engine (deterministic, zero LLM/oracle calls — it only diffs returned numbers
and content-addresses the verdict):
- core/numeric_oracle.py: NumericOracleComparison + compare_numeric_oracle
  (tri-state GREEN/RED/INCONCLUSIVE, fail-closed) + a content-addressed kernel
  verdict, modeled on reproducibility.py. INCONCLUSIVE lives here; the kernel
  Result stays binary.
- verification_checks.py: new check 5.25 contract.numeric_oracle_agreement.
- contracts.py: additive, optional ContractObservable numeric-oracle fields.
- verification_server.py: run_contract_check branch mapping GREEN->pass,
  RED->fail, INCONCLUSIVE->insufficient_evidence (anti-fabrication: every
  accepted number needs a backing executed-cell hash; proposition-fidelity
  guard rejects agreement on the wrong quantity).

Orchestration:
- New gpd-blind-deriver agent: blinded by tool starvation (gpd-compute MCP
  tools only — no file/shell/web), sees only the problem + ConventionLock.
- gpd-verifier: Level-3 blind oracle sub-protocol + Oracle Gate hook, detailed
  in references/verification/core/blind-oracle-subprotocol.md.
- verify-work workflow: blind+oracle step in inventory-build; oracle verdict
  surfaced in interactive-validation; INCONCLUSIVE routes to expert_needed
  (not gap-closure), RED is a gaps_found issue.

Registration: gpd-compute (separate public package, psi-oss/gpd-compute) wired
as an optional MCP server via a thin in-repo bridge, gated by module_check so
it is inert until installed; infra descriptor + repo-graph contract regenerated.

Tests: numeric_oracle unit + contract-check MCP suites; parity baselines
updated for the new check, agent, and verifier oracle hooks.

coderabbitai Bot commented Jun 4, 2026


Warning

Review limit reached

@cmaloney111, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 2 minutes and 10 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 29e30dca-381e-43ce-a6b5-bda5edd9858a

📥 Commits

Reviewing files that changed from the base of the PR and between 0f41769 and b08384c.

📒 Files selected for processing (28)
  • infra/gpd-compute.json
  • pyproject.toml
  • src/gpd/agents/gpd-blind-deriver.md
  • src/gpd/agents/gpd-verifier.md
  • src/gpd/contracts.py
  • src/gpd/core/config.py
  • src/gpd/core/numeric_oracle.py
  • src/gpd/core/verification_checks.py
  • src/gpd/mcp/builtin_servers.py
  • src/gpd/mcp/servers/compute_bridge.py
  • src/gpd/mcp/servers/verification_server.py
  • src/gpd/specs/references/verification/core/blind-oracle-subprotocol.md
  • src/gpd/specs/templates/plan-contract-schema.md
  • src/gpd/specs/workflows/set-profile.md
  • src/gpd/specs/workflows/verify-work/gap-repair.md
  • src/gpd/specs/workflows/verify-work/interactive-validation.md
  • src/gpd/specs/workflows/verify-work/inventory-build.md
  • tests/README.md
  • tests/adapters/projection_budget_support.py
  • tests/core/test_config.py
  • tests/core/test_numeric_oracle.py
  • tests/core/test_prompt_exactness_budget.py
  • tests/core/test_prompt_surface_diagnostics_budget.py
  • tests/core/test_verifier_prompt_budget.py
  • tests/mcp/test_numeric_oracle_contract_check.py
  • tests/mcp/test_servers.py
  • tests/mcp/test_verification_contract_request_regressions.py
  • tests/repo_graph_contract.json

The codex projection of compare-experiment measures 7708 chars against a 7700
baseline (drift on base main, unrelated to the numeric-oracle change — none of
this branch's files are in that command's projection chain). Nudge the budget
so CI is green.