Executed numeric-oracle verification (blind, numbers-vs-numbers) #258
cmaloney111 wants to merge 2 commits into
GPD's verifier today runs no executed physics check: run_check is keyword triage (passes_physics hardcoded False) and the verdict terminates in LLM self-report. This adds a real, deterministic numbers-vs-numbers check.

Engine (deterministic, zero LLM/oracle calls; it only diffs returned numbers and content-addresses the verdict):
- core/numeric_oracle.py: NumericOracleComparison + compare_numeric_oracle (tri-state GREEN/RED/INCONCLUSIVE, fail-closed) + a content-addressed kernel verdict, modeled on reproducibility.py. INCONCLUSIVE lives here; the kernel Result stays binary.
- verification_checks.py: new check 5.25 contract.numeric_oracle_agreement.
- contracts.py: additive, optional ContractObservable numeric-oracle fields.
- verification_server.py: run_contract_check branch mapping GREEN->pass, RED->fail, INCONCLUSIVE->insufficient_evidence (anti-fabrication: every accepted number needs a backing executed-cell hash; proposition-fidelity guard rejects agreement on the wrong quantity).

Orchestration:
- New gpd-blind-deriver agent: blinded by tool starvation (gpd-compute MCP tools only; no file/shell/web), sees only the problem + ConventionLock.
- gpd-verifier: Level-3 blind oracle sub-protocol + Oracle Gate hook, detailed in references/verification/core/blind-oracle-subprotocol.md.
- verify-work workflow: blind+oracle step in inventory-build; oracle verdict surfaced in interactive-validation; INCONCLUSIVE routes to expert_needed (not gap-closure), RED is a gaps_found issue.

Registration: gpd-compute (separate public package, psi-oss/gpd-compute) wired as an optional MCP server via a thin in-repo bridge, gated by module_check so it is inert until installed; infra descriptor + repo-graph contract regenerated.

Tests: numeric_oracle unit + contract-check MCP suites; parity baselines updated for the new check, agent, and verifier oracle hooks.
The codex projection of compare-experiment measures 7708 chars against a 7700 baseline (drift on base main, unrelated to the numeric-oracle change — none of this branch's files are in that command's projection chain). Nudge the budget so CI is green.
What & why
GPD's verifier is its moat, but a code audit confirmed the core today runs no executed physics check: `run_check` is keyword/regex triage that returns `passes_physics: False`, and the decisive verdict terminates in LLM self-report. This PR adds a real, deterministic numbers-vs-numbers check, so a decisive quantitative claim is verified by comparing an independent blind re-derivation against the claim rather than by trusting prose.
The engine stays deterministic (zero LLM/MCP calls): the agent calls the `gpd-compute` oracle to run both evaluators, and the engine only diffs the returned numbers and content-addresses the verdict.
Engine (deterministic core; ships independently, no external package or model needed)
- `core/numeric_oracle.py`: `NumericOracleComparison` + `compare_numeric_oracle` (tri-state GREEN / RED / INCONCLUSIVE, fail-closed) + a content-addressed kernel verdict, modeled on `reproducibility.py`. INCONCLUSIVE lives here; the shared kernel `Result` stays binary.
- `verification_checks.py`: new check 5.25 `contract.numeric_oracle_agreement`.
- `contracts.py`: additive, optional `ContractObservable` numeric-oracle fields (backward-compatible).
- `verification_server.py`: `run_contract_check` branch mapping GREEN→pass, RED→fail, INCONCLUSIVE→insufficient_evidence. Anti-fabrication (every accepted number needs a backing executed-cell hash) + proposition-fidelity guard (agreement on the wrong quantity cannot pass GREEN).
Orchestration
- `gpd-blind-deriver` agent: blinded by tool starvation (only `gpd-compute` MCP tools; no file/shell/web), sees only the problem statement + ConventionLock, never the answer.
- `gpd-verifier`: Level-3 blind oracle sub-protocol + Oracle Gate hook; full protocol in `references/verification/core/blind-oracle-subprotocol.md` (kept out of the verifier hotspot prompt).
- `verify-work`: blind+oracle step in `inventory-build`; verdict surfaced in `interactive-validation`; INCONCLUSIVE routes to `expert_needed` (not gap-closure), only RED is a `gaps_found` issue.
Registration
`gpd-compute` is a separate public package (psi-oss/gpd-compute): a no-network, sandboxed mpmath/numpy evaluator. It is wired here as an optional MCP server via a thin in-repo bridge, gated by `module_check` so it is inert until installed. Infra descriptor + repo-graph contract regenerated. There is no `compute` pyproject extra yet (the package isn't on PyPI; adding it now would make `uv.lock` unsolvable); that is a publish-time follow-up.
Honest scope
On flagship hep-th (interacting-QFT path integrals, GR-tensor canonicalization) an independent numeric evaluator often can't be constructed, so the steady-state verdict there is INCONCLUSIVE — intended, not failure. v1 reliably covers closed-form / integral / special-function / asymptotic / dimensionless quantities. GREEN-rate on real trace problems should be measured before any "revolutionary" framing.
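The tri-state, fail-closed comparison and the content-addressed verdict described under Engine can be illustrated with a minimal sketch. This is not the code in `core/numeric_oracle.py`: the `Evaluation` shape, the `compare` and `_decide` names, the relative tolerance, and the choice to send a wrong-quantity match to INCONCLUSIVE are all assumptions made for illustration. Only the verdict names and the verdict-to-status mapping come from the description above.

```python
import hashlib
import json
import math
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    GREEN = "GREEN"
    RED = "RED"
    INCONCLUSIVE = "INCONCLUSIVE"


# Verdict-to-status mapping as stated in the PR description.
STATUS_FOR_VERDICT = {
    Verdict.GREEN: "pass",
    Verdict.RED: "fail",
    Verdict.INCONCLUSIVE: "insufficient_evidence",
}


@dataclass(frozen=True)
class Evaluation:
    """One executed evaluation (hypothetical shape, not the real schema)."""
    quantity_id: str           # which quantity this number claims to be
    value: Optional[float]     # None when no number could be produced
    cell_hash: Optional[str]   # hash of the executed cell backing the number


def _decide(claim: Evaluation, blind: Evaluation,
            expected_quantity: str, rel_tol: float) -> Verdict:
    for ev in (claim, blind):
        # Anti-fabrication: a number without a backing executed-cell hash
        # is never accepted, so the check fails closed.
        if ev.value is None or not ev.cell_hash:
            return Verdict.INCONCLUSIVE
        if not math.isfinite(ev.value):
            return Verdict.INCONCLUSIVE
        # Proposition fidelity: agreement on the wrong quantity must not
        # pass GREEN (assumed here to be INCONCLUSIVE).
        if ev.quantity_id != expected_quantity:
            return Verdict.INCONCLUSIVE
    if math.isclose(claim.value, blind.value, rel_tol=rel_tol, abs_tol=0.0):
        return Verdict.GREEN
    return Verdict.RED


def compare(claim: Evaluation, blind: Evaluation,
            expected_quantity: str, rel_tol: float = 1e-9) -> dict:
    """Diff two already-computed numbers; no model or oracle call happens here."""
    verdict = _decide(claim, blind, expected_quantity, rel_tol)
    payload = {
        "expected_quantity": expected_quantity,
        "claim": [claim.quantity_id, claim.value, claim.cell_hash],
        "blind": [blind.quantity_id, blind.value, blind.cell_hash],
        "rel_tol": rel_tol,
        "verdict": verdict.value,
    }
    # Content addressing: hash a canonical JSON form so identical inputs
    # always produce the same verdict id.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    verdict_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**payload, "status": STATUS_FOR_VERDICT[verdict],
            "verdict_id": verdict_id}
```

Under these assumptions, the comparison function never evaluates anything itself; it only receives numbers, which is what keeps it deterministic. Any missing or unbacked input lands on INCONCLUSIVE rather than RED, matching the routing to `expert_needed` instead of `gaps_found`.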
Tests
- `tests/core/test_numeric_oracle.py`, `tests/mcp/test_numeric_oracle_contract_check.py` (GREEN/RED/every INCONCLUSIVE path, determinism).
- `gpd-compute` repo: 10 tests pass incl. the real subprocess sandbox (network/import escapes blocked, values match, special functions run).
CI note
A `compare-experiment` codex prompt-projection budget was 8 chars over baseline (7708 > 7700): pre-existing drift on base main, independent of this change (none of the files this PR touches are in that command's projection chain). Bumped the budget to 7750 in `projection_budget_support.py` so CI is green; no behavior change.
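The budget arithmetic can be checked with a small sketch. The helper name `budget_headroom` is hypothetical and is not taken from `projection_budget_support.py`; only the three numbers come from the note above.

```python
def budget_headroom(measured_chars: int, budget_chars: int) -> int:
    """Return budget minus measured size; a negative result means over budget."""
    return budget_chars - measured_chars


# Numbers from the CI note: 7708 measured, old budget 7700, new budget 7750.
old_headroom = budget_headroom(7708, 7700)   # -8, i.e. 8 chars over
new_headroom = budget_headroom(7708, 7750)   # 42 chars of slack
```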