diff --git a/.github/.claude/rules/proteus-code-review.md b/.github/.claude/rules/proteus-code-review.md new file mode 100644 index 000000000..76079a051 --- /dev/null +++ b/.github/.claude/rules/proteus-code-review.md @@ -0,0 +1,106 @@ +--- +description: PROTEUS-specific code review criteria for the generator-evaluator pattern. Applies domain expertise (physics, coupling, pitfall patterns) to all code review in this repo. +--- + +# PROTEUS Code Review Criteria + +When reviewing PROTEUS code (either your own or via code-reviewer agents), apply these domain-specific checks in addition to standard code quality review. + +> **Discovery note.** PROTEUS keeps its Claude-Code rule files under `.github/.claude/rules/` (not the conventional repo-root `.claude/`) so they can be tracked in git and shared across collaborators. Claude does NOT auto-discover them at this path; the repo-root `CLAUDE.md` (symlinked to `.github/copilot-instructions.md`) names this file and `proteus-tests.md` explicitly. **Before opening any review pass, read both this file and `proteus-tests.md`.** + +## Physics plausibility + +- Temperature must be positive everywhere (Kelvin). Flag any code path where T could reach zero or go negative. +- Pressure must be positive and monotonically increasing with depth in interior profiles. +- Mass fractions must sum to 1.0. Flag any volatile partitioning code that doesn't enforce or verify normalization. +- Escape rates must not exceed total atmospheric mass. Flag unbounded escape calculations. +- Outgassing rates must be non-negative. +- Energy fluxes at module boundaries (atmosphere-interior, interior-core) must be consistent. If two modules independently compute the same flux, verify they agree. +- Stefan-Boltzmann: F = sigma * T^4. When reviewing radiative flux code, check the exponent is 4, not 3 or 5. + +## Unit convention boundaries + +PROTEUS has a split unit convention: +- **Config values**: "human" units (M_sun, bar, Gyr, K) +- **Internal hf_row values**: SI-ish (kg, Pa, yr, K) +- **Submodule APIs**: may expect either convention + +When reviewing code that passes values between config, hf_row, and submodule calls, verify the unit is correct at each boundary. The ZEPHYRUS stellar-mass bug (audit 1.1) was exactly this class of error. + +## Config mutability + +The `Config` attrs object must not be mutated at runtime. Flag any code that sets `config.X.Y = value` outside of config initialization. Use local variables instead. Known violation: Zalmoxis sets `config.orbit.module = 'dummy'`; this is a known debt, not a pattern to replicate. + +## Coupling parameter echo-back + +When module A computes a quantity self-consistently (e.g., Zalmoxis computes core mass from EOS) and module B has its own internal model for the same quantity (e.g., SPIDER's `-rho_core`), module B's output can overwrite A's value in hf_row. Review any new submodule integration for this pattern. + +## Whole-element aggregation symmetry (issue #677 lesson) + +When reviewing code that aggregates element masses, all four sites of the cycle must use the same element set: + +1. Initial-budget population (`calc_target_elemental_inventories`, `_resolve_oxygen_budget`) +2. M_planet bookkeeping (`update_planet_mass`) +3. Structure dry-mass target (`load_zalmoxis_configuration`) +4. Escape rate distribution (`calc_unfract_fluxes`, `calc_new_elements`) +5. Desiccation gate (`check_desiccation`) +6. First-call baseline (`M_vol_initial` in `run_escape`) + +Issue #677 surfaced when one site (M_atm via `gas_list` sum) implicitly included oxygen mass while every other site (M_ele via `element_list` with `if e == 'O': continue`) excluded it. The fix is to make the element set explicit and consistent across all aggregation sites. A new `if e == 'O': continue` skip in any of these sites is a red flag during review; it likely re-introduces the asymmetry. + +The runtime invariant `assert_mass_conservation(hf_row)` is called at the end of every iteration to hard-fail on a regression. If a review finds someone has weakened or removed that assertion, push back: it's the safety net that catches future O-skip reintroductions. + +## IC consistency checks at unit boundaries + +When a user supplies a value via config that gets re-derived by a downstream solver (e.g., O_budget from `planet.elements.O_mode` vs CALLIOPE's IC equilibrium), a one-shot reconciliation check at IC catches mis-specifications loudly rather than letting them silently corrupt the trajectory. Pattern from issue #677: + +1. Stash the user-supplied value in `hf_row` under a sentinel-style key (e.g., `O_kg_user_ic`). +2. After the first solver call, compare the solver-derived value against the user budget. +3. Hard-fail if relative divergence exceeds a threshold (50% for O; threshold can be tuned per case). +4. Flip the sentinel so subsequent init-stage calls don't re-fire the check. + +Applies to any future user-specified quantity that has a solver-derived equivalent. Examples worth retro-fitting: T_magma_init vs SPIDER's IC entropy, fO2_shift_IW vs the atmospheric chemistry it implies, surface gravity vs Zalmoxis's structure output. + +## hf_row temporary overrides + +When overriding hf_row values to pass different boundary conditions to a submodule, require a save/restore pattern: +```python +saved = {k: hf_row[k] for k in keys_to_override} +try: + hf_row[k] = override_value + result = call_submodule(hf_row) +finally: + hf_row.update(saved) +``` +Without restore, the helpfile CSV records override values instead of true planet state. + +## Cross-module constant duplication + +Physical constants (G, year length, solar mass, Stefan-Boltzmann) are defined independently in PROTEUS, CALLIOPE, ZEPHYRUS, and other submodules. When reviewing code that uses physical constants, check which definition is used and whether it matches the expected value. + +## PALEOS / EOS tables + +- SPIDER needs P-S tables (phase-specific S ranges, complete rectangles, uniform P spacing). +- Aragog needs P-T tables (full rectangular grid, identical for solid and melt). +- Phase-filtering P-T tables for Aragog creates irregular grids that cause scipy to use slow unstructured interpolation. Flag any table generation that filters by phase before writing Aragog tables. + +## Interior-atmosphere coupling timing + +The main loop advances Time before the atmosphere step runs. Any function comparing hf_row (current, time-advanced) with hf_all.iloc[-1] (previous) must get argument ordering right. + +## Validator liveness + +attrs validators can silently become dead code if they compare a dataclass instance against a primitive (e.g., `StopEscape is False`). When reviewing validators, check that both valid and invalid inputs are tested. + +## Test marker discipline + +Every test file must begin with a module-level `pytestmark = [pytest.mark., pytest.mark.timeout()]` (unit/30 s, smoke/60 s, integration/300 s, slow/3600 s). Per-function markers are additive but do not replace the module-level marker; CI runs `pytest -m "unit and not skip"` and any file missing the tier marker ships untested. + +## Test quality (cross-reference) + +Test-content rules (anti-happy-path, discriminating-value guards, physics-invariant tiering, `physics_invariant` / `reference_pinned` certification markers, adversarial-review trigger, mocking discipline, `importorskip` + module-constant-monkeypatch traps) live in [`proteus-tests.md`](proteus-tests.md). When reviewing tests, apply both files: this one for marker discipline and review-pass gate, the deep-dive for the content contract. + +## Sister rules (cross-link) + +- [`.github/copilot-instructions.md`](../../copilot-instructions.md) "Testing Standards" -- high-level rules visible to all readers. Repo-root `CLAUDE.md` is a symlink to this file. +- [`proteus-tests.md`](proteus-tests.md) -- test quality deep-dive; the canonical source for anti-happy-path patterns and the validation certification markers. diff --git a/.github/.claude/rules/proteus-tests.md b/.github/.claude/rules/proteus-tests.md new file mode 100644 index 000000000..522e20376 --- /dev/null +++ b/.github/.claude/rules/proteus-tests.md @@ -0,0 +1,401 @@ +--- +description: PROTEUS test quality deep-dive. Anti-happy-path patterns, discriminating-value guards, physics-invariant tiering, validation certification markers, adversarial-review trigger. Extends the Testing Standards section in `.github/copilot-instructions.md`. +--- + +# PROTEUS Test Quality Rules + +This file is the canonical deep-dive on test quality. The high-level summary lives in [`.github/copilot-instructions.md`](../../copilot-instructions.md) under "Testing Standards". The two files MUST stay in sync. If you change one, mirror the change in the other. + +> **Discovery note.** PROTEUS keeps its Claude-Code rule files under `.github/.claude/rules/` (not the conventional repo-root `.claude/`) so they can be tracked in git and shared across collaborators. Claude does NOT auto-discover them at this path; the repo-root `CLAUDE.md` (symlinked to `.github/copilot-instructions.md`) names this file and `proteus-code-review.md` explicitly so AI tooling and human readers know to load them. **When opening or editing any file under `tests/**` or `src/proteus/**`, read this file first.** + +Sister rule files: + +- [`.github/copilot-instructions.md`](../../copilot-instructions.md): high-level rules, applied repo-wide. +- [`.github/.claude/rules/proteus-code-review.md`](proteus-code-review.md): review-pass gate, domain-aware code review (physics plausibility, unit boundaries, hf_row override pattern, etc.). Test-marker discipline lives there too. + +PROTEUS is scientific simulation code and the test suite is held to physics-grade rigor. Tests exist to catch real bugs. A test that asserts the wrong thing, or that passes for the wrong reason, is worse than no test because it generates false confidence. The rules below codify what "real test" means here. + +--- + +## 1. Anti-happy-path rules (every new test) + +Every new test function MUST include: + +1. **At least one edge case**: a boundary value (Phi = 0 or 1, e = 0, T = T_solidus), an empty input, or an extreme physical parameter. +2. **At least one path that exercises the error contract**: + - If the function under test has documented validation (raises on negative T, refuses to dispatch with `module = None`, etc.), test that the error fires AND that no side effect ran. + - If the function has no validation (closed-form mathematics: orbital mechanics, thermodynamic relations), exercise the **limit-input behavior** (e = 0 is a fixed point, Imk2 = 0 leaves state unchanged) and assert the corresponding mathematical invariant. + - "No validation in source therefore no error test" is not an exemption; the limit-input substitute is. +3. **Assertion values NOT trivially derivable from the implementation**: discriminating numeric pins (see Section 2 below) or property-based assertions (monotonicity, conservation, symmetry, boundedness). + +### Forbidden patterns + +These are flagged by `tools/check_test_quality.py` and rejected at PR time. + +- **Single-assert test functions**. Two or more assertions per test; the second usually pins the invariant the first hand-waves over. Exception: a single assertion of a hard-fail invariant (mass closure within `1e-12`) is acceptable if the test is the only test of that invariant in the file. +- **Weak assertions when they stand alone as the sole meaningful check in the test.** The shapes are: + - `assert result is not None` + - `assert result > 0` + - `assert len(result) > 0` + - `assert isinstance(result, dict)` + - `assert result is None` where the function returns `None` implicitly + + Required carve-out: the three-class discrimination guard (Section 2) uses `assert val > 0` as the sign-error guard and `assert lo < val < hi` as the scale-error guard alongside a primary `pytest.approx(...)` pin. Those secondary lines look like weak assertions in isolation; they are NOT flagged when paired with a stronger primary assertion in the same test. The linter applies the carve-out automatically: weak shapes are flagged only when the test has exactly one `assert` statement (`len(asserts) == 1`) and that assertion is itself the weak shape. +- **Tests with no function-level docstring**. The docstring states which physical scenario or contract clause is being verified. +- **`==` adjacent to a float literal**. Use `pytest.approx(val, rel=...)` or `np.testing.assert_allclose(actual, expected, rtol=..., atol=...)`. Comparing two floats with `==` is a known flake source even for "exact" identities like 0.0 (-0.0 vs +0.0, NaN propagation). +- **Tests asserting on a fixture's implicit default**: e.g. `assert fixture_returning_none() is None`. This is trivially true. Delete the test; do not strengthen it by adding more `is None` assertions. + +--- + +## 2. Discriminating test values + +The test contract is: a regression that introduces a plausible bug must fail the test. "Plausible bug" means off-by-one exponent, wrong sign, swapped factor of 2, missing factor of pi, dimensionally-wrong unit. Pick input values where the wrong-formula result is far from the correct one. + +### Bad / good examples + +| Pattern | Bad (any-exponent-passes) | Good (discriminates) | +|---|---|---| +| `F = sigma * T**4` | Test at `T = 1` (any power of 1 is 1) | Test at `T = 300` and `T = 1500`; pin the ratio | +| Opacity interpolation | Test at grid nodes (interpolation is identity there) | Test at off-grid points where bilinear vs nearest-neighbor differ | +| Mass-fraction normalization | All species equal (1/N each, symmetric so order doesn't matter) | Asymmetric composition (one dominant species + traces) | +| Kepler period | `a = 1` (sma**1.5 = sma) | `a = 2` and `a = 1`; assert ratio = 2**1.5, not 2 or 8 | + +### Discrimination guard (REQUIRED for pinned-value tests) + +When a test pins a numeric value, include explicit assertions that the wrong-formula result would differ from the correct one for **each plausible bug class**. "Each plausible bug class" means at minimum: + +1. **Exponent or factor error** (off-by-one exponent, missing factor of 2 / pi). `abs(val - wrong_value)` discriminates. +2. **Sign error** (`-x` vs `+x`). `abs()` hides this; assert the sign explicitly with `val > 0` or `val < 0`. +3. **Unit-conversion error** (Pa vs bar, AU vs m). Pin the absolute scale with the unit named in the comment. + +**Carve-out for conservation-style invariants.** When the primary assertion IS a conservation closure (mass closure, energy balance, sum-equals-total), the equality form `sum(parts) == pytest.approx(total)` already discriminates exponent / factor errors by construction: any prefactor bug breaks closure at the same order as the discriminating term. In that case the exponent guard is satisfied by the conservation equality itself; sign and scale guards remain mandatory. + +Canonical pattern: + +```python +def test_de_dt_matches_closed_form_value_for_unit_params(): + val = de_dt(a=2.0, e=0.5, params=_UNIT_PARAMS) + expected = (21.0 / 2.0) * 0.5 / (2.0**6.5) # dimensionless, e-fraction per time-unit + assert val == pytest.approx(expected, rel=1e-12) + # Exponent-error guard: a regression to a**5 lands at 0.164, not 0.058. + wrong_a5 = (21.0 / 2.0) * 0.5 / (2.0**5) + assert abs(val - wrong_a5) > 0.05 + # Sign guard: positive Imk2 produces a positive RHS magnitude under the + # current sign convention. A flipped sign would fail this. + assert val > 0 + # Scale guard: order of magnitude is ~5e-2, not ~5e-5 (kg-vs-g conversion bug) + # or ~5e+2 (forgotten division). Pin the magnitude. + assert 1e-3 < val < 1.0 +``` + +The guard lines are mandatory whenever the test's primary assertion is a `pytest.approx` against a hand-calculated value. Property-based assertions (monotonicity, conservation, symmetry) do not need a separate guard because they are already discriminating across the input space; their own form is the guard. + +--- + +## 3. Physics-invariant assertions (tiered) + +### When required + +Every unit test on a **physics module** must assert at least one of the four invariants below. Physics modules are: + +``` +src/proteus/interior_struct/* +src/proteus/interior_energetics/* +src/proteus/atmos_clim/* +src/proteus/atmos_chem/* +src/proteus/escape/* +src/proteus/outgas/* +src/proteus/orbit/* +src/proteus/star/* +src/proteus/observe/* +src/proteus/inference/objective.py +src/proteus/inference/BO.py +src/proteus/inference/async_BO.py +``` + +Bayesian-optimization tests are physics-required because the objective evaluated by BO is the physics simulator itself: BO convergence is the property the simulator is being optimized against, and weak assertions on the BO loop have already shipped on this branch (the centered-target dead-test caught during the second review pass). + +Helpers, common code, and shared utilities under a physics directory (e.g. `interior_energetics/common.py`, `escape/common.py`, `outgas/common.py`) are STILL physics-required because their callers are physics. The carve-out is by source file purpose, not by filename: a pure-Python data plumbing helper in `outgas/_recompute_atm_mmw` is physics-required if its output is consumed by physics; only if a helper is purely structural plumbing (logging, path resolution, type coercion with no physical quantity) does the utility exemption apply. + +Utility modules are exempt from the physics-invariant requirement but still subject to all anti-happy-path rules: + +``` +src/proteus/utils/* +src/proteus/config/* +src/proteus/plot/* +src/proteus/cli.py +src/proteus/inference/utils.py +src/proteus/inference/{gen_D_init,plot}.py +src/proteus/grid/* +src/proteus/tools/* (when present) +``` + +### The four invariant families + +1. **Conservation** + - Mass closure: `M_atm + M_mantle + M_core <= M_planet`; per-species `species_kg_atm + species_kg_liquid + species_kg_solid ≈ species_kg_total`. + - Energy balance: LHS = RHS within stated tolerance for the energy ODE right-hand side. + - Angular-momentum conservation: total system AM constant when no external torque (satellite tests, tidal evolution). +2. **Positivity / boundedness** + - `T > 0` Kelvin everywhere, `P > 0` Pa everywhere. + - Mass fractions in `[0, 1]`; volume mixing ratios in `[0, 1]`. + - Outgassing rates non-negative; escape rate <= atmospheric mass at the current step. + - Melt fraction in `[0, 1]`; phi `<=` 1 at all depths. +3. **Monotonicity or symmetry** + - Pressure increases with depth in interior profiles. + - Density increases with pressure at fixed entropy. + - Reversing time integration recovers the IC (time-symmetry of conservative ODEs). + - Doubling stellar mass while fixing semimajor axis shortens the orbital period by `sqrt(2)`. +4. **Pinned numeric value with a discrimination guard**: see Section 2. This is acceptable as the sole invariant when a closed-form result is the contract. + +Property-based assertions (monotonicity, conservation, symmetry, boundedness) are preferred over point-value pins when both are possible. They hold for any valid input and so catch bugs across the entire input space, not just at the chosen test point. + +### Validation certification markers + +Two markers track validation quality independently of line coverage: + +- **`@pytest.mark.physics_invariant`** -- this test asserts at least one of the four invariants. Tag every qualifying test in a physics module. `tools/check_test_quality.py` warns when a physics-module test asserts no invariant and is not tagged. +- **`@pytest.mark.reference_pinned`** -- this test pins behavior against a **published benchmark** (paper, figure, table; cite explicitly in the test docstring), an **analytical limit** (Stefan-Boltzmann black-body limit, hydrostatic equilibrium, isentropic Kepler relations), or a **cross-implementation cross-check** (SPIDER vs Aragog at the same IC, CALLIOPE vs atmodeller at the same fO2 shift). + - **Per-source-file**: each source file in a physics-module directory whose public API computes or returns a physical quantity must have at least one `reference_pinned` test against it. Granularity is per source file, not per directory: `interior_energetics/aragog.py` and `interior_energetics/spider.py` each need their own pinned test, even though they live in the same directory. + - **Tracking**: pinned tests are inventoried in `docs/Validation//.md`, one markdown page per source file. The page records: the source under test, the reference cited, the test ids carrying the marker, and the date of last comparison against the source. + - **Audit**: `tools/check_test_quality.py --reference-pinned-audit` currently reports per-directory coverage; per-file granularity is enforced by manual cross-reference against `docs/Validation/`. Extending the audit to per-file resolution is a follow-up. + +Both markers are registered in `pyproject.toml` under `[tool.pytest.ini_options] markers`. They do not gate CI on their own; their coverage is a separate KPI surfaced in the PR summary comment. + +--- + +## 4. Mocking discipline + +- Default to `unittest.mock` for ALL external calls in unit tests: SOCRATES, AGNI, SPIDER, Aragog solver, Zalmoxis solver, file I/O, HTTP, subprocess. +- Mock at the **narrowest scope**: patch the specific function (`unittest.mock.patch('proteus.foo.calc_X')`), not the whole module. +- A mocked physics function MUST return **physically plausible** values. A mock that returns `0.0` or `1.0` for everything will mask sign / clamp / fallback bugs. +- NEVER mock the function under test. If you're tempted to, the test is asking the wrong question. +- Smoke tests use real binaries; integration tests use real submodules. The rules in this file still apply to those tiers, but the mocking discipline is relaxed because the real call is the contract. + +--- + +## 5. Optional-dependency imports + +Any test that imports an optional dependency MUST call `pytest.importorskip` at module top so Docker `--no-deps` runs do not fail collection: + +```python +import pytest + +hypothesis = pytest.importorskip('hypothesis') +# ... or for a module-level helper that requires the dep: +pytest.importorskip('boreas') +``` + +Optional deps that have hit this trap on `tl/interior-refactor`: `hypothesis` (three times), `boreas`, `atmodeller`, `lovepy`, `mors`, `vulcan`, `zalmoxis` (when not installed via editable). + +The lint script (`tools/check_test_quality.py`) enforces this. Rule key `missing_importorskip`: any module-top `import ` or `from import ...` that is not preceded by a module-scope `pytest.importorskip('')` is flagged. The check covers `hypothesis`, `boreas`, `atmodeller`, `lovepy`, `mors`, `vulcan`, `zalmoxis`. + +--- + +## 6. Module-level constants and `monkeypatch` + +When the source under test reads an env var into a **module-level constant** at import time, `monkeypatch.setenv` alone is not sufficient. The constant is frozen at the original import. + +Trap from `tl/interior-refactor`: + +```python +# Source: src/proteus/utils/data.py +FWL_DATA_DIR = Path(os.environ.get('FWL_DATA', '...')) # frozen at import +``` + +```python +# Test (wrong): +monkeypatch.setenv('FWL_DATA', str(tmp_path)) # too late; constant already cached + +# Test (right): +monkeypatch.setenv('FWL_DATA', str(tmp_path)) # for downstream code that re-reads +monkeypatch.setattr('proteus.utils.data.FWL_DATA_DIR', tmp_path, raising=False) +``` + +When in doubt, do both. The lint script does NOT currently flag this pattern (it would require source-side analysis to know which constants are env-derived); this is a discipline rule enforced via the >50 LOC review trigger and the recurring-trap table in section 14. + +--- + +## 7. Marker discipline and timeouts + +### Module-level marker is mandatory + +Every test file MUST begin with: + +```python +import pytest + +pytestmark = [pytest.mark., pytest.mark.timeout()] +``` + +with budgets: + +- `unit` -> `timeout(30)` (target wall-time per test is `< 100 ms`; the 30 s cap is a defensive net). +- `smoke` -> `timeout(60)` (target `< 30 s`). +- `integration` -> `timeout(300)`. +- `slow` -> `timeout(3600)`. + +CI runs `pytest -m "unit and not skip"`. Tests without the tier marker are invisible to CI and shipped untested. The lint script blocks any file missing the module-level `pytestmark`. + +### Per-function markers + +Per-function `@pytest.mark.` markers are **additive**, not a replacement for the module-level marker. They are useful when a file's tests span multiple tiers (rare; prefer separate files). + +### Timeout is a safety net, not a target + +The `timeout` ceiling exists so a future regression that introduces a hang (real solver call, infinite loop, network retry) surfaces as a specific-test failure rather than a generic job timeout. Current test wall times are 100x below the ceiling; if you find yourself needing the full 30 s for a unit test, something has gone wrong and you should reduce scope or move the test to a slower tier. + +--- + +## 8. Float and numerical comparison + +- NEVER use `==` for floats. Use `pytest.approx(val, rel=1e-5)` or `np.testing.assert_allclose(actual, expected, rtol=..., atol=...)`. +- State the tolerance rationale in a comment when the choice is non-obvious. E.g. "`rtol=1e-3` because the Cp lookup truncates entries to 4 sig fig". +- For pinned numeric values, include a **discrimination guard** (Section 2). +- For property-based assertions, use `pytest.approx` against the exact symbolic relation, with the tightest tolerance the implementation can hit (typically `rel=1e-12` for closed-form algebra; looser for ODE integrations). + +--- + +## 9. Voice rule for test artifacts + +The repo-wide voice rule (zero AI-process disclosure in any public artifact) applies to test code with the same strictness as to source. The voice rule is **scoped** to public artifacts other contributors and external readers see; it does NOT apply to the rule documents themselves (this file, `proteus-code-review.md`, `copilot-instructions.md`), which legitimately name the procedures they define. + +In scope (the voice rule is BANNED here): + +- Test-skip reasons (`@pytest.mark.skip(reason='...')`). +- Test-file docstrings. +- Test-function and test-class names. +- Test-function docstrings. +- Parametrize ids (`@pytest.mark.parametrize('name', [...], ids=[...])`). +- Log-capture assertions (regex against `caplog.records`). +- Commit messages on test-touching commits (subject AND body). +- **Pull-request titles and bodies on test-touching PRs** (`gh pr create --title ... --body ...`, edits via the GitHub UI). +- GitHub Actions job names and step names that ship to the PR Checks tab. +- Inline source comments and docstrings on `src/proteus/**`. +- Log strings that ship with the repo (anything that ends up in `proteus_00.log` or a shipped artifact). + +Out of scope (these may NAME the procedures they define): + +- This file (`proteus-tests.md`). +- `proteus-code-review.md`. +- `copilot-instructions.md`. +- Any documentation file whose CONTENT describes the rule infrastructure (`docs/How-to/test_*.md`, `docs/How-to/ai_usage.md`, validation pages under `docs/Validation/`, future rule-meta docs). The scope test is **what the prose is about**, not what the path is. A doc that explains how the testing rules work may name the testing rules; a doc that explains user-facing simulator behavior must keep the voice rule. + +Banned phrases inside the in-scope artifacts: "audit", "review pass", "adversarial review", "Phase X" (when "X" is an AI-organized roadmap label, not a real project phase), "T1.x", "Group A/B/C/D" (when AI-organized work groups), `claude-config/...` paths, "Generated with Claude", AI-tool names, em-dashes, en-dashes (except in bibliographic page ranges within citations), process meta-commentary ("after careful analysis"). + +Write the OUTCOME (what the test verifies; what the PR achieves) never the PROCESS (how the rule was derived; which review caught what). First-person Tim voice. Going-forward only, no history rewrite. + +--- + +## 10. Fixture and parameter conventions + +- Use SI units in test parameters unless the function under test explicitly expects config units (M_sun, bar, Gyr, K). +- Use the conftest parameter classes (`EarthLikeParams`, `UltraHotSuperEarthParams`, `IntermediateSuperEarthParams`) for realistic test scenarios. They are session-scoped and cheap. +- Use `@pytest.mark.parametrize` when the same logic spans multiple physical regimes (Earth-like, super-Earth, sub-Neptune). Each parametrize id must read like a physical scenario, not a tuple of numbers. +- Set seeds for any randomness: + ```python + np.random.seed(42) + torch.manual_seed(42) + random.seed(42) + ``` + All three must be seeded if all three are used. Today's BO convergence regression confirmed: seeding only `torch.manual_seed` is not enough. +- Use `tmp_path` (pytest fixture) for temporary files. Do not produce large outputs in the test path. + +--- + +## 11. Documentation per test + +- **File-level docstring**: name the module under test, list the invariants and contract clauses the file exercises, link to the three test docs. Required. +- **Function-level docstring**: state the physical scenario or contract clause in plain language. Required (lint-enforced). +- **Inline comments**: explain **why** a specific input range was chosen ("T=300 K and T=1500 K so the T**3 vs T**4 difference is resolved well above tolerance"). + +--- + +## 12. Naming + +- Test names describe behavior, not the called function: `test_opacity_monotonic_with_temperature`, NOT `test_get_opacity`. +- Test names use snake_case and read as full sentences. +- Group related tests in classes (`class TestOpacity:`) when they share setup; use the class to thread a single fixture through several scenarios. + +--- + +## 13. Adversarial review trigger + +A pull request that adds or substantially modifies **> 50 lines of test code across all its commits** triggers an independent review pass before merge. This is a discipline rule, not CI-automated: the author runs the review pass via a `code-reviewer` agent before pushing the final test-touching commit. The denominator is PR-level, not per-commit: `git diff origin/main...HEAD -- 'tests/**'` is the source of truth. Splitting one large change into 49 + 49 + 49 line commits does NOT dodge the trigger. + +The reviewer's mandate: + +- Cite the anti-happy-path rule (Section 1) and the discrimination-guard requirement (Section 2). +- Flag single-assert tests, weak `is not None` patterns, missing module-level marker, missing `physics_invariant` tag on a physics-module test, missing `reference_pinned` tag on a per-module benchmark test, dead tests (passes for the wrong reason), tests that mock the function under test. +- Verify discriminating values: re-derive the expected value from a plausible wrong formula and assert the test fails with that wrong formula. +- Verify physics module coverage of the four invariants: which of the four does this test exercise? If none, why is the test in `tests//`? + +The reviewer is a separate process from the test author. For Claude-Code workflow this means spawning a `proteus-review` skill or a `code-reviewer` agent with the test files in scope; the review must complete and surface findings before the test commit is pushed. + +The review's findings are addressed in a follow-up commit (not amended into the test commit). The follow-up subject line is in plain language describing the OUTCOME ("sharpen orbital period assertions to distinguish a**1.5 from a**1.0", NOT "address review findings"). + +--- + +## 14. Tooling + +The repo provides: + +- `bash tools/validate_test_structure.sh` -- structural check (`tests/` mirrors `src/proteus/`). +- `python tools/check_test_quality.py --check` -- CI mode: AST scan for the forbidden patterns in Section 1 and the marker requirement in Section 7. Fails the PR if NEW violations exceed the baseline. +- `python tools/check_test_quality.py --baseline` -- after a deliberate sweep, regenerates `tools/test_quality_baseline.json`. Only run when you have intentionally reduced violations. +- `python tools/check_test_quality.py --reference-pinned-audit` -- prints physics modules missing a `reference_pinned` test. +- `bash tools/coverage_analysis.sh` -- coverage by module, sorted by gap. +- `ruff check src/ tests/` and `ruff format src/ tests/` -- run before commit. + +The lint script is wired into PR CI (`ci-pr-checks.yml`). The step currently runs with `continue-on-error: true` while the legacy baseline is being swept; the gate becomes blocking by removing that flag once the baseline approaches zero. Even today the script writes a regression-status table to the workflow log, so a contributor adding a new violation sees the failure inline. + +--- + +## 15. Coverage strategy (operator's view) + +PROTEUS uses two coverage gates with explicit sub-targets. The fast gate is for PR cycle time; the estimated total is the real KPI. + +| Gate | Tests | Target | When | +|---|---|---|---| +| Fast gate (`tool.proteus.coverage_fast`) | unit + smoke | ratcheting toward **90%** (expected plateau around 60-75% because wrapper code requires real binaries) | Every PR | +| Estimated total (PR union with nightly artifact) | unit + smoke + integration | **90%** (PROTEUS-ecosystem ceiling) | Every PR | +| Full gate (`tool.coverage.report`) | unit + smoke + integration + slow | **90%** | Nightly | +| Diff-cover | changed lines | 80% (hard-coded) | Every PR | + +Unit tests alone are not expected to reach 90%. Wrapper code that requires real binaries (SOCRATES, AGNI, SPIDER) is covered by smoke / integration / slow tests in nightly; the nightly artifact is downloaded into PR runs and unioned with the PR's own coverage to estimate the total. + +What this means for adding tests: + +- A new function in a physics module that wraps a real binary: write a unit test with mocks (counts toward fast gate), AND a smoke or integration test with the real binary (counts toward estimated total / full gate). +- A new closed-form helper in a utility module: a unit test is sufficient. +- A new orchestration function in `proteus.py` or `cli.py`: a unit test for argument parsing and dispatch, plus an integration test for the actual call path. + +The ratchet is one-way (`tools/update_coverage_threshold.py`), capped at 90%. Never manually decrease the threshold. + +--- + +## 16. Failure modes to recognize on review + +These are real patterns that have shipped in the past. The lint script catches some of them mechanically; reviewers catch the rest. + +| Pattern | Example | Why it slipped | Fix | +|---|---|---|---| +| Silent skip in helper | `def _enum_for(field): ...; if actual is None: continue` masks broken introspection | Helper hides a real failure as a no-op | Hard assertion: `assert actual is not None, ...` | +| Centered-target convergence | BO test asserts `(initial_best - final_best) / initial_best > 0.5` while the target is `(0.5, 0.5)`; a regression returning constant `(0.5, 0.5)` passes | The "improvement" metric is dominated by where the target sits | Add a proximity check: `||x_best - target|| < 0.5 * initial_min_dist` | +| Log-line-only assertion | Test captures a log line and asserts on its text; a regression that changes the code path but keeps the log still passes | Logs are not the contract | Capture the call kwarg and assert on the value passed in | +| Module-level constant patched only via env var | `monkeypatch.setenv('FWL_DATA', ...)` on a source that read it at import time | Constants are frozen at import; setenv is too late | `monkeypatch.setattr('mod.CONST', ...)` in addition to setenv | +| Optional dep imported unconditionally | `import hypothesis` at module top | Docker `--no-deps` build skips the optional install | `pytest.importorskip('hypothesis')` at module top | +| Stale marker after refactor | File moved from `interior/` to `interior_energetics/`, kept `@pytest.mark.unit` but missing the module-level `pytestmark` | CI marker filter still passed because of per-function markers; coverage tier became invisible | Add module-level `pytestmark = [pytest.mark.unit, pytest.mark.timeout(30)]` | +| Trivially-true on implicit None | `def fixture(): pass`; `def test_x(fixture): assert fixture is None` | Fixture returned None implicitly; test passes for the wrong reason | Delete the test | +| Hidden Zalmoxis equilibration loop | Slow-tier test pairs `interior_struct.module='zalmoxis'` with dummy outgas and exceeds its wall-time budget by 50-100x | `equilibrate_initial_state` defaults to True; iterates CALLIOPE + Zalmoxis up to 15 times before the main loop. Dummy outgas keeps `P_surf ≈ 0`, so `dP/P` never converges and the loop runs to the cap. Plus the per-iteration `update_structure_from_interior` composition trigger fires every iteration on dummy outgas. | Use `**minimal_zalmoxis_overrides()` from `tests/integration/conftest.py` to disable both code paths together. The helper sets `equilibrate_init=False`, `update_interval=0`, and the three refresh-trigger thresholds to 0.999 / 0 for defence-in-depth. | + +When you spot a new variant of these, add it here. + +--- + +## 17. Sister rules (cross-link) + +- `.github/copilot-instructions.md` "Testing Standards" -- the high-level summary readers without `tests/**` context see first. +- `.claude/rules/proteus-code-review.md` "Test marker discipline" -- the review-pass gate that backs up the rules in this file. Also contains domain-aware physics checks (Stefan-Boltzmann exponent, hf_row override pattern, IC consistency, whole-element aggregation symmetry) that apply when reviewing the **source** code that tests cover. + +Any change to the rule set: update both files in the same commit and call out the cross-reference in the commit body. diff --git a/.github/actions/setup-proteus/action.yml b/.github/actions/setup-proteus/action.yml new file mode 100644 index 000000000..7c16a0ba4 --- /dev/null +++ b/.github/actions/setup-proteus/action.yml @@ -0,0 +1,403 @@ +name: 'Set up PROTEUS environment' +description: > + Cross-platform PROTEUS environment setup. Installs Python + Julia, the + system packages SOCRATES needs, the editable Python submodules, the + AGNI clone (at the pinned ref), and primes the SOCRATES build + Julia + depot + FWL_DATA caches. Exports RAD_DIR, FWL_DATA, AGNI_DIR, + JULIA_DEPOT_PATH as workflow env vars. + +inputs: + python-version: + description: Python version to install + required: false + default: '3.12' + julia-version: + description: Julia version (AGNI requires 1.11.x) + required: false + default: '1.11.2' + install-editable-submodules: + description: > + Whether to clone MORS / aragog / JANUS / CALLIOPE / ZEPHYRUS / Zalmoxis + and pip install them editable. Set to 'false' to rely solely on the + PyPI / git URLs in pyproject.toml dependencies. + required: false + default: 'false' + full-data: + description: > + If 'true', download the integration-tier data set (ARAGOG tables, + melting curves, stellar tracks). If 'false', only the smoke-tier + data is downloaded (spectral file + minimal stellar spectrum). + required: false + default: 'false' + +outputs: + rad-dir: + description: 'Absolute path to the SOCRATES install (sets RAD_DIR).' + value: ${{ steps.export-env.outputs.rad_dir }} + fwl-data: + description: 'Absolute path to the FWL_DATA root.' + value: ${{ steps.export-env.outputs.fwl_data }} + +runs: + using: composite + steps: + # ------------------------------------------------------------------ + # 0. Reclaim disk on Linux runners + # ------------------------------------------------------------------ + # GHA ubuntu-latest images ship with ~14 GB free and most of that + # gets eaten by Android SDK + .NET + ghc + CodeQL + Haskell tools + # that PROTEUS does not use. The aragog + atmodeller + Julia + + # FWL_DATA + SOCRATES install chain has been observed to push the + # remaining headroom to ~61 MB on integration-tier runs, which + # then trips Errno 28 during a downstream pip install. Pruning the + # unused preinstalled toolchains up front buys ~25 GB. macOS is + # untouched (different image layout, smaller pip wheels, and no + # observed disk pressure on this branch). + - name: Free disk space on Linux + if: runner.os == 'Linux' + shell: bash + run: | + echo "Before cleanup:" + df -h / + sudo rm -rf /usr/share/dotnet \ + /usr/local/lib/android \ + /opt/ghc \ + /opt/hostedtoolcache/CodeQL \ + /opt/hostedtoolcache/go \ + /opt/hostedtoolcache/PyPy \ + /opt/hostedtoolcache/Ruby \ + /usr/local/share/boost \ + "$AGENT_TOOLSDIRECTORY/Ruby" 2>/dev/null || true + echo "After cleanup:" + df -h / + + # ------------------------------------------------------------------ + # 1. System packages + # ------------------------------------------------------------------ + - name: Install system packages + shell: bash + run: | + set -euo pipefail + if [[ "$RUNNER_OS" == "Linux" ]]; then + sudo apt-get update -qq + sudo apt-get install -y --no-install-recommends \ + gfortran \ + libnetcdff-dev \ + netcdf-bin \ + libssl-dev \ + curl + elif [[ "$RUNNER_OS" == "macOS" ]]; then + # gcc bundles gfortran on macOS. netcdf-fortran provides nf-config. + # Tolerate brew post-install warnings: on the macos-latest + # Sequoia images, ``brew install`` can exit 1 with all + # packages successfully poured if any post-install step emits + # the "did not complete successfully" warning. The packages + # themselves are functional; verify each is on the prefix + # list before accepting the failure. + set +e + brew install --quiet gcc netcdf netcdf-fortran openssl@3 + brew_rc=$? + set -e + missing=() + for pkg in gcc netcdf netcdf-fortran openssl@3; do + brew list "$pkg" >/dev/null 2>&1 || missing+=("$pkg") + done + if [ ${#missing[@]} -gt 0 ]; then + echo "ERROR: brew install rc=$brew_rc and missing packages: ${missing[*]}" >&2 + exit "${brew_rc:-1}" + fi + if [ "$brew_rc" -ne 0 ]; then + echo "brew install exited $brew_rc but all packages present; continuing" + fi + # Homebrew installs gfortran as a versioned binary (gfortran-15); + # the plain `gfortran` symlink is not always present when gcc was + # pre-installed on the runner image. Create one so SOCRATES's + # `command -v gfortran` check succeeds. + if ! command -v gfortran >/dev/null 2>&1; then + gfortran_bin="$(ls /opt/homebrew/bin/gfortran-* 2>/dev/null | sort -V | tail -n 1)" + if [ -z "$gfortran_bin" ]; then + gfortran_bin="$(ls /usr/local/bin/gfortran-* 2>/dev/null | sort -V | tail -n 1)" + fi + if [ -z "$gfortran_bin" ]; then + echo "ERROR: no gfortran-* binary found after brew install gcc" >&2 + ls -la /opt/homebrew/bin/gfortran* /usr/local/bin/gfortran* 2>/dev/null || true + exit 1 + fi + target_dir="$(dirname "$gfortran_bin")" + ln -sf "$gfortran_bin" "$target_dir/gfortran" + echo "Linked $gfortran_bin -> $target_dir/gfortran" + fi + gfortran --version | head -n 1 + else + echo "Unsupported RUNNER_OS=$RUNNER_OS" >&2 + exit 1 + fi + + # ------------------------------------------------------------------ + # 1b. Cache brew downloads on macOS (Decision 9) + # ------------------------------------------------------------------ + - name: Cache Homebrew downloads (macOS) + if: runner.os == 'macOS' + uses: actions/cache@v4 + with: + path: | + ~/Library/Caches/Homebrew/downloads + key: brew-${{ runner.os }}-${{ runner.arch }}-${{ hashFiles('.github/actions/setup-proteus/action.yml') }} + restore-keys: | + brew-${{ runner.os }}-${{ runner.arch }}- + + # ------------------------------------------------------------------ + # 2. Miniforge + conda environment (matches the developer install) + # ------------------------------------------------------------------ + # Conda-forge owns SUNDIALS + scikits.odes, which the production + # aragog CVODE path requires. Installing via pip would need SUNDIALS + # 7.x from source on Linux (apt ships 6.x) and an MPI build on + # macOS; conda-forge provides matched binaries on both platforms. + - name: Set up miniforge + proteus env + uses: conda-incubator/setup-miniconda@v3 + with: + miniforge-version: latest + python-version: ${{ inputs.python-version }} + activate-environment: proteus + channels: conda-forge + channel-priority: strict + auto-activate-base: false + + # Install the conda-side dependencies that need native libraries + # (SUNDIALS for CVODE, scikits.odes for the Python bindings). The + # rest of PROTEUS installs via pip into the same env in step 8. + - name: Install conda dependencies + shell: bash -el {0} + run: | + set -euo pipefail + mamba install -n proteus -c conda-forge -y sundials 'scikits.odes>=3.0.0' + python -c "import scikits.odes; print('scikits.odes OK, version', scikits.odes.__version__)" + + # ------------------------------------------------------------------ + # 3. Julia (AGNI needs 1.11.x) + # ------------------------------------------------------------------ + - name: Set up Julia + uses: julia-actions/setup-julia@v2 + with: + version: ${{ inputs.julia-version }} + + # ------------------------------------------------------------------ + # 4. Resolve pinned refs from pyproject.toml (for cache keys) + # ------------------------------------------------------------------ + - name: Read module pins + id: pins + shell: bash -el {0} + run: | + set -euo pipefail + echo "agni_ref=$(python tools/_module_pins.py agni ref)" >> "$GITHUB_OUTPUT" + echo "agni_url=$(python tools/_module_pins.py agni url)" >> "$GITHUB_OUTPUT" + echo "socrates_ref=$(python tools/_module_pins.py socrates ref)" >> "$GITHUB_OUTPUT" + + # ------------------------------------------------------------------ + # 5. SOCRATES build cache + # ------------------------------------------------------------------ + - name: Cache SOCRATES build + id: cache-socrates + uses: actions/cache@v4 + with: + path: socrates/ + key: socrates-${{ runner.os }}-${{ runner.arch }}-${{ steps.pins.outputs.socrates_ref }}-${{ hashFiles('tools/get_socrates.sh') }} + restore-keys: | + socrates-${{ runner.os }}-${{ runner.arch }}-${{ steps.pins.outputs.socrates_ref }}- + socrates-${{ runner.os }}-${{ runner.arch }}- + + - name: Build SOCRATES (cache miss) + if: steps.cache-socrates.outputs.cache-hit != 'true' + shell: bash + run: | + set -euo pipefail + # Disable SSH detection in the script -- CI always uses HTTPS. + export GIT_SSH_COMMAND='ssh -o StrictHostKeyChecking=no -o ConnectTimeout=1 -o BatchMode=yes' + bash tools/get_socrates.sh + + # ------------------------------------------------------------------ + # 6. AGNI clone (the source tree itself; Julia depot is a separate cache) + # ------------------------------------------------------------------ + - name: Cache AGNI clone + id: cache-agni + uses: actions/cache@v4 + with: + path: AGNI/ + key: agni-clone-${{ runner.os }}-${{ steps.pins.outputs.agni_ref }} + restore-keys: | + agni-clone-${{ runner.os }}- + + - name: Clone AGNI (cache miss) + if: steps.cache-agni.outputs.cache-hit != 'true' + shell: bash + run: | + set -euo pipefail + rm -rf AGNI + git clone "${{ steps.pins.outputs.agni_url }}" AGNI + git -C AGNI checkout --quiet "${{ steps.pins.outputs.agni_ref }}" + + - name: Pin AGNI to configured ref (cache hit reconcile) + if: steps.cache-agni.outputs.cache-hit == 'true' + shell: bash + run: | + set -euo pipefail + # Even on a cache hit, the cached clone may have been keyed under + # a restore-key fallback; reconcile by fetching + checking out the + # pinned ref so the AGNI tree always matches the manifest. + cur="$(git -C AGNI rev-parse HEAD 2>/dev/null || echo unknown)" + if [ "$cur" != "${{ steps.pins.outputs.agni_ref }}" ]; then + git -C AGNI fetch --quiet origin + git -C AGNI checkout --quiet "${{ steps.pins.outputs.agni_ref }}" + fi + echo "AGNI at $(git -C AGNI rev-parse --short HEAD)" + + # ------------------------------------------------------------------ + # 7. Julia depot cache (AGNI's Pkg.instantiate output) + # ------------------------------------------------------------------ + - name: Cache Julia depot + id: cache-julia + uses: actions/cache@v4 + with: + path: ~/.julia + key: julia-${{ runner.os }}-${{ runner.arch }}-${{ inputs.julia-version }}-${{ steps.pins.outputs.agni_ref }}-${{ hashFiles('AGNI/Project.toml','AGNI/Manifest.toml') }} + restore-keys: | + julia-${{ runner.os }}-${{ runner.arch }}-${{ inputs.julia-version }}-${{ steps.pins.outputs.agni_ref }}- + julia-${{ runner.os }}-${{ runner.arch }}-${{ inputs.julia-version }}- + + - name: Build AGNI (Pkg.instantiate) + shell: bash + run: | + set -euo pipefail + # AGNI's own src/get_agni.sh starts with `git pull`, which fails on + # the detached-HEAD checkout we pinned via tools/_module_pins.py. + # Run the build steps directly so we keep the pinned ref intact. + # RAD_DIR is exported by the later "Export environment" step but + # AGNI's deps/build.jl reads it inline, so we set it here too. + # + # Order matters: Pkg.instantiate must run BEFORE deps/build.jl, + # because deps/build.jl includes SOCRATES's clean_wrappers.jl + # which does `using Glob`. With AGNI's project activated and + # instantiated, Glob (and every other dep in the manifest) is + # available to the build script. Running the build first would + # invoke `using Glob` against Julia's default environment, + # which carries no AGNI deps and fails with + # "Package Glob not found in current path." + # + # Keep the committed Manifest.toml in place: deleting it forces + # Pkg to re-resolve the dependency graph and re-precompile every + # package even on a Julia depot cache hit (~4 min penalty per + # run). The depot cache key already hashes Project.toml and + # Manifest.toml, so any genuine manifest change invalidates + # correctly. The pinned AGNI ref ships a matching Manifest. + export RAD_DIR="$GITHUB_WORKSPACE/socrates" + cd AGNI + julia --project=. -e 'using Pkg; Pkg.instantiate()' + julia --project=. deps/build.jl + + # ------------------------------------------------------------------ + # 8. PROTEUS pip install (resolves PyPI + git deps in pyproject.toml) + # ------------------------------------------------------------------ + - name: Install PROTEUS + shell: bash -el {0} + run: | + set -euo pipefail + python -m pip install --upgrade pip + pip install -e ".[develop]" + + # ------------------------------------------------------------------ + # 9. FWL_DATA cache + smoke / full data download + # ------------------------------------------------------------------ + - name: Cache FWL_DATA + uses: actions/cache@v4 + with: + path: ~/fwl_data + key: fwl-data-${{ runner.os }}-${{ hashFiles('.github/data-manifest.yaml') }} + restore-keys: | + fwl-data-${{ runner.os }}- + + - name: Download smoke data + shell: bash -el {0} + env: + FWL_DATA: ${{ github.workspace }}/../fwl_data + run: | + set -euo pipefail + # The cache restored ~/fwl_data; symlink it to the workspace-local + # path the test fixtures expect, so we don't have to special-case + # absolute paths across platforms. + mkdir -p "$HOME/fwl_data" + export FWL_DATA="$HOME/fwl_data" + echo "FWL_DATA=$FWL_DATA" >> $GITHUB_ENV + python -c " + from proteus.utils.data import ( + download_spectral_file, + download_stellar_spectra, + download_exoplanet_data, + download_massradius_data, + ) + download_spectral_file('Dayspring', '16') + download_stellar_spectra(folders=('solar',)) + download_exoplanet_data() + download_massradius_data() + " + + - name: Download full integration data + if: inputs.full-data == 'true' + shell: bash -el {0} + env: + FWL_DATA: ${{ env.FWL_DATA }} + run: | + set -euo pipefail + python - <<'PY' + from proteus.config import read_config_object + from proteus.utils.data import download_sufficient_data, download_eos_dynamic + + # Reference configs that drive the integration / slow tier downloads. + for cfg_path in ('input/all_options.toml', + 'tests/integration/aragog_janus.toml'): + try: + cfg = read_config_object(cfg_path) + if cfg_path.endswith('aragog_janus.toml'): + cfg.atmos_clim.module = 'agni' + download_sufficient_data(cfg, clean=False) + except Exception as exc: + print(f'warning: download_sufficient_data({cfg_path}) -> {exc}') + + try: + download_eos_dynamic('WolfBower2018_MgSiO3') + except Exception as exc: + print(f'warning: download_eos_dynamic -> {exc}') + PY + + # ------------------------------------------------------------------ + # 10. Final env export so calling jobs see RAD_DIR / FWL_DATA / AGNI_DIR + # ------------------------------------------------------------------ + - name: Export environment + id: export-env + shell: bash -el {0} + run: | + set -euo pipefail + RAD_DIR="$GITHUB_WORKSPACE/socrates" + FWL_DATA="${FWL_DATA:-$HOME/fwl_data}" + AGNI_DIR="$GITHUB_WORKSPACE/AGNI" + echo "RAD_DIR=$RAD_DIR" >> $GITHUB_ENV + echo "FWL_DATA=$FWL_DATA" >> $GITHUB_ENV + echo "AGNI_DIR=$AGNI_DIR" >> $GITHUB_ENV + # juliacall: prefer the Julia we just installed via setup-julia. + if command -v julia >/dev/null 2>&1; then + JULIA_BINDIR="$(dirname "$(command -v julia)")" + echo "JULIA_BINDIR=$JULIA_BINDIR" >> $GITHUB_ENV + echo "PYTHON_JULIACALL_BINDIR=$JULIA_BINDIR" >> $GITHUB_ENV + echo "PYTHON_JULIACALL_HANDLE_SIGNALS=yes" >> $GITHUB_ENV + fi + echo "rad_dir=$RAD_DIR" >> $GITHUB_OUTPUT + echo "fwl_data=$FWL_DATA" >> $GITHUB_OUTPUT + echo "" + echo "===== setup-proteus done =====" + echo " RAD_DIR = $RAD_DIR" + echo " FWL_DATA = $FWL_DATA" + echo " AGNI_DIR = $AGNI_DIR" + echo " python = $(command -v python) $(python --version)" + echo " julia = $(command -v julia) $(julia --version 2>&1 | head -1)" + echo " gfortran = $(command -v gfortran || true)" + echo " nf-config = $(command -v nf-config || true)" + echo " socrates = $(ls socrates/bin/radlib.a 2>/dev/null || echo MISSING)" diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 09007c7dd..db6f74df2 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -16,11 +16,19 @@ Follow the same standards for testing, coverage, code quality, and infrastructur ## High-Level Instructions -1. **Always** follow the testing standards outlined in this document and `docs/How-to/test_infrastructure.md` for all code changes. -2. **Always** inform yourself of the current project memory in `.github/copilot-memory.md` before making changes. -3. **Always** inform the user that you are reading in this file by printing a message at the start of your response: "(Read in copilot-instructions.md...)" -4. When creating a PR, **always** follow the PR template and ensure all sections are filled out with relevant information. -5. **Claude-specific**: `CLAUDE.md` is a symlink to this file. Store all project memories and session learnings in `.github/copilot-memory.md` (not in `.claude/` subdirectories), following the structure and conventions already established there. +> ### Rule files you MUST read on every session +> +> PROTEUS keeps its Claude-Code rule files under `.github/.claude/rules/` (NOT the conventional repo-root `.claude/`, which is gitignored and so cannot be shared with collaborators). Claude Code does NOT auto-discover the rules at this unusual path. Read them explicitly at the start of every session and any time you open a related file: +> +> - [`.github/.claude/rules/proteus-tests.md`](.claude/rules/proteus-tests.md) -- test quality deep-dive: anti-happy-path patterns, discriminating-value guards, physics-invariant tiering, validation certification markers, adversarial-review trigger. **Required reading before editing any file under `tests/**` or `src/proteus/**`.** +> - [`.github/.claude/rules/proteus-code-review.md`](.claude/rules/proteus-code-review.md) -- review-pass gate, domain-aware physics review (Stefan-Boltzmann, hf_row save/restore, IC consistency, whole-element aggregation symmetry, etc.). **Required reading before any code review pass.** +> +> These two files plus this one are the canonical sources of truth for testing rigor and review criteria. Together they enforce PROTEUS's extreme-rigor stance on physics validity, anti-happy-path testing, and validation certification. + +1. **Always** read the two rule files above plus the testing standards in this document and `docs/How-to/test_infrastructure.md` before any code change. +2. **Always** inform the user that you are reading in this file by printing a message at the start of your response: "(Read in copilot-instructions.md...)" +3. When creating a PR, **always** follow the PR template and ensure all sections are filled out with relevant information. +4. **Claude-specific**: `CLAUDE.md` is a symlink to this file. Session learnings, plans, and memories live in `~/.claude/projects//memory/` (per the global voice rule); they do NOT live in this repository. ## Ecosystem Structure @@ -147,6 +155,7 @@ pre-commit install -f - All Python submodules should be installed as editable (`-e`) for development - After installation, reload shell: `source ~/.bashrc` or `conda activate proteus` - After each file change or edit, ruff format all changed files with `ruff check --fix ` and `ruff format --check` +- **Parallel tracks**: one conda env per git worktree. `conda create --clone` hardlinks pip-editable pointers, so a subsequent `pip install -e .` in one env can silently repoint another env's `import proteus`. Canary before each A/B run: `python -c "import proteus; print(proteus.__file__)"`; recipe in `~/.claude/memory/conda_env_split_pattern.md`. ### Build Commands @@ -158,6 +167,15 @@ pre-commit install -f **Always run** `pip install -e ".[develop]"` after code changes to update installation. +#### SOCRATES build flags (Aragog reproducibility) + +For bit-reproducibility (paper plots, CHILI, SPIDER-parity) edit +`SOCRATES/make/Mk_cmd`: replace `FORTCOMP ... -Ofast -march=native` with +`-O2 -fno-fast-math` and clear `OMPARG = -fopenmp`. The upstream flags +produce ULP-level non-determinism that AGNI's Newton solver amplifies +into 1-2 % F_atm variance, which the Aragog hardening stack absorbs but +still leaves runs non-bit-identical. Rebuild with `cd socrates && ./build_code`. + ### Test Commands **Run all tests**: @@ -188,10 +206,13 @@ coverage report coverage html ``` -**Coverage thresholds** (in `pyproject.toml`): +**Coverage thresholds** (in `pyproject.toml`; auto-ratcheting, never manually decreased): -- Fast gate: `[tool.proteus.coverage_fast] fail_under = 44.45` -- Full suite: `[tool.coverage.report] fail_under = 59` +- Fast gate (`[tool.proteus.coverage_fast]`, unit + smoke, every PR): ratcheting toward **90%** (the PROTEUS-ecosystem ceiling). Expected to plateau in the 60-75% band; unit tests alone cannot exercise wrapper paths that require real binaries. +- Full gate (`[tool.coverage.report]`, unit + smoke + integration + slow, nightly): ratcheting toward **90%**. +- Estimated total (PR unit/smoke union with nightly artifact, every PR): compared against the full gate. This is the 90% KPI; unit tests alone cannot reach it, the nightly tier fills the wrapper / binary code paths. + +See the `Coverage architecture` block in the Testing Standards section below for the contract. **Validate test structure**: @@ -225,7 +246,7 @@ pre-commit install -f **CI runs on PRs** (`.github/workflows/ci-pr-checks.yml`): -1. **Unit tests**: `pytest -m "unit and not skip" --cov=src --cov-fail-under=44.45` +1. **Unit tests**: `pytest -m "unit and not skip" --cov=src --cov-fail-under=$FAST_COV_FAIL_UNDER` (CI reads the current fast threshold from `pyproject.toml` `[tool.proteus.coverage_fast]`; auto-ratcheted, capped at 90%) 2. **Smoke tests**: `pytest -m "smoke and not skip"` 3. **Lint**: `ruff check src/ tests/` and `ruff format --check src/ tests/` 4. **Diff-cover**: 80% coverage on changed lines (enforced) @@ -242,7 +263,7 @@ pre-commit install -f - `proteus.py` - Core `Proteus` class - `doctor.py` - Environment diagnostics (`proteus doctor`) - `config/` - Configuration system (TOML parsing, validation) - - `atmos_clim/`, `atmos_chem/`, `escape/`, `interior/`, `outgas/`, `observe/`, `orbit/`, `star/` - Physics module wrappers + - `atmos_clim/`, `atmos_chem/`, `escape/`, `interior_struct/`, `interior_energetics/`, `outgas/`, `observe/`, `orbit/`, `star/` - Physics module wrappers - `utils/` - Utilities (data, logging, plotting, helpers) - `grid/`, `inference/`, `plot/` - Specialized functionality @@ -280,40 +301,167 @@ pre-commit install -f ## Testing Standards -**Structure**: Tests MUST mirror source exactly. `src/proteus/config/_config.py` → `tests/config/test_config.py` +PROTEUS is scientific simulation code, so the test suite is held to physics-grade rigor. The rules below are the contract; the deep-dive (anti-happy-path patterns, discriminating-value guards, certification markers, adversarial-review trigger) lives in [`.github/.claude/rules/proteus-tests.md`](.claude/rules/proteus-tests.md). Read that file before editing any test file or any source file under `src/proteus/**`. The two files must be kept in sync; if you change one, mirror the change in the other. + +### Structure + +- Tests MUST mirror source exactly: `src/proteus//.py` -> `tests//test_.py`. Validated by `bash tools/validate_test_structure.sh`. +- Framework: `pytest` exclusively in the `tests/` directory. +- Shared fixtures: `tests/conftest.py` (parameter classes `EarthLikeParams`, `UltraHotSuperEarthParams`, `IntermediateSuperEarthParams`; config paths `config_earth`, `config_minimal`, `config_dummy`, ...). Read `tests/conftest.py` before writing new tests. + +### Markers and the module-level marker rule + +Tier markers, with their CI surface and per-test wall-time budgets: + +| Marker | What it tests | Speed budget | When CI runs it | +|---|---|---|---| +| `@pytest.mark.unit` | Python logic, heavy physics mocked | < 100 ms per test | Every PR (`unit and not skip`) | +| `@pytest.mark.smoke` | Real binaries, 1 timestep, low resolution | < 30 s per test | Every PR (`smoke and not skip`) | +| `@pytest.mark.integration` | Multi-module coupling | Minutes per test | Nightly only | +| `@pytest.mark.slow` | Full physics validation | Up to hours per test | Nightly only (targeted file list) | +| `@pytest.mark.skip` | Placeholder, deliberately disabled | n/a | Never | + +**Mandatory module-level marker** (no exceptions): every test file begins with + +```python +pytestmark = [pytest.mark., pytest.mark.timeout()] +``` + +with timeouts: 30 s for unit, 60 s for smoke, 300 s for integration, 3600 s for slow. Per-function markers are additive but do not replace the module-level marker. CI runs `pytest -m "unit and not skip"`; tests without a tier marker are invisible to CI. The `pytest-timeout` ceiling is a defensive net against future regressions that introduce a hang; current budgets are well clear of it. + +### Physics validity (tiered) + +Every unit test on a **physics module** must assert at least one of the following invariants. Physics modules are: `interior_struct/*`, `interior_energetics/*`, `atmos_clim/*`, `atmos_chem/*`, `escape/*`, `outgas/*`, `orbit/*`, `star/*`, `observe/*`, `inference/objective.py`, `inference/BO.py`, `inference/async_BO.py`. Helpers under physics directories (e.g. `escape/common.py`, `outgas/common.py`) are physics-required when their outputs feed physics; the utility exemption applies only to pure structural plumbing (logging, path resolution, type coercion with no physical quantity). See `.github/.claude/rules/proteus-tests.md` section 3 for the source-file-purpose carve-out. + +- **Conservation**: mass closure (sum of reservoirs = total), energy balance (LHS = RHS within tolerance), angular-momentum conservation. +- **Positivity / boundedness**: T > 0 K, P > 0 Pa, mass fractions in [0, 1], escape rate <= atmospheric mass, outgassing >= 0, melt fraction in [0, 1]. +- **Monotonicity or symmetry**: e.g. P increasing with depth implies rho increasing; reversing time integration recovers the initial condition. +- **Pinned numeric value with a discrimination guard**: a closed-form value pinned via `pytest.approx`, accompanied by an explicit assertion that the most plausible wrong-formula result would differ from the correct one by more than the tolerance. The deep-dive file has examples. + +Utility modules (`utils/*`, `config/*`, `plot/*`, `cli.py`, `inference/utils.py`, `tools/*`) are **exempt** from the physics-invariant requirement but still subject to the anti-happy-path rules (edge case, adversarial input handling where applicable, non-trivially-derivable assertion values). + +Tag every test that asserts a physical invariant with `@pytest.mark.physics_invariant` so coverage of physics-invariant tests can be tracked separately from line coverage. The marker is per-function, NOT module-level: structural tests (ordering, autonomy, mutation-in-place, pass-through assignment) in a physics-module test file should NOT carry the marker. Reference-pinned tests carry both `@pytest.mark.reference_pinned` and `@pytest.mark.physics_invariant` (the published-value pin is itself the invariant). Granularity of the reference-pinned requirement is **per source file**, not per directory: `interior_energetics/aragog.py` and `interior_energetics/spider.py` each need their own pinned test, even though they live in the same directory. Per-source-file inventory is tracked in `docs/Validation//.md`; the `tools/check_test_quality.py --reference-pinned-audit` command currently reports at directory granularity and may lag the per-file contract. + +### Anti-happy-path rules (every new test) + +Every new test function MUST include: + +1. **At least one edge case** (boundary value, empty input, extreme physical parameter). +2. **At least one path that exercises the error contract**: a documented exception path, a guard return, or a graceful clamp. If the function under test has no validation logic, exercise the limit-input behavior (e.g. `e = 0` for an eccentricity-dependent routine) and assert the mathematical invariant. +3. **Assertion values that are NOT trivially derivable from the implementation**: discriminating numeric pins (not `T = 1` where every exponent gives 1), property-based assertions (monotonicity, conservation) preferred over point checks. + +**Forbidden patterns** (these will be flagged by `tools/check_test_quality.py`): + +- Single-assert test functions. +- Standalone weak assertions: `assert result is not None`, `assert result > 0`, `assert len(result) > 0`, `assert isinstance(result, dict)` as the only meaningful check. +- Tests with no function-level docstring. +- Tests using `==` adjacent to float literals. +- Tests asserting on a fixture's implicit default (e.g. `assert fixture is None` where the fixture returns `None` implicitly): a trivially-true test is worse than no test. + +### Validation certification (marker policy) + +Two markers track validation quality independently of line coverage: + +- `@pytest.mark.physics_invariant` -- this test asserts at least one of the four invariants above. Every physics-module test should carry this marker if it qualifies. +- `@pytest.mark.reference_pinned` -- this test pins behavior against a **published benchmark** (cite the paper, figure, table), an **analytical limit** (e.g. the Stefan-Boltzmann black-body limit), or a **cross-implementation cross-check** (e.g. SPIDER vs Aragog at the same IC). Each physics module under `interior_*`, `interior_energetics/*`, `interior_struct/*`, `atmos_*`, `escape/*`, `outgas/*`, `orbit/*`, `star/*` must contain at least one `reference_pinned` test. Module-level inventory tracked in `docs/Validation/.md`. + +The new markers are registered in `pyproject.toml`. They do not gate CI by themselves; they are tracked via `tools/check_test_quality.py` for visibility. + +### Float and numerical comparison + +- NEVER use `==` for floats. Use `pytest.approx(val, rel=1e-5)` (or `abs=...`) or `np.testing.assert_allclose(actual, expected, rtol=..., atol=...)`. +- State the tolerance rationale in a comment when the choice is non-obvious (e.g. "rtol=1e-3 because Cp lookup truncates to 4 sig fig"). +- For pinned numeric values, include a **discrimination guard**: a follow-up `assert` showing the wrong-formula value would differ from the correct one by more than the tolerance. See `.github/.claude/rules/proteus-tests.md` Section 2 for the canonical pattern. + +### Mocking discipline + +- Default to `unittest.mock` for ALL external calls in unit tests: SOCRATES, AGNI, SPIDER, file I/O, network, Aragog/Zalmoxis solvers. +- Mock at the narrowest scope: a specific function, not a whole module. +- A mocked physics function must return **physically plausible** values; a mock that returns `0.0` or `1.0` for everything can mask real bugs. +- NEVER mock the function under test. +- Smoke tests use real binaries; integration tests use real submodules. + +### Optional-dependency imports -**Framework:** Use `pytest` exclusively in the `tests/` directory. +Any test that imports an optional dependency (`hypothesis`, `boreas`, `atmodeller`, `lovepy`, `mors`, `vulcan`, also `zalmoxis` when not installed via editable) MUST call `pytest.importorskip('')` at module top. The PR Docker image is built with `pip install --no-deps`; tests that import optional deps unconditionally will fail to collect on CI even though they run locally. This trap has recurred multiple times and is now lint-enforced. -**Markers** (use consistently): +### Module-level constants and `monkeypatch` -- `@pytest.mark.unit` - Fast Python logic tests (<100ms, mock heavy physics) -- `@pytest.mark.smoke` - Real binary validation (1 timestep, <30s) -- `@pytest.mark.integration` - Multi-module coupling -- `@pytest.mark.slow` - Full physics validation (hours) +When the source under test reads an environment variable into a module-level constant at import time, e.g. -**Rules**: +```python +FWL_DATA_DIR = Path(os.environ.get('FWL_DATA', ...)) +``` + +`monkeypatch.setenv` is **not sufficient**: the constant is frozen at the import that already happened. Patch the constant directly: + +```python +monkeypatch.setattr('proteus.utils.data.FWL_DATA_DIR', tmp_path, raising=False) +``` + +Patch BOTH the env var (for downstream code that re-reads it) AND the constant (for code that reads only the constant). + +### Voice rule for test artifacts + +The repo-wide voice rule (zero AI-process disclosure in any public artifact, see top of this file) applies to test code with the same strictness as to source. The rule is scoped to artifacts other contributors and external readers see: test-skip reasons, test-file/function docstrings, test-function/class names, parametrize ids, log-capture assertions, **commit messages on test-touching commits, pull-request titles and bodies on test-touching PRs**, GitHub Actions job/step names, inline `src/proteus/**` comments, and shipped log strings. Out of scope: the rule documents themselves (this file, `.github/.claude/rules/proteus-tests.md`, `.github/.claude/rules/proteus-code-review.md`, `docs/How-to/test_*.md`) may legitimately name the procedures they define. Banned phrases inside in-scope artifacts: "audit", "review pass", "adversarial review", "Phase X" (AI-roadmap labels), "T1.x", "Group A/B/C/D" (AI work groups), `claude-config/...` paths, "Generated with Claude", em-dashes, en-dashes (except bibliographic page ranges). Write the OUTCOME, never the PROCESS. + +### Speed and determinism + +- Unit tests: < 100 ms wall-time each. The 30 s `timeout` is a defensive ceiling, not the target. +- Aggressively mock heavy simulations, file I/O, and external APIs in unit tests. +- Set seeds for any randomness: `np.random.seed(42)`, `torch.manual_seed(42)`, `random.seed(42)`. All three must be seeded if all three are exercised; deterministic-only-on-one is a known regression vector. +- Use `tmp_path` (pytest fixture) for temporary files; do not produce large outputs in tests. + +### Documentation per test + +- File-level docstring: name the module under test, list the invariants and contract clauses the file exercises, and link to the three test docs (`test_infrastructure.md`, `test_categorization.md`, `test_building.md`). +- Function-level docstring: state the physical scenario or contract clause being verified, in plain language. Required (lint-enforced). +- Inline comments: explain **why** a specific input range was chosen ("T=300 K and T=1500 K so the T**3 vs T**4 difference is resolved well above tolerance"). + +### Independent review trigger + +A pull request that adds or substantially modifies > 50 lines of test code across all its commits triggers an independent review pass before merge. The denominator is PR-level (`git diff origin/main...HEAD -- 'tests/**'`), not per-commit; splitting into many sub-50-line commits does not dodge the trigger. The reviewer cites the anti-happy-path rule, the discrimination-guard requirement, and the physics-invariant tier; flags single-assert tests, weak `is not None` patterns, missing module-level marker, missing `physics_invariant` tag on a physics-module test, and dead tests (tests that pass for the wrong reason). + +### Tooling + +- Validate test structure: `bash tools/validate_test_structure.sh` +- Test-quality lint (anti-happy-path, marker, weak assertions): `python tools/check_test_quality.py --check` +- Baseline (run after a deliberate sweep): `python tools/check_test_quality.py --baseline` +- Coverage analysis: `bash tools/coverage_analysis.sh` +- Format: `ruff format src/ tests/` +- Lint: `ruff check src/ tests/` + +### Coverage architecture -- **Mocking:** Default to `unittest.mock` for ALL external calls (SOCRATES, AGNI, file I/O, network) in unit tests. Only use real calls if explicitly requested for integration tests. -- **Speed:** Unit tests must run in <100ms. Aggressively mock heavy simulations, I/O, and external APIs. -- **Floats:** NEVER use `==` for floats. Use `pytest.approx(val, rel=1e-5)` or `np.testing.assert_allclose`. -- **Physics:** Use physically valid inputs (T > 0K, P > 0) unless testing error handling. Add comments explaining *why* a specific input range was chosen (e.g., "Temperature set to 300K to represent habitable zone conditions"). -- **Context:** Always read `tests/conftest.py` before writing tests to use existing fixtures. -- **Parametrization:** Prefer `@pytest.mark.parametrize` over writing multiple similar test functions. -- **Documentation:** Add docstrings to each test explaining the physical scenario. In test file headers, include an overview and links to `docs/How-to/test_infrastructure.md`, `docs/How-to/test_categorization.md`, and `docs/How-to/test_building.md`. -- **Coverage Tool:** `pytest --cov` (convenient) or `coverage run -m pytest` (matches CI). Both work correctly. -- **Formatting:** Ruff format all test files before committing. +PROTEUS uses two gates with explicit sub-targets: -### Coverage Requirements -- **Threshold:** Check `pyproject.toml` [tool.coverage.report] `fail_under` for current threshold. -- **Automatic Ratcheting:** Coverage threshold automatically increases on main branch via `tools/update_coverage_threshold.py` (never decreases). -- **Reports:** Run `pytest --cov --cov-report=html` and inspect `htmlcov/index.html` for gaps. -- **Analysis:** Use `bash tools/coverage_analysis.sh` to identify low-coverage modules needing tests. -- **Quality Gate:** All PRs must pass the coverage threshold defined in CI (see `.github/workflows/proteus_test_quality_gate.yml`). +| Gate | Tests included | Target | Enforced | +|---|---|---|---| +| Fast gate (`tool.proteus.coverage_fast.fail_under`) | unit + smoke | Ratcheting toward **90%** (expected plateau ~60-75%) | Every PR | +| Estimated total (PR unit/smoke union with latest nightly artifact) | unit + smoke + integration | **90%** (the PROTEUS-ecosystem ceiling) | Every PR | +| Full gate (`tool.coverage.report.fail_under`) | unit + smoke + integration + slow | **90%** | Nightly only | +| Diff-cover | changed lines only | 80% (hard-coded) | Every PR | + +**What this means for contributors**: 90% is the PROTEUS-ecosystem coverage ceiling and it applies EVERYWHERE. Both gates ratchet toward 90, capped at 90 (`tools/update_coverage_threshold.py` enforces `ECOSYSTEM_CEILING = 90.0`); neither gate ratchets above the ceiling and neither may be manually decreased. Unit tests alone are not expected to actually REACH 90 because wrapper code that requires real binaries (SOCRATES, AGNI, SPIDER) is exercised by smoke + integration tests in nightly; the fast gate will plateau wherever unit + smoke coverage actually lands (typically 60-75%). The 90% target is reached via the estimated-total: the PR's unit/smoke coverage is unioned with the latest nightly artifact and compared against the full gate. + +Reports: `pytest --cov=src --cov-report=html` and open `htmlcov/index.html`. Module-level analysis: `bash tools/coverage_analysis.sh`. Diff-cover reasoning is documented in `docs/How-to/test_infrastructure.md`. ## Safety & Determinism - **Randomness:** Explicitly set seeds (e.g., `np.random.seed(42)`) in tests. - **Files:** Do not generate tests that produce large output files (unless explicitly instructed); use `tempfile` or mocks. +## Verification and Diagnostic Plots + +When testing new routines, reviewing behavior, or investigating edge cases across any PROTEUS ecosystem module: + +- **Always produce plots** that verify the requested behavior. Plots are the primary verification artifact for scientific simulation code. +- **Store all generated plots and data in gitignored folders.** Use `output_files/` (already in `.gitignore`). Never commit generated plots or simulation output to the repository. +- **Store raw simulation data** alongside plots (same gitignored folder) when feasible (up to a few hundred MB). Formats: `.txt`, `.csv`, or `.npz`. This allows replotting without re-running. +- **Store plot-generating scripts in gitignored folders** unless the user explicitly asks to commit them. If committing, place in `src/tests/`. +- **At the end of a plotting task**, report: (1) output folder path, (2) what each plot shows, (3) notable findings or anomalies. +- **Plot standards**: matplotlib with Wong colorblind-friendly palette, sans-serif font (Helvetica/Arial), inward ticks on all sides, `dpi >= 150`, clear axis labels with units, legends, descriptive titles. +- **Documentation images**: use AVIF format (not PNG) for all plots committed to `docs/assets/`. AVIF is 3-5x smaller than PNG at equivalent quality. Convert with `magick input.png -quality 60 output.avif`. Reference in markdown as `![alt](path.avif)`. + ## Code Quality **Style** (enforced by ruff): @@ -323,10 +471,25 @@ pre-commit install -f - Variables/functions: `snake_case` - Constants: `UPPER_CASE` - Type hints: Standard Python type hints -- Docstrings: Brief descriptions of physical scenarios. These include "Parameters" and "Returns" sections for functions. The variables are then included as bullet points. +- Docstrings: Brief descriptions of physical scenarios **Pre-commit**: Runs `ruff check` and `ruff format` automatically. Fix issues before committing. +### Code organization + +PROTEUS is edited by many contributors in parallel; organise code so changes +stay local. Full conventions: `docs/How-to/development_standards.md`. + +- Files: aim < 500 lines; split past ~800 along concern boundaries. +- Functions/methods: aim < 50 lines; extract helpers past ~80. Express long + orchestration as named stage functions, not one inline body. +- New backend: add a new `.py` plus a dispatch branch in `wrapper.py`; + never append a second backend into an existing backend file. +- Central registries (output-schema keys, config fields): one entry per line, + trailing comma, grouped by module, alphabetical within group. +- Add to shared files narrowly: a stage function over an inline edit; a column + in its module's group over the end of the global list. + ## Common Workflows ### Making a Code Change @@ -373,12 +536,22 @@ pytest --pdb # Drop into debugger on failure ## Important Notes -- **Docker CI**: Uses pre-built image `ghcr.io/formingworlds/proteus:latest`. PR code is overlaid, only changed files recompiled. -- **Coverage ratcheting**: Thresholds auto-increase when coverage improves (committed by `github-actions[bot]`). Never manually decrease. +- **CI caching**: ubuntu-latest + macos-latest runners with `actions/cache` for SOCRATES build, Julia depot, FWL_DATA, AGNI clone, pip wheels. Composite action `.github/actions/setup-proteus` handles platform-aware setup. Cache keys derive from `[tool.proteus.modules]` in pyproject.toml plus `.github/data-manifest.yaml`. +- **Coverage ratcheting**: Thresholds auto-increase when coverage improves (committed by `github-actions[bot]`), capped at the 90% PROTEUS-ecosystem ceiling. Never manually decrease. - **Test placeholders**: Some tests marked `@pytest.mark.skip` are placeholders. Excluded from CI. - **Windows**: Not supported. Linux/macOS only. - **Python version**: Must be 3.12 (PETSc/SPIDER require Python <= 3.12). +## Whole-planet oxygen accounting (issue #677) + +Every config must declare an explicit `planet.elements.O_mode`. Four valid modes: + +- `"ic_chemistry"`: defer the IC O budget to CALLIOPE's fO2-buffered equilibrium. Preserves pre-fix behaviour; backwards-compatible. +- `"ppmw"`, `"kg"`: parallel to the H/C/N/S modes; sets O_kg directly. +- `"FeO_mantle_wt_pct"`: alternative unit for petrologists. The number is interpreted as `O_kg = M_mantle * (wt% / 100) * (M_O / M_FeO)`. The mantle EOS density is NOT modified; PALEOS still assumes its built-in FeO content. The mode is a unit-of-convenience for setting the volatile-O budget in familiar terms. + +Under D1A (the chosen design), CALLIOPE / atmodeller chemistry is unchanged. Oxygen is treated as a buffered element at the chemistry step but a tracked element in PROTEUS-side mass accounting. The asymmetry that previously let `M_atm > M_planet` at high H budgets is closed by including O in M_ele, in the Zalmoxis dry-mass subtraction, in the proportional escape distribution, and in the desiccation gate. Escape includes O in the unfractionated partitioning so `sum(esc_rate_e) == esc_rate_total` to within rounding. The runtime invariant `M_atm <= M_planet` is enforced via `assert_mass_conservation` in the main loop. An IC consistency check (`check_ic_oxygen_budget`, called once after the first outgas call) hard-fails on >50% divergence between user-supplied O_budget and CALLIOPE's equilibrium value. + ## Documentation References - **Testing**: `docs/How-to/test_infrastructure.md`, `docs/How-to/test_building.md`, `docs/How-to/test_categorization.md` @@ -387,41 +560,19 @@ pytest --pdb # Drop into debugger on failure - **Docs development**: `docs/How-to/documentation.md` (build/serve with `zensical serve`) - **Copilot guidelines**: `.github/copilot-instructions.md` (this file; applies to all ecosystem modules) -## 🧠 Memory Maintenance - -### Prime Directive: Keep Project Memory Current - -**ALWAYS** update `.github/copilot-memory.md` after making significant architectural changes, adding new libraries, or finalizing a key design decision. - -**What to record**: -- The change made and the *reasoning* (the "Why") behind it -- New architectural decisions (ADRs) with context -- Major refactorings or infrastructure changes -- Lessons learned from debugging or CI/CD issues -- Updates to active context (current sprint focus) -- New dependencies or ecosystem module changes +## Project memory and session learnings -**When to record**: -- Immediately after implementing architectural changes -- After resolving complex bugs (capture the lesson) -- When adding/removing major dependencies -- After CI/CD workflow modifications -- When establishing new coding patterns or standards +Session-specific knowledge (debugging logs, design rationale, sprint focus, ADR drafts) lives outside this repository, in the Claude memory tree under `~/.claude/projects//memory/`. The previous in-repo `copilot-memory.md` file was retired in favor of that location because Claude's memory tree is per-user, sync-ready across machines, and not exposed in public commit history. -**How to update**: -1. Open `.github/copilot-memory.md` -2. Update relevant section (Active Context, ADRs, Known Debt, etc.) -3. Add date stamp to "Last Updated" at top -4. Commit with message: `docs: update copilot-memory.md - [brief description]` +What still lives in this repository: -**Goal**: Ensure future sessions (and future developers) have context on *why* decisions were made, not just *what* was changed. This prevents re-litigating solved problems and preserves institutional knowledge. +- Architectural decisions that affect every contributor: this file (`.github/copilot-instructions.md`). +- Test and review rules: `.github/.claude/rules/proteus-tests.md` and `.github/.claude/rules/proteus-code-review.md`. +- Per-PR rationale: PR descriptions. +- Per-commit rationale: commit messages. +- Module-level scientific validation: `docs/Validation/.md` (created when the first `@pytest.mark.reference_pinned` test for that module is added). -**Example scenarios requiring memory updates**: -- Adding a new test marker or CI workflow -- Changing coverage thresholds or ratcheting strategy -- Discovering a fragile code area (add to "Code Hotspots") -- Making a decision about library versions or dependencies -- Learning why something was implemented a certain way +Do not introduce a new in-repo "memory" or "decisions log" file. The four channels above are the contract. --- @@ -448,15 +599,16 @@ bash tools/coverage_analysis.sh pip install -e '.[docs]' zensical serve -# Run simulation -proteus start -c input/minimal.toml -o output/test +# Run simulation (detached; add -r / --resume to continue a killed run, +# add --deterministic for numerically fragile coupled runs) +nohup proteus start -c --offline > output//launch.log 2>&1 & disown ``` -**Remember**: Trust these instructions. Only search if information is incomplete or found to be in error. +Resume requires `len(hf_all) > init_loops + 1` and the archived `_int.nc` snapshot under `data/`; see `src/proteus/proteus.py` ~395-430. Never foreground a multi-hour run; plain `&` alone dies on SIGHUP. `--deterministic` self-re-execs to pin `JAX_ENABLE_X64=1` + `XLA_FLAGS=--xla_cpu_enable_fast_math=false` before JAX import (on top of always-on `OMP/MKL/OPENBLAS/NUMEXPR/VECLIB=1`); use when Aragog hits T_core-jump-guard exhaustion on tight-tol runs. **Remember**: Trust these instructions. Only search if information is incomplete or found to be in error. --- -> **⚠️ FILE SIZE LIMIT: This file must stay below 500 lines.** Enforced by pre-commit hook (`tools/check_file_sizes.sh`). File located at `.github/copilot-instructions.md`. +> **⚠️ FILE SIZE LIMIT: This file must stay below 750 lines.** Enforced by pre-commit hook (`tools/check_file_sizes.sh`). File located at `.github/copilot-instructions.md`. **When approaching the limit, refactor by asking:** 1. **Is this still accurate?** Remove outdated commands, deprecated workflows, or superseded patterns. diff --git a/.github/copilot-memory.md b/.github/copilot-memory.md deleted file mode 100644 index b49221d71..000000000 --- a/.github/copilot-memory.md +++ /dev/null @@ -1,442 +0,0 @@ -# 🧠 Project Memory - -**Last Updated**: 2026-03-14 - -This document captures the living context of PROTEUS—the "why" behind architectural decisions, the current development focus, and critical knowledge for maintaining consistency across sessions. - ---- - -## 1. Project Identity & Stack - -### Core Identity -- **Name**: PROTEUS (/ˈproʊtiəs, PROH-tee-əs) -- **Type**: Coupled atmosphere-interior framework for rocky planet evolution -- **Philosophy**: Modular, adaptable scientific simulation inspired by the Greek god of elusive sea change -- **Version**: 25.11.19 (CalVer: YY.MM.DD) -- **License**: Apache 2.0 - -### Primary Technology Stack -- **Languages**: Python 3.12 (primary), Julia, Fortran, C -- **Python Framework**: setuptools-based package (`fwl-proteus`) -- **Testing**: pytest with markers (unit, smoke, integration, slow) -- **Coverage**: coverage.py with automatic ratcheting (current: 59% full, 44.45% fast) -- **Linting**: ruff (line-length 96, quote-style single) -- **CI/CD**: GitHub Actions with Docker-based workflows -- **Documentation**: Zensical (wraps MkDocs Material; serve with `zensical serve`, NOT `mkdocs serve`) - -### Key External Dependencies -- **SOCRATES** (Fortran): Spectral radiative transfer code -- **AGNI** (Julia): Radiative-convective atmospheric energy module -- **SPIDER** (C): Interior thermal evolution (T-S formalism, requires PETSc) -- **PETSc**: Numerical computing library (specific OSF version, not built from source) - -### Python Ecosystem Modules (Editable Installs) -- **CALLIOPE**: Volatile in-/outgassing and thermodynamics -- **JANUS**: 1D convective atmosphere module -- **MORS**: Stellar evolution module -- **ARAGOG**: Interior thermal evolution (T-P formalism) -- **ZEPHYRUS**: Atmospheric escape module -- **BOREAS**: Hydrodynamic atmospheric escape module -- **ZALMOXIS**: Interior structure solver (hydrostatic equilibrium, EOS) - -### Environment Requirements -- **Python**: 3.12 (strict requirement for PETSc/SPIDER compatibility) -- **Platforms**: Linux/macOS only (Windows not supported) -- **Disk Space**: ~20 GB -- **Critical Env Vars**: `FWL_DATA`, `RAD_DIR`, `PETSC_DIR`, `PETSC_ARCH` - ---- - -## 2. Active Context (The "Now") - -### Current Focus (as of 2026-03-14) - -**Recently Merged (since 2026-02-13)**: -1. **PR #648: Zalmoxis-SPIDER coupling** (merged 2026-03-10) - External mesh from Zalmoxis structure solver passed to SPIDER thermal evolution. Physics-based structure update triggers. ~1,900 lines coupling code, ~3,300 lines tests. -2. **PR #596: VULCAN online chemistry** (merged) - In-loop chemical kinetics via `src/proteus/atmos_chem/`. VULCAN now runs at every snapshot, not just post-processing. -3. **PR #642: Download bug fixes** (merged 2026-03-08) - New `download_zenodo_file()` for single-file downloads, PHOENIX spectrum unzipping, custom stellar spectrum path expansion. -4. **PR #643: PETSc/SPIDER macOS 26+ fix** (merged 2026-02-23) - CFLAGS quoting fix for RHEL 9 / Rocky Linux. -5. **PR #630: CI improvements** (merged 2026-02-16) - macOS unit tests in PR checks, nightly deduplication, Codecov aggregate coverage. - -**Open PRs**: -- **PR #654**: Zensical documentation style update (branch: `ks/docs`) - Diataxis restructuring, new landing page, custom CSS -- **PR #634**: Albedo feature (branch: `lj/albedo`) - -### Recent Architectural Changes -- **Zalmoxis-SPIDER coupling**: External mesh mode lets Zalmoxis compute density structure, SPIDER evolves thermal state on that grid. Physics-based update triggers (dT/T, dPhi, floor/ceiling). Config: `struct.module = "zalmoxis"`. -- **VULCAN online chemistry**: `atmos_chem/` module runs VULCAN in-loop at every snapshot. Produces `vulcan_