
ecCKD/ecRad radiation platform and validation harness#8

Open
glwagner wants to merge 10 commits into main from glw/ecckd-radiation-platform

glwagner (Member) commented May 21, 2026

Summary

Renames and rebuilds AnalyticBandRadiation.jl into the standalone, ecRad/ecCKD-compatible radiation package NumericalRadiation described in radiative_heating.md: staged runtime API, ecCKD ingestion + forward gas-optics, cloudless/cloud-overlap solvers, cloud + aerosol optics scaffolding, NCDatasets and RRTMGP extensions, and an Artifacts.toml wiring for official ecRad/ecckd data. Adds an executable validation/ audit harness — accuracy gates against ecRad references, reduced-model search/optimizer artifacts, RRTMGP comparison, AD calibration, ecCKD recovery preflights, plus prompt-to-artifact and recovery-goal audits writing paired .json + .md to validation/results/. Removes the in-repo Breeze extension and lands an independent benchmarking/ project here that path-deps onto a developing Breeze checkout for the production H100 benchmark.

Full narrative, status against each gate, and step-by-step reproduction are in PR_WORK_SUMMARY.md.

Package rename history

The Julia package has been renamed twice in this PR. UUID cd8119b0-1744-44d6-9ede-6ad1ad750b26 is stable across both renames, so downstream environments only need to update the package name in their manifests.

  1. AnalyticBandRadiation → Lightflux
  2. Lightflux → NumericalRadiation (current)

The GitHub repository (github.com/NumericalEarth/AnalyticBandRadiation.jl) still carries the original name and has not been renamed yet. The Julia package name in Project.toml is now NumericalRadiation. Source/extension paths:

  • src/NumericalRadiation.jl (was src/Lightflux.jl, was src/AnalyticBandRadiation.jl)
  • ext/NumericalRadiation{NCDatasets,RRTMGP,SpeedyWeather}Ext.jl

Downstream packages with manifests referencing Lightflux or AnalyticBandRadiation under UUID cd8119b0-... need to regenerate or update their manifests to pick up the new name. The Breeze rename plumbing (BreezeLightfluxExt → BreezeNumericalRadiationExt, path-fallback constants in benchmarking/) lands in a separate Breeze commit.
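
A minimal sketch of the downstream-side update, assuming a consumer's Project.toml lists this package under [deps] by one of its old names. The helper `rename_dep` is hypothetical (it is not part of this PR) and uses only Julia's TOML standard library; it keys the rename on the stable UUID rather than on the old name.

```julia
using TOML

# UUID quoted in the rename history above; stable across both renames.
const PKG_UUID = "cd8119b0-1744-44d6-9ede-6ad1ad750b26"

# Hypothetical helper: return a copy of a parsed Project.toml in which any
# [deps] entry carrying `uuid` is re-keyed to `newname`.
function rename_dep(project::AbstractDict, uuid::AbstractString, newname::AbstractString)
    out = deepcopy(project)
    deps = get(out, "deps", Dict{String,Any}())
    for name in collect(keys(deps))        # collect: we mutate while looping
        if deps[name] == uuid && name != newname
            delete!(deps, name)
            deps[newname] = uuid
        end
    end
    return out
end

# Illustrative downstream Project.toml still using the intermediate name.
project = TOML.parse("""
[deps]
Lightflux = "cd8119b0-1744-44d6-9ede-6ad1ad750b26"
""")

updated = rename_dep(project, PKG_UUID, "NumericalRadiation")
TOML.print(updated)
```

After writing the updated file back, running `Pkg.resolve()` in that environment regenerates the Manifest against the new name. Any [compat] entry keyed by the old name would need the same re-keying; the sketch does not handle that.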

Headline figures

Regenerated from committed artifacts by figures/make_pr_figures.jl.

Accuracy: official 32×32 ecCKD matches ecRad on the clear-sky tropical column

ecCKD 32×32 vs ecRad flux profiles

LW flux RMSE ≈ 0.003 W m⁻², SW flux RMSE ≈ 0.007 W m⁻², heating-rate RMSE ≈ 0.006 K day⁻¹. Hard ecCKD cloudless gate passes.
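
For readers unfamiliar with the two metrics, here is a hedged sketch of how a flux RMSE and a heating-rate profile can be computed from interface fluxes. It is not the code in src/metrics.jl; the function names, the dry-air heat capacity, and the sign convention (net flux positive downward, pressure increasing toward the surface) are assumptions made for illustration.

```julia
const GRAVITY = 9.80665   # m s^-2
const CP_AIR  = 1004.0    # J kg^-1 K^-1, dry air (assumed value)

# Root-mean-square error between two profiles of equal length.
rmse(a, b) = sqrt(sum(abs2, a .- b) / length(a))

# Layer heating rate in K day^-1 from the net downward flux (W m^-2) and
# pressure (Pa) at layer interfaces, both ordered top to bottom.
# dT/dt = -(g / cp) * dF_net_down / dp
function heating_rate_K_per_day(net_down, p)
    dF = diff(net_down)
    dp = diff(p)
    return -(GRAVITY / CP_AIR) .* (dF ./ dp) .* 86400
end

# One layer absorbing 10 W m^-2 across 100 hPa.
h = heating_rate_K_per_day([100.0, 90.0], [10_000.0, 20_000.0])
println(h)
println(rmse([1.0, 2.0], [1.0, 4.0]))
```

A layer that receives more net downward flux at its top than leaves at its bottom warms, so the example layer returns a positive heating rate.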

Performance: H100 speedup across published ecCKD k-models at production scale

H100 speedup bar chart

Measured by the new independent benchmarking/ project (path-deps onto the dev Breeze checkout) through Breeze's update_radiation! call surface. Both bars use post-warmup medians from 5 samples on a single H100, running the same 512×512×128 / 262,144-column RCEMIP-style workload.
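
The "post-warmup median" protocol can be sketched as follows. This is an illustrative timing loop, not the benchmarking/ project's code; `post_warmup_median_ms` is a hypothetical name, and a real GPU measurement would also need a device synchronization inside the timed call so that asynchronous kernel launches are included.

```julia
using Statistics

# Hypothetical helper: run `f` a few times untimed so compilation and
# first-call overhead are excluded, then time `samples` calls and return
# the median in milliseconds together with the raw samples.
function post_warmup_median_ms(f; samples=5, warmup=1)
    for _ in 1:warmup
        f()
    end
    times = Float64[]
    for _ in 1:samples
        t0 = time_ns()
        f()
        push!(times, (time_ns() - t0) / 1e6)
    end
    return median(times), times
end

# Stand-in workload; the real benchmark times Breeze's update_radiation!.
med, times = post_warmup_median_ms(() -> sum(rand(1000)); samples=5)
println("median = ", med, " ms over ", length(times), " samples")
```

Using the median rather than the mean keeps a single slow sample from shifting the reported number.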

k-model                                    RH median [ms]   RRTMGP median [ms]   Speedup
32 LW × 32 SW (fsck-32b × rgb-32b)                  794.7               5863.3      7.4×
32 LW × 64 SW (fsck-32b × window-64b)              1047.3               5864.2      5.6×
32 LW × 96 SW (fsck-32b × vfine-96b)               1279.3               5867.7      4.6×
64 LW × 32 SW (narrow-64b × rgb-32b)               1352.7               5869.8      4.3×
64 LW × 64 SW (narrow-64b × window-64b)            1603.6               5870.1      3.7×
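
The speedup column is the ratio of the two medians. The short check below recomputes it from the timings in the table and compares each row against the 4× gate; the variable names are illustrative only.

```julia
# (label, RH median ms, RRTMGP median ms), copied from the table above.
rows = [
    ("32 LW × 32 SW",  794.7, 5863.3),
    ("32 LW × 64 SW", 1047.3, 5864.2),
    ("32 LW × 96 SW", 1279.3, 5867.7),
    ("64 LW × 32 SW", 1352.7, 5869.8),
    ("64 LW × 64 SW", 1603.6, 5870.1),
]

gate = 4.0   # required speedup over RRTMGP

speedups = [(label, rrtmgp / rh) for (label, rh, rrtmgp) in rows]

for (label, s) in speedups
    status = s >= gate ? "pass" : "below gate"
    println(rpad(label, 16), round(s; digits=1), "×  ", status)
end
```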

RRTMGP is essentially constant — it runs its own 256-LW × 224-SW lookup independent of the ecCKD k-counts, so the dashed baseline is genuinely one number. RadiativeHeating scales sub-linearly with total g-points thanks to the new streaming kernel in BreezeRadiativeHeatingExt._tabulated_ecckd_streaming_radiation!, which fuses g-point optical depth into the column transport loop and never materializes (ngpt, Nx, Ny, Nz) four-dimensional intermediates (which at 512×512×128 × 96 g-points would have been ~26 GiB per buffer). The 64 LW × 64 SW row at 3.7× is just below the 4× gate at this scale — an honest reading of where the 64-g production path stands today.
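
As a rough check on the size of the intermediate the streaming kernel avoids, the arithmetic below sizes one (ngpt, Nx, Ny, Nz) buffer at 96 g-points. The element type is an assumption: with Float64 it comes to about 25.8 GB (24 GiB), and Float32 would halve that.

```julia
ngpt, Nx, Ny, Nz = 96, 512, 512, 128

nelements = ngpt * Nx * Ny * Nz            # entries in one 4-D buffer
bytes64   = nelements * sizeof(Float64)    # assumed Float64 storage
bytes32   = nelements * sizeof(Float32)    # alternative: Float32 storage

println("elements:   ", nelements)
println("Float64:    ", bytes64 / 1e9, " GB = ", bytes64 / 2^30, " GiB")
println("Float32:    ", bytes32 / 1e9, " GB = ", bytes32 / 2^30, " GiB")
```

Several such buffers (optical depth, single-scattering albedo, and so on) would be live at once in a non-streaming formulation, which is why fusing the g-point loop into the column transport matters at this grid size.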

A more recent fully-coupled H100 sweep (post-rename) is recorded in HANDOFF.md R-10: 164.4 ms/step at 100×100×74 and 2457.2 ms/step at 512×512×128, both with radiation scheduled at IterationInterval(1). That's 5.5–7.2× faster than RRTMGP coupled at the same cadence, and 52–57× slower than dynamics-only — the latter ratio is the target of the in-flight performance campaign scoped by HANDOFF.md R-7.

Training: Reactant + Enzyme calibration of ecCKD coefficients

Reactant/Enzyme training curves

Left: 13-epoch finite-difference / Enzyme-checked gradient descent on a 4-parameter toy ecCKD fixed-topology fixture — loss × 0.006, flux RMSE × 0.077, heating-rate RMSE × 0.178. Right: 8 epochs of Reactant-compiled, Enzyme reverse-mode gradient descent against the package-native RRTMGP shortwave loss for a 48-parameter 16-g model.
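
The "gradient-checked descent" pattern on the left panel can be illustrated with a self-contained toy. This sketch does not call Enzyme or Reactant and is not the validation script: it uses a hypothetical 4-parameter quadratic loss with a hand-written gradient standing in for the AD gradient, checks that gradient against central finite differences, and then runs 13 epochs of plain gradient descent.

```julia
# Hypothetical toy loss and its analytic gradient (stand-in for an AD gradient).
toy_loss(θ, target) = sum(abs2, θ .- target) / 2
toy_grad(θ, target) = θ .- target

# Central finite-difference gradient of a scalar function f at θ.
function fd_gradient(f, θ; h=1e-6)
    g = similar(θ)
    for i in eachindex(θ)
        e = zeros(eltype(θ), length(θ))
        e[i] = h
        g[i] = (f(θ .+ e) - f(θ .- e)) / (2h)
    end
    return g
end

# Plain gradient descent, recording the loss at the start of each epoch.
function descend(θ0, target; lr=0.5, epochs=13)
    θ = copy(θ0)
    history = Float64[]
    for _ in 1:epochs
        push!(history, toy_loss(θ, target))
        θ .-= lr .* toy_grad(θ, target)
    end
    return θ, history
end

target = [1.0, -2.0, 0.5, 3.0]
θ0 = zeros(4)

fd = fd_gradient(θ -> toy_loss(θ, target), θ0)
println("max gradient mismatch: ", maximum(abs.(fd .- toy_grad(θ0, target))))

θ, history = descend(θ0, target)
println("loss ratio after descent: ", history[end] / history[1])
```

The finite-difference comparison only validates the gradient at the points where it is evaluated; it says nothing about how well the descent converges on a non-quadratic loss.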

What's left: the reduced 16-g hard gate

Reduced 16-g optimizer descent

Worst boundary-forcing error after each accepted optimizer move in the greedy / constrained-table / slot-blend / weight-refit chain. The error rises from 0.0042 W m⁻² at the 32×32 baseline to ≈7.2 W m⁻² for naive 16-g, then falls to ≈2.14 W m⁻² after the full chain, which is still ≈7× above the 0.30 W m⁻² hard gate.
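
A hedged sketch of the gate arithmetic follows. The definition used here is an assumption, not the harness's: "worst boundary-forcing error" is taken as the largest absolute flux error at the top-of-atmosphere and surface rows over all test cases, and the gate is treated as a strict less-than comparison. The function names and sample values are hypothetical.

```julia
const HARD_GATE = 0.30   # W m^-2, threshold quoted in the text

# Assumed layout: rows are levels (row 1 = TOA, last row = surface),
# columns are test cases. Only the two boundary rows are scored.
function worst_boundary_error(reference::AbstractMatrix, model::AbstractMatrix)
    err = abs.(model .- reference)
    return max(maximum(err[1, :]), maximum(err[end, :]))
end

passes_gate(worst; gate=HARD_GATE) = worst < gate

# Made-up fluxes: large interior error, small boundary errors.
reference = [300.0 310.0; 200.0 205.0; 100.0 120.0]
model     = reference .+ [0.1 -0.2; 5.0 5.0; -0.05 0.25]

worst = worst_boundary_error(reference, model)
println("worst boundary error: ", worst, "  passes: ", passes_gate(worst))

# Ratio quoted in the caption: current best 16-g chain vs. the gate.
println("2.14 / 0.30 = ", round(2.14 / HARD_GATE; digits=1))
```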

Acceptance gate status

Gate                                                    Status
ecRad parity (full + reduced)                           partial: full 32×32 passes hard cloudless and all-sky IFS gates; reduced 16-g shortwave still fails
Reduced vs. RRTMGP on representative states             partial: direct package-native RRTMGP comparison emitted; reduced 16-g still fails hard accuracy
Dynamic Breeze integration + ≥4× H100 RRTMGP speedup    partial: 32×32, 32×64, 32×96, 64×32 pass the 4× gate at 512×512×128 / 262K cols; 64×64 lands at 3.7× and is the next perf-work target
Reactant/Enzyme original-objective ecCKD recovery       blocked: teacher-student recovery passes for all six published definitions; exact recovery is gated on locally derived ecCKD CKDMIP training flux products, generation in progress under Slurm

Notable code changes

  • src/runtime_interfaces.jl, src/abstract_types.jl: staged optics/solver/backend API.
  • src/io/ecckd_definition.jl, src/gas_optics/ecckd_forward.jl: official ecCKD schema + tabulated forward gas-optics.
  • src/io/cloud_scattering.jl, src/solvers/cloud_optics.jl, src/solvers/cloud{less,overlap}_{longwave,shortwave}.jl: cloud/aerosol scaffolding + solvers.
  • src/metrics.jl: shared validation metrics + thresholds.
  • ext/NumericalRadiationNCDatasetsExt.jl, ext/NumericalRadiationRRTMGPExt.jl: NetCDF loaders + RRTMGP comparison.
  • Artifacts.toml: pinned lazy ecrad_data and ecckd_source artifacts.
  • validation/: ~50 audit scripts + paired JSON/MD results.
  • benchmarking/: independent H100 benchmark suite (env-controlled, path-deps onto developing Breeze).
  • figures/: PR-narrative figure generator + four PNGs.
  • ext/AnalyticBandRadiationBreezeExt.jl deleted; per-package Breeze extension now lives in the Breeze checkout as BreezeNumericalRadiationExt (rename in flight there).

Test plan

  • julia --project=. -e 'using Pkg; Pkg.test()'
  • julia --project=test validation/recovery_goal_audit.jl — expect not_complete until the live derived-flux Slurm job finishes.
  • julia --project=test validation/ecrad_accuracy_gate.jl + validation/ecrad_all_sky_ifs_gate.jl — hard ecRad gates on official ecCKD.
  • julia --project=figures figures/make_pr_figures.jl — regenerate the four PR figures.
  • (H100) sbatch benchmarking/h100_kmodel_coverage.sbatch — reproduce the cross-k-model H100 sweep at 512×512×128.
  • (Heavy, optional) PR_WORK_SUMMARY.md §4.3 for the live ecCKD derived-flux generation.

🤖 Generated with Claude Code

glwagner and others added 7 commits May 21, 2026 00:24
This is a large branch that turns AnalyticBandRadiation.jl into the
standalone, ecRad/ecCKD-compatible radiation package described in
`radiative_heating.md`. See `PR_WORK_SUMMARY.md` for the full narrative
and reproduction instructions.

Highlights:

- Staged runtime API (`runtime_interfaces.jl`, abstract optics/solver/backend
  types) sits beside the existing `RadiativeTransferColumn` path so host
  models can stop at any layer.
- ecCKD ingestion (`io/ecckd_definition.jl`, `gas_optics/ecckd_forward.jl`),
  artifact-backed official ecRad + ecckd source resolution (`Artifacts.toml`),
  and NCDatasets-backed loaders behind `AnalyticBandRadiationNCDatasetsExt`.
- Cloudless and cloud-overlap longwave/shortwave solvers, cloud and aerosol
  optics scaffolding, ecRad cloud scattering table ingestion and g-point
  mapping.
- Optional `AnalyticBandRadiationRRTMGPExt` for direct package-native RRTMGP
  comparisons through `ColumnAtmosphere`/`RadiativeFluxes`.
- Validation metrics (`metrics.jl`) and an extensive `validation/` audit
  harness writing paired `.json` + `.md` artifacts under
  `validation/results/`, plus prompt-to-artifact and recovery-goal audits.
- Tests for every new surface (solvers, cloud/aerosol optics, ecCKD schema,
  RRTMGP comparison, every validation script, the full goal/recovery audits).
- Docs (Architecture, ecCKD files, CKDMIP training data) and a single-column
  staged example.
- CPU benchmark scaffold for the staged runtime path.
- Removed the in-repo Breeze extension; Breeze now owns
  `BreezeRadiativeHeatingExt` in a dedicated checkout, as recorded by the
  goal-audit artifacts.

Goal acceptance: cloudless ecRad gate, all-sky IFS gate, and the dedicated
Breeze H100 ≥4x RRTMGP performance gate pass. Exact Reactant/Enzyme
original-objective ecCKD recovery is blocked on locally derived ecCKD
CKDMIP training flux products (`5gas-*` LW, `rel-*` LW/SW); generation is
running under Slurm. See `PR_WORK_SUMMARY.md` §4.3 for reproduction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed descent)

`figures/make_pr_figures.jl` regenerates four PNGs from the committed
artifacts and dedicated-Breeze JSONs:

- fig1_ecckd_vs_ecrad_profiles.png: official 32x32 ecCKD vs ecRad
  up/down LW + SW flux profiles and heating rate on the clear-sky
  tropical column. LW flux RMSE ~0.003 W m-2, SW flux RMSE
  ~0.007 W m-2, heating-rate RMSE ~0.006 K day-1.
- fig2_h100_speedup.png: H100 RCEMIP-style 32x32x64 / 1024-column
  median radiation-update timing for RadiativeHeating.jl vs RRTMGP.jl
  across three k-model configurations (32x32 validated ecCKD, 32x16,
  16x16). Speedups 31.3x / 29.8x / 27.0x.
- fig3_training_curves.png: Reactant/Enzyme calibration. Left: 13-epoch
  toy fixed-topology ecCKD recovery loss decay (Enzyme gradient-checked).
  Right: 8 epochs of Reactant-compiled, Enzyme reverse-mode gradient
  descent on a 48-parameter 16-g RRTMGP-target shortwave loss.
- fig4_reduced_descent.png: worst boundary-forcing error vs accepted
  optimizer moves for the reduced 16-g shortwave chain, with the
  0.3 W m-2 hard-gate threshold marked. Drops from 7.18 W m-2 at the
  naive 16-g start to 2.14 W m-2 (current best) - still ~7x above
  the gate.

Figures live in figures/ with their own Project.toml. PR_WORK_SUMMARY.md
embeds them inline. figures/Manifest.toml is gitignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…del coverage

Two real issues in the original plot:

1. The three RRTMGP bars implied three RRTMGP configurations, but RRTMGP
   is the same 256 LW x 224 SW lookup in every run. The variation came
   from samples=1 vs samples=3 across the production benchmark and two
   reduced-pareto scaffolds (the single-shot scaffolds include first-call
   compile/warmup overhead in the RRTMGP timing).

2. The plot suggested we have measured H100 timings across "all" reduced
   ecCKD k-models, but in fact only 32x32 (validated ecCKD, production)
   has a defensible measurement. 32x16 and 16x16 are fixed-coefficient
   scaffolds (gas_model_source=missing). Published 64-g and 96-g ecCKD
   models in our artifact inventory have no H100 timing yet.

Redrawn as two panels:
- Left: production head-to-head, RadiativeHeating 32x32 validated ecCKD
  (7.79 ms, samples=3) vs the single RRTMGP baseline (244 ms, samples=3),
  31.3x speedup.
- Right: RadiativeHeating timing across ecCKD k-models, all referenced
  to the same 244 ms RRTMGP baseline as a horizontal dashed line.
  Bars are coloured by provenance: validated ecCKD (samples=3),
  fixed-coeff scaffold (samples=1), published ecCKD model with no
  H100 timing yet (grey). With the corrected baseline the scaffold
  speedups land at 13.3x (32x16) and 11.7x (16x16) rather than the
  warmup-inflated 29.8x / 27.0x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the inflated cross-baseline speedups (29.8x, 27.0x) with the
honest comparison against the one defensible RRTMGP baseline (244 ms,
n=3, post-warmup), and call out the published 64-g / 96-g ecCKD models
that have no H100 timing yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stands up benchmarking/ as a self-contained environment that path-deps on
this repo's AnalyticBandRadiation.jl and a developing Breeze checkout, so
the H100 RCEMIP benchmark can co-evolve with both repos without going
through Breeze's workspace machinery.

The benchmark exercises Breeze's update_radiation! entry point (so it
measures the production call path) and writes JSON in the schema
figures/make_pr_figures.jl reads.

H100 sweep at 512x512x128 / 262,144 columns, samples=5 post-warmup
median, validated ecCKD provenance for all five published k-model
combinations:

  k-model                                       RH ms   RRTMGP ms   speedup
  32 LW × 32 SW  (fsck-32b   × rgb-32b)          794.7    5863.3       7.4x
  32 LW × 64 SW  (fsck-32b   × window-64b)      1047.3    5864.2       5.6x
  32 LW × 96 SW  (fsck-32b   × vfine-96b)       1279.3    5867.7       4.6x
  64 LW × 32 SW  (narrow-64b × rgb-32b)         1352.7    5869.8       4.3x
  64 LW × 64 SW  (narrow-64b × window-64b)      1603.6    5870.1       3.7x

RRTMGP is essentially constant because it runs its own 256-LW / 224-SW
k-table regardless of the ecCKD k-counts. RadiativeHeating scales
sub-linearly with g-point count under the streaming-kernel rewrite that
just landed in BreezeRadiativeHeatingExt (no (ngpt, Nx, Ny, Nz)
intermediate optical-property buffers).

The figure now has a single defensible RRTMGP baseline (5863.3 ms,
samples=5, post-warmup) instead of three warmup-contaminated bars, and
all five published ecCKD k-models are measured rather than four shown
as "pending".

figures/make_pr_figures.jl prefers the local benchmarking/results/
artifacts and falls back to the dedicated Breeze checkout for runs not
re-measured under the streaming kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-updated by the running Slurm job that generates the 18 derived ecCKD
training flux products and re-runs the preflight + recovery audits. No
code changes; just refreshes RUNNING_REVIEW.md and the
ckdmip_training_data_preflight, ecckd_derived_flux_generation_plan, and
recovery_goal_audit JSON/MD snapshots.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renames the package, module, and extension surfaces:

  package: AnalyticBandRadiation -> Lightflux
  module:  module AnalyticBandRadiation -> module Lightflux
  ext:     AnalyticBandRadiationNCDatasetsExt    -> LightfluxNCDatasetsExt
           AnalyticBandRadiationRRTMGPExt        -> LightfluxRRTMGPExt
           AnalyticBandRadiationSpeedyWeatherExt -> LightfluxSpeedyWeatherExt

`using AnalyticBandRadiation` / `import AnalyticBandRadiation` are
updated everywhere in src/, ext/, test/, validation/, benchmark/,
benchmarking/, docs/, examples/, figures/, and the top-level markdown
notes.

GitHub URL references in the README, docs, and PR_WORK_SUMMARY are kept
literal because the GitHub repo itself is still
`NumericalEarth/AnalyticBandRadiation.jl` (the on-disk directory is also
unchanged). The `Project.toml` `name` field is now `"Lightflux"`, so the
package loads as `using Lightflux` regardless of the directory.

The companion BreezeRadiativeHeatingExt rename in the dev Breeze checkout
lands in a separate commit on branch glw/streaming-ecckd-gpoints in
NumericalEarth/Breeze.jl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
glwagner and others added 3 commits May 22, 2026 19:54
- Live audit/preflight artifacts now reflect the completed CKDMIP derived-
  flux generation: ckdmip_training_data_preflight is
  ready_for_original_ecckd_objective, ecckd_objective_reconstruction_check
  is ready_to_reconstruct_original_objective, recovery_goal_audit has
  blocked=0, partial=3.
- New validation scripts and tests for the ecCKD reduction work:
  leave-one-out heating residual localization, heating table optimizer,
  refit breakdown, single-table move scan, weight coordinate descent
  (+ continuation, + scan, + boundary polish), published model accuracy,
  and training recovery targets.
- Updated validation scripts (band accuracy pareto, published training
  manifest, recovery goal audit, official ecCKD training, reduced ecCKD
  RRTMGP comparison, reduced ecCKD accuracy) and their paired tests.
- PR_WORK_SUMMARY.md and RUNNING_REVIEW.md updated with the unblock work
  and ongoing review notes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…gate

- Original-objective recovery scaffolding: ckdmip_original_objective_dataset,
  ckdmip_original_objective_ad_batch, ecckd_original_objective_terms,
  ecckd_original_objective_loss; with paired tests and result artifacts.
  Sets up the AD-batched optimizer probe against the captured published
  objective terms.
- Published-model recovery vector path:
  ecckd_published_recovery_target, ecckd_published_recovery_vector, and
  ecckd_published_recovery_vector_training -- analytic quadratic descent
  in log-vector space, 204896 params, final loss 1.6e-17, log-RMSE 1.6e-10.
- Published all-sky accuracy gate (ecckd_published_all_sky_accuracy): all
  six promoted official ecCKD combinations (32x32 / 32x64 / 32x96 /
  64x32 / 64x64 / 64x96) pass under matched Tripleclouds/aerosol config.
- ecRad reference materialization extended to the same six combinations:
  validation/reference/ecrad/ecckd_{32x64,32x96,64x32,64x64,64x96}_*.nc
  for all_sky_tropical_column, clear_sky_tropical_column, and
  rcemip_style_column_subset.
- ecRad reference optics solver gap + 32x64 gate (paired test +
  result artifacts).
- Matched reference plan artifact (ecckd_matched_reference_plan) records
  which ecCKD/ecRad combinations are inventoried and recovered.
- Updated audit artifacts: recovery_goal_audit gains 10+ new sub-statuses
  (objective_terms_captured, dataset_samples_ready, optimizer_batch_ready,
  published_recovery_target_ready, published_recovery_vector_training,
  candidate_objective_score_ready, table_writeback_* probes, written_
  coordinate_descent_improved). Best reduced candidate so far: 32x31
  (63 g-points), worst boundary forcing 0.2998 W m-2, passing the 0.30
  hard threshold for the first time.
- New REVIEW_HANDOFF.md summarizes the current state for the next operator.
- RUNNING_REVIEW.md and PR_WORK_SUMMARY.md updated with the unblock /
  optimization progress and ongoing review notes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Second rename in this PR's history. UUID `cd8119b0-1744-44d6-9ede-6ad1ad750b26`
unchanged across both renames so downstream consumers only need to update the
package name in their manifests.

Chain:

1. `AnalyticBandRadiation` → `Lightflux`
2. `Lightflux` → `NumericalRadiation` (this commit)

Touchpoints (66 files, +197/-192 lines):

- `src/Lightflux.jl` → `src/NumericalRadiation.jl` (file + module decl)
- `ext/Lightflux{NCDatasets,RRTMGP,SpeedyWeather}Ext.jl` →
  `ext/NumericalRadiation*Ext.jl` (file + module decl + `using`)
- `Project.toml`: name + `[extensions]` stanza targets
- `docs/Project.toml`, `docs/make.jl`, all `docs/src/**/*.md` references
- `examples/`, `test/`, `validation/`, `benchmark/`, `benchmarking/`:
  `using` and `Base.get_extension(NumericalRadiation, :NumericalRadiation*Ext)`
- User prose: `README.md`, `radiative_heating.md`, `PR_WORK_SUMMARY.md`
- `.gitignore`: add `.claude/` + `.handoff_monitor.*` (local session state)

All `Manifest.toml` files are gitignored; they regenerate on next
`Pkg.instantiate()` against the renamed `Project.toml`.

Deferred (NOT in this commit):

- On-disk directory `AnalyticBandRadiation.jl/` → `NumericalRadiation.jl/`
  (must be done outside this session; changes CWD).
- Breeze downstream rename: 12 files referencing `Lightflux` plus path-fallback
  constants and the `BreezeLightfluxExt` extension at
  `BreezeRadiativeHeatingDev/Breeze.jl/`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>