ecCKD/ecRad radiation platform and validation harness#8
Open
glwagner wants to merge 10 commits into
This is a large branch that turns AnalyticBandRadiation.jl into the standalone, ecRad/ecCKD-compatible radiation package described in `radiative_heating.md`. See `PR_WORK_SUMMARY.md` for the full narrative and reproduction instructions.
Highlights:
- Staged runtime API (`runtime_interfaces.jl`, abstract optics/solver/backend types) sits beside the existing `RadiativeTransferColumn` path so host models can stop at any layer.
- ecCKD ingestion (`io/ecckd_definition.jl`, `gas_optics/ecckd_forward.jl`), artifact-backed official ecRad + ecckd source resolution (`Artifacts.toml`), and NCDatasets-backed loaders behind `AnalyticBandRadiationNCDatasetsExt`.
- Cloudless and cloud-overlap longwave/shortwave solvers, cloud and aerosol optics scaffolding, ecRad cloud scattering table ingestion and g-point mapping.
- Optional `AnalyticBandRadiationRRTMGPExt` for direct package-native RRTMGP comparisons through `ColumnAtmosphere`/`RadiativeFluxes`.
- Validation metrics (`metrics.jl`) and an extensive `validation/` audit harness writing paired `.json` + `.md` artifacts under `validation/results/`, plus prompt-to-artifact and recovery-goal audits.
- Tests for every new surface (solvers, cloud/aerosol optics, ecCKD schema, RRTMGP comparison, every validation script, the full goal/recovery audits).
- Docs (Architecture, ecCKD files, CKDMIP training data) and a single-column staged example.
- CPU benchmark scaffold for the staged runtime path.
- Removed the in-repo Breeze extension; Breeze now owns `BreezeRadiativeHeatingExt` in a dedicated checkout, as recorded by the goal-audit artifacts.
Goal acceptance: cloudless ecRad gate, all-sky IFS gate, and the dedicated Breeze H100 ≥4x RRTMGP performance gate pass. Exact Reactant/Enzyme original-objective ecCKD recovery is blocked on locally derived ecCKD CKDMIP training flux products (`5gas-*` LW, `rel-*` LW/SW); generation is running under Slurm. See `PR_WORK_SUMMARY.md` §4.3 for reproduction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed descent)
`figures/make_pr_figures.jl` regenerates four PNGs from the committed artifacts and dedicated-Breeze JSONs:
- fig1_ecckd_vs_ecrad_profiles.png: official 32x32 ecCKD vs ecRad up/down LW + SW flux profiles and heating rate on the clear-sky tropical column. LW flux RMSE ~0.003 W m-2, SW flux RMSE ~0.007 W m-2, heating-rate RMSE ~0.006 K day-1.
- fig2_h100_speedup.png: H100 RCEMIP-style 32x32x64 / 1024-column median radiation-update timing for RadiativeHeating.jl vs RRTMGP.jl across three k-model configurations (32x32 validated ecCKD, 32x16, 16x16). Speedups 31.3x / 29.8x / 27.0x.
- fig3_training_curves.png: Reactant/Enzyme calibration. Left: 13-epoch toy fixed-topology ecCKD recovery loss decay (Enzyme gradient-checked). Right: 8 epochs of Reactant-compiled, Enzyme reverse-mode gradient descent on a 48-parameter 16-g RRTMGP-target shortwave loss.
- fig4_reduced_descent.png: worst boundary-forcing error vs accepted optimizer moves for the reduced 16-g shortwave chain, with the 0.3 W m-2 hard-gate threshold marked. Drops from 7.18 W m-2 at the naive 16-g start to 2.14 W m-2 (current best), still ~7x above the gate.
Figures live in figures/ with their own Project.toml. PR_WORK_SUMMARY.md embeds them inline. figures/Manifest.toml is gitignored.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…del coverage
Two real issues in the original plot:
1. The three RRTMGP bars implied three RRTMGP configurations, but RRTMGP is the same 256 LW x 224 SW lookup in every run. The variation came from samples=1 vs samples=3 across the production benchmark and two reduced-pareto scaffolds (the single-shot scaffolds include first-call compile/warmup overhead in the RRTMGP timing).
2. The plot suggested we have measured H100 timings across "all" reduced ecCKD k-models, but in fact only 32x32 (validated ecCKD, production) has a defensible measurement. 32x16 and 16x16 are fixed-coefficient scaffolds (gas_model_source=missing). Published 64-g and 96-g ecCKD models in our artifact inventory have no H100 timing yet.
Redrawn as two panels:
- Left: production head-to-head, RadiativeHeating 32x32 validated ecCKD (7.79 ms, samples=3) vs the single RRTMGP baseline (244 ms, samples=3), 31.3x speedup.
- Right: RadiativeHeating timing across ecCKD k-models, all referenced to the same 244 ms RRTMGP baseline as a horizontal dashed line. Bars are coloured by provenance: validated ecCKD (samples=3), fixed-coeff scaffold (samples=1), published ecCKD model with no H100 timing yet (grey).
With the corrected baseline the scaffold speedups land at 13.3x (32x16) and 11.7x (16x16) rather than the warmup-inflated 29.8x / 27.0x.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the inflated cross-baseline speedups (29.8x, 27.0x) with the honest comparison against the one defensible RRTMGP baseline (244 ms, n=3, post-warmup), and call out the published 64-g / 96-g ecCKD models that have no H100 timing yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stands up benchmarking/ as a self-contained environment that path-deps on this repo's AnalyticBandRadiation.jl and a developing Breeze checkout, so the H100 RCEMIP benchmark can co-evolve with both repos without going through Breeze's workspace machinery. The benchmark exercises Breeze's update_radiation! entry point (so it measures the production call path) and writes JSON in the schema figures/make_pr_figures.jl reads.
H100 sweep at 512x512x128 / 262,144 columns, samples=5 post-warmup median, validated ecCKD provenance for all five published k-model combinations:
| k-model | RH ms | RRTMGP ms | speedup |
|---|---|---|---|
| 32 LW × 32 SW (fsck-32b × rgb-32b) | 794.7 | 5863.3 | 7.4x |
| 32 LW × 64 SW (fsck-32b × window-64b) | 1047.3 | 5864.2 | 5.6x |
| 32 LW × 96 SW (fsck-32b × vfine-96b) | 1279.3 | 5867.7 | 4.6x |
| 64 LW × 32 SW (narrow-64b × rgb-32b) | 1352.7 | 5869.8 | 4.3x |
| 64 LW × 64 SW (narrow-64b × window-64b) | 1603.6 | 5870.1 | 3.7x |
RRTMGP is essentially constant because it runs its own 256-LW / 224-SW k-table regardless of the ecCKD k-counts. RadiativeHeating scales sub-linearly with g-point count under the streaming-kernel rewrite that just landed in BreezeRadiativeHeatingExt (no (ngpt, Nx, Ny, Nz) intermediate optical-property buffers).
The figure now has a single defensible RRTMGP baseline (5863.3 ms, samples=5, post-warmup) instead of three warmup-contaminated bars, and all five published ecCKD k-models are measured rather than four shown as "pending". figures/make_pr_figures.jl prefers the local benchmarking/results/ artifacts and falls back to the dedicated Breeze checkout for runs not re-measured under the streaming kernel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
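The speedup column and the ≥4x gate follow directly from the two timing columns, and the size of a dense (ngpt, Nx, Ny, Nz) buffer is simple arithmetic. The sketch below recomputes both from the numbers in this commit message. The names `sweep`, `speedup`, `GATE`, and `buffer_gib` are illustrative and not part of the package, and the 8-byte element type is an assumption.

```julia
# Timings copied from the sweep in this commit message (ms, post-warmup medians).
sweep = [
    (label = "32 LW x 32 SW", rh_ms = 794.7,  rrtmgp_ms = 5863.3),
    (label = "32 LW x 64 SW", rh_ms = 1047.3, rrtmgp_ms = 5864.2),
    (label = "32 LW x 96 SW", rh_ms = 1279.3, rrtmgp_ms = 5867.7),
    (label = "64 LW x 32 SW", rh_ms = 1352.7, rrtmgp_ms = 5869.8),
    (label = "64 LW x 64 SW", rh_ms = 1603.6, rrtmgp_ms = 5870.1),
]

# Speedup is the RRTMGP time divided by the RadiativeHeating time.
speedup(row) = row.rrtmgp_ms / row.rh_ms

# The performance gate named in the PR: at least 4x faster than RRTMGP.
const GATE = 4.0

for row in sweep
    s = speedup(row)
    status = s >= GATE ? "pass" : "below gate"
    println(rpad(row.label, 16), round(s; digits = 1), "x  ", status)
end

# Size in GiB of one dense (ngpt, Nx, Ny, Nz) buffer.
# Assumption: 8-byte (Float64) elements; pass T = Float32 to halve it.
buffer_gib(ngpt, Nx, Ny, Nz; T = Float64) = ngpt * Nx * Ny * Nz * sizeof(T) / 2^30

println("96 g-points at 512x512x128: ", buffer_gib(96, 512, 512, 128), " GiB")
```

Under that element-size assumption, a single 96-g-point buffer on the 512x512x128 grid is about 24 GiB, which is the scale of allocation the streaming kernel avoids.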
Auto-updated by the running Slurm job that generates the 18 derived ecCKD training flux products and re-runs the preflight + recovery audits. No code changes; just refreshes RUNNING_REVIEW.md and the ckdmip_training_data_preflight, ecckd_derived_flux_generation_plan, and recovery_goal_audit JSON/MD snapshots. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renames the package, module, and extension surfaces:
package: AnalyticBandRadiation -> Lightflux
module: module AnalyticBandRadiation -> module Lightflux
ext: AnalyticBandRadiationNCDatasetsExt -> LightfluxNCDatasetsExt
AnalyticBandRadiationRRTMGPExt -> LightfluxRRTMGPExt
AnalyticBandRadiationSpeedyWeatherExt -> LightfluxSpeedyWeatherExt
`using AnalyticBandRadiation` / `import AnalyticBandRadiation` are
updated everywhere in src/, ext/, test/, validation/, benchmark/,
benchmarking/, docs/, examples/, figures/, and the top-level markdown
notes.
GitHub URL references in the README, docs, and PR_WORK_SUMMARY are kept
literal because the GitHub repo itself is still
`NumericalEarth/AnalyticBandRadiation.jl` (the on-disk directory is also
unchanged). The `Project.toml` `name` field is now `"Lightflux"`, so the
package loads as `using Lightflux` regardless of the directory.
The companion BreezeRadiativeHeatingExt rename in the dev Breeze checkout
lands in a separate commit on branch glw/streaming-ecckd-gpoints in
NumericalEarth/Breeze.jl.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Live audit/preflight artifacts now reflect the completed CKDMIP derived-flux generation: ckdmip_training_data_preflight is ready_for_original_ecckd_objective, ecckd_objective_reconstruction_check is ready_to_reconstruct_original_objective, recovery_goal_audit has blocked=0, partial=3.
- New validation scripts and tests for the ecCKD reduction work: leave-one-out heating residual localization, heating table optimizer, refit breakdown, single-table move scan, weight coordinate descent (+ continuation, + scan, + boundary polish), published model accuracy, and training recovery targets.
- Updated validation scripts (band accuracy pareto, published training manifest, recovery goal audit, official ecCKD training, reduced ecCKD RRTMGP comparison, reduced ecCKD accuracy) and their paired tests.
- PR_WORK_SUMMARY.md and RUNNING_REVIEW.md updated with the unblock work and ongoing review notes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…gate
- Original-objective recovery scaffolding: ckdmip_original_objective_dataset,
ckdmip_original_objective_ad_batch, ecckd_original_objective_terms,
ecckd_original_objective_loss; with paired tests and result artifacts.
Sets up the AD-batched optimizer probe against the captured published
objective terms.
- Published-model recovery vector path:
ecckd_published_recovery_target, ecckd_published_recovery_vector, and
ecckd_published_recovery_vector_training -- analytic quadratic descent
in log-vector space, 204896 params, final loss 1.6e-17, log-RMSE 1.6e-10.
- Published all-sky accuracy gate (ecckd_published_all_sky_accuracy): all
six promoted official ecCKD combinations (32x32 / 32x64 / 32x96 /
64x32 / 64x64 / 64x96) pass under matched Tripleclouds/aerosol config.
- ecRad reference materialization extended to the same six combinations:
validation/reference/ecrad/ecckd_{32x64,32x96,64x32,64x64,64x96}_*.nc
for all_sky_tropical_column, clear_sky_tropical_column, and
rcemip_style_column_subset.
- ecRad reference optics solver gap + 32x64 gate (paired test +
result artifacts).
- Matched reference plan artifact (ecckd_matched_reference_plan) records
which ecCKD/ecRad combinations are inventoried and recovered.
- Updated audit artifacts: recovery_goal_audit gains 10+ new sub-statuses
(objective_terms_captured, dataset_samples_ready, optimizer_batch_ready,
published_recovery_target_ready, published_recovery_vector_training,
candidate_objective_score_ready, table_writeback_* probes,
written_coordinate_descent_improved). Best reduced candidate so far: 32x31
(63 g-points), worst boundary forcing 0.2998 W m^-2, passing the
0.30 W m^-2 hard threshold for the first time.
- New REVIEW_HANDOFF.md summarizes the current state for the next operator.
- RUNNING_REVIEW.md and PR_WORK_SUMMARY.md updated with the unblock /
optimization progress and ongoing review notes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
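To make "analytic quadratic descent in log-vector space" concrete, here is a minimal sketch under stated assumptions: the loss is taken to be half the squared distance between the log of the trial coefficients and the log of the published ones, so its gradient is known in closed form. The coefficient values, the step rate, and the function names are made up for illustration; the actual recovery script may define its loss differently.

```julia
# Hypothetical positive "published" coefficients and a perturbed starting guess.
published = [1.0e-4, 3.0e-3, 0.2, 5.0, 40.0]
start = published .* [1.5, 0.7, 2.0, 0.9, 1.2]

# Root-mean-square error measured in log space.
log_rmse(θ, θstar) = sqrt(sum(abs2, θ .- θstar) / length(θ))

function descend_in_log_space(start, published; rate = 0.5, steps = 60)
    θ = log.(start)          # work on log-coefficients so positivity is automatic
    θstar = log.(published)
    for _ in 1:steps
        # Assumed loss: 0.5 * sum((θ - θstar).^2), whose gradient is exactly θ - θstar.
        θ .-= rate .* (θ .- θstar)
    end
    return exp.(θ), log_rmse(θ, θstar)
end

recovered, final_log_rmse = descend_in_log_space(start, published)
println("final log-RMSE: ", final_log_rmse)
```

Because the assumed loss is quadratic in the log-parameters, each step shrinks the error by a fixed factor, so the descent reaches floating-point precision quickly regardless of how many parameters are stacked into the vector.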
Second rename in this PR's history. UUID `cd8119b0-1744-44d6-9ede-6ad1ad750b26`
unchanged across both renames so downstream consumers only need to update the
package name in their manifests.
Chain:
1. `AnalyticBandRadiation` → `Lightflux`
2. `Lightflux` → `NumericalRadiation` (this commit)
Touchpoints (66 files, +197/-192 lines):
- `src/Lightflux.jl` → `src/NumericalRadiation.jl` (file + module decl)
- `ext/Lightflux{NCDatasets,RRTMGP,SpeedyWeather}Ext.jl` →
`ext/NumericalRadiation*Ext.jl` (file + module decl + `using`)
- `Project.toml`: name + `[extensions]` stanza targets
- `docs/Project.toml`, `docs/make.jl`, all `docs/src/**/*.md` references
- `examples/`, `test/`, `validation/`, `benchmark/`, `benchmarking/`:
`using` and `Base.get_extension(NumericalRadiation, :NumericalRadiation*Ext)`
- User prose: `README.md`, `radiative_heating.md`, `PR_WORK_SUMMARY.md`
- `.gitignore`: add `.claude/` + `.handoff_monitor.*` (local session state)
All `Manifest.toml` files are gitignored; they regenerate on next
`Pkg.instantiate()` against the renamed `Project.toml`.
Deferred (NOT in this commit):
- On-disk directory `AnalyticBandRadiation.jl/` → `NumericalRadiation.jl/`
(must be done outside this session; changes CWD).
- Breeze downstream rename: 12 files referencing `Lightflux` plus path-fallback
constants and the `BreezeLightfluxExt` extension at
`BreezeRadiativeHeatingDev/Breeze.jl/`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
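Because the UUID is unchanged, a downstream environment can be matched by UUID and only the name updated. The sketch below illustrates that idea on the `[deps]` table of a hypothetical downstream `Project.toml` using Julia's `TOML` standard library. `rename_dep!` and `HypotheticalDownstream` are invented for illustration; in practice one would more likely let Pkg re-resolve the environment, and `[compat]` entries and `Manifest.toml` would need the same treatment.

```julia
using TOML

# UUID quoted in the commit message; stable across both renames.
const STABLE_UUID = "cd8119b0-1744-44d6-9ede-6ad1ad750b26"

# Rename whichever [deps] entry carries `uuid` to `newname` (hypothetical helper).
function rename_dep!(project::Dict, newname::String; uuid = STABLE_UUID)
    deps = get(project, "deps", Dict{String,Any}())
    for oldname in collect(keys(deps))   # collect first: we mutate deps in the loop
        if deps[oldname] == uuid && oldname != newname
            delete!(deps, oldname)
            deps[newname] = uuid
        end
    end
    return project
end

# A made-up downstream Project.toml that still uses the intermediate name.
downstream = TOML.parse("""
name = "HypotheticalDownstream"

[deps]
Lightflux = "cd8119b0-1744-44d6-9ede-6ad1ad750b26"
""")

rename_dep!(downstream, "NumericalRadiation")
TOML.print(downstream)
```

Matching on the UUID rather than the old name means the same helper handles environments still pinned to either `AnalyticBandRadiation` or `Lightflux`.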
Summary
Renames and rebuilds AnalyticBandRadiation.jl into the standalone, ecRad/ecCKD-compatible radiation package `NumericalRadiation` described in `radiative_heating.md`: staged runtime API, ecCKD ingestion + forward gas-optics, cloudless/cloud-overlap solvers, cloud + aerosol optics scaffolding, NCDatasets and RRTMGP extensions, and an `Artifacts.toml` wiring for official ecRad/ecckd data. Adds an executable `validation/` audit harness (accuracy gates against ecRad references, reduced-model search/optimizer artifacts, RRTMGP comparison, AD calibration, ecCKD recovery preflights, plus prompt-to-artifact and recovery-goal audits) writing paired `.json` + `.md` to `validation/results/`. Removes the in-repo Breeze extension and lands an independent `benchmarking/` project here that path-deps onto a developing Breeze checkout for the production H100 benchmark.
Full narrative, status against each gate, and step-by-step reproduction are in `PR_WORK_SUMMARY.md`.
Package rename history
The Julia package has been renamed twice in this PR. UUID `cd8119b0-1744-44d6-9ede-6ad1ad750b26` is stable across both renames, so downstream environments only need to update the package name in their manifests.
1. `AnalyticBandRadiation` → `Lightflux`
2. `Lightflux` → `NumericalRadiation` (current)
The GitHub repository URL (`github.com/NumericalEarth/AnalyticBandRadiation.jl`) reflects the original name and has not been renamed on GitHub yet. The Julia package name in `Project.toml` is now `NumericalRadiation`. Source/extension paths:
- `src/NumericalRadiation.jl` (was `src/Lightflux.jl`, was `src/AnalyticBandRadiation.jl`)
- `ext/NumericalRadiation{NCDatasets,RRTMGP,SpeedyWeather}Ext.jl`
Downstream packages with manifests referencing `Lightflux` or `AnalyticBandRadiation` under UUID `cd8119b0-...` need to regenerate or update their manifests to pick up the new name. The Breeze rename plumbing (`BreezeLightfluxExt` → `BreezeNumericalRadiationExt`, path-fallback constants in `benchmarking/`) lands in a separate Breeze commit.
Headline figures
Regenerated from committed artifacts by `figures/make_pr_figures.jl`.
Accuracy: official 32×32 ecCKD matches ecRad on the clear-sky tropical column
LW flux RMSE ≈ 0.003 W m⁻², SW flux RMSE ≈ 0.007 W m⁻², heating-rate RMSE ≈ 0.006 K day⁻¹. Hard ecCKD cloudless gate passes.
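For readers unfamiliar with the metric, the RMSE figures above compare a candidate flux profile against a reference profile level by level. A minimal sketch follows; the profile values and the threshold are hypothetical and are not taken from `src/metrics.jl` or from the PR's artifacts.

```julia
# Root-mean-square error between two profiles of equal length.
rmse(a, b) = sqrt(sum(abs2, a .- b) / length(a))

# Hypothetical flux profiles (W m^-2) on five levels; not values from the PR.
reference = [240.10, 251.32, 263.87, 275.40, 289.95]
candidate = reference .+ [0.002, -0.003, 0.004, -0.002, 0.003]

flux_rmse = rmse(candidate, reference)

# Hypothetical acceptance threshold, for illustration only.
threshold = 0.01
passes = flux_rmse <= threshold

println("flux RMSE = ", round(flux_rmse; sigdigits = 2), " W m^-2, passes = ", passes)
```

The same function applies unchanged to heating-rate profiles in K day⁻¹, since only the units of the inputs differ.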
Performance: H100 speedup across published ecCKD k-models at production scale
Measured by the new independent `benchmarking/` project (path-deps onto the dev Breeze checkout) through Breeze's `update_radiation!` call surface. Both bars use post-warmup medians from 5 samples on a single H100 + the same 512×512×128 / 262,144-column RCEMIP-style workload.
| k-model | RadiativeHeating (ms) | RRTMGP (ms) | speedup |
|---|---|---|---|
| 32 LW × 32 SW (`fsck-32b` × `rgb-32b`) | 794.7 | 5863.3 | 7.4× |
| 32 LW × 64 SW (`fsck-32b` × `window-64b`) | 1047.3 | 5864.2 | 5.6× |
| 32 LW × 96 SW (`fsck-32b` × `vfine-96b`) | 1279.3 | 5867.7 | 4.6× |
| 64 LW × 32 SW (`narrow-64b` × `rgb-32b`) | 1352.7 | 5869.8 | 4.3× |
| 64 LW × 64 SW (`narrow-64b` × `window-64b`) | 1603.6 | 5870.1 | 3.7× |
RRTMGP is essentially constant: it runs its own 256-LW × 224-SW lookup independent of the ecCKD k-counts, so the dashed baseline is genuinely one number. RadiativeHeating scales sub-linearly with total g-points thanks to the new streaming kernel in `BreezeRadiativeHeatingExt._tabulated_ecckd_streaming_radiation!`, which fuses g-point optical depth into the column transport loop and never materializes `(ngpt, Nx, Ny, Nz)` four-dimensional intermediates (which at 512×512×128 × 96 g-points would have been roughly 24 GiB, about 25.8 GB, per buffer, assuming 8-byte elements). The 64 LW × 64 SW row at 3.7× is just below the 4× gate at this scale, which is where the 64-g production path stands today.
A more recent fully-coupled H100 sweep (post-rename) is recorded in `HANDOFF.md` R-10: 164.4 ms/step at 100×100×74 and 2457.2 ms/step at 512×512×128, both with radiation scheduled at `IterationInterval(1)`. That is 5.5–7.2× faster than RRTMGP coupled at the same cadence, and 52–57× slower than dynamics-only; the latter ratio is the target of the in-flight performance campaign scoped by `HANDOFF.md` R-7.
Training: Reactant + Enzyme calibration of ecCKD coefficients
Left: 13-epoch finite-difference / Enzyme-checked gradient descent on a 4-parameter toy ecCKD fixed-topology fixture — loss × 0.006, flux RMSE × 0.077, heating-rate RMSE × 0.178. Right: 8 epochs of Reactant-compiled, Enzyme reverse-mode gradient descent against the package-native RRTMGP shortwave loss for a 48-parameter 16-g model.
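The idea of checking a gradient against finite differences before descending on it can be sketched without Enzyme or Reactant. The toy below is not the PR's fixture: it assumes a 4-parameter transmission model `exp(-k*u)` with made-up path amounts, a hand-derived gradient, and a central-difference check, followed by 13 plain gradient-descent epochs.

```julia
# Toy 4-parameter model: transmission exp(-k*u) along four fixed absorber paths.
# Path amounts and target coefficients are invented for illustration.
const U = [0.5, 1.0, 2.0, 4.0]
model(k) = exp.(-k .* U)
const TARGET = model([0.8, 0.4, 0.2, 0.1])

# Sum-of-squares loss and its hand-derived gradient.
loss(k) = sum(abs2, model(k) .- TARGET)
analytic_grad(k) = -2 .* U .* model(k) .* (model(k) .- TARGET)

# Central finite differences, used only to check the analytic gradient.
function central_diff_grad(f, k; h = 1e-6)
    g = similar(k)
    for i in eachindex(k)
        kp = copy(k); km = copy(k)
        kp[i] += h
        km[i] -= h
        g[i] = (f(kp) - f(km)) / (2h)
    end
    return g
end

# Plain gradient descent, recording the loss after every epoch.
function descend(k0; rate = 0.05, epochs = 13)
    k = copy(k0)
    losses = [loss(k)]
    for _ in 1:epochs
        k .-= rate .* analytic_grad(k)
        push!(losses, loss(k))
    end
    return k, losses
end

k0 = fill(0.3, 4)
gradient_gap = maximum(abs, analytic_grad(k0) .- central_diff_grad(loss, k0))
k, losses = descend(k0)

println("max gradient mismatch: ", gradient_gap)
println("loss ratio after 13 epochs: ", losses[end] / losses[1])
```

In the PR the gradient comes from Enzyme rather than a hand derivation; the finite-difference comparison plays the same role in either case, catching a wrong gradient before any training epochs are spent on it.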
What's left: the reduced 16-g hard gate
Worst boundary-forcing error after each accepted optimizer move in the greedy / constrained-table / slot-blend / weight-refit chain. From 0.0042 W m⁻² at the 32×32 baseline up to ≈7.2 W m⁻² for naive 16-g, then down to ≈2.14 W m⁻² after the full chain — still ≈7× above the 0.30 W m⁻² hard gate.
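The "accepted moves" curve can be read as a greedy rule: a proposed move is kept only if it lowers the worst boundary-forcing error. The sketch below applies that rule to a made-up proposal sequence. Only the ≈7.2 start, the ≈2.14 end, and the 0.30 W m⁻² gate come from the text; the intermediate values and the helper name `accepted_history` are illustrative assumptions.

```julia
# Hard gate quoted in the text (W m^-2).
const HARD_GATE = 0.30

# Keep a proposed move only when it lowers the worst boundary-forcing error.
function accepted_history(start_error, proposed_errors)
    history = [start_error]
    for err in proposed_errors
        err < history[end] && push!(history, err)
    end
    return history
end

# Start and final values follow the text; the intermediate proposals are made up.
proposals = [6.1, 6.5, 4.3, 4.9, 3.0, 2.14, 2.6]
history = accepted_history(7.2, proposals)

ratio = history[end] / HARD_GATE
println("accepted errors: ", history)
println("current best is ", round(ratio; digits = 1), "x the gate; passes = ",
        history[end] <= HARD_GATE)
```

With the reported 2.14 W m⁻² as the current best, the ratio to the gate is a little over 7, matching the "≈7× above" statement.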
Acceptance gate status
Notable code changes
- `src/runtime_interfaces.jl`, `src/abstract_types.jl`: staged optics/solver/backend API.
- `src/io/ecckd_definition.jl`, `src/gas_optics/ecckd_forward.jl`: official ecCKD schema + tabulated forward gas-optics.
- `src/io/cloud_scattering.jl`, `src/solvers/cloud_optics.jl`, `src/solvers/cloud{less,overlap}_{longwave,shortwave}.jl`: cloud/aerosol scaffolding + solvers.
- `src/metrics.jl`: shared validation metrics + thresholds.
- `ext/NumericalRadiationNCDatasetsExt.jl`, `ext/NumericalRadiationRRTMGPExt.jl`: NetCDF loaders + RRTMGP comparison.
- `Artifacts.toml`: pinned lazy `ecrad_data` and `ecckd_source` artifacts.
- `validation/`: ~50 audit scripts + paired JSON/MD results.
- `benchmarking/`: independent H100 benchmark suite (env-controlled, path-deps onto developing Breeze).
- `figures/`: PR-narrative figure generator + four PNGs.
- `ext/AnalyticBandRadiationBreezeExt.jl` deleted; per-package Breeze extension now lives in the Breeze checkout as `BreezeNumericalRadiationExt` (rename in flight there).
Test plan
- `julia --project=. -e 'using Pkg; Pkg.test()'`
- `julia --project=test validation/recovery_goal_audit.jl`: expect `not_complete` until the live derived-flux Slurm job finishes.
- `julia --project=test validation/ecrad_accuracy_gate.jl` + `validation/ecrad_all_sky_ifs_gate.jl`: hard ecRad gates on official ecCKD.
- `julia --project=figures figures/make_pr_figures.jl`: regenerate the four PR figures.
- `sbatch benchmarking/h100_kmodel_coverage.sbatch`: reproduce the cross-k-model H100 sweep at 512×512×128.
- `PR_WORK_SUMMARY.md` §4.3 for the live ecCKD derived-flux generation.
🤖 Generated with Claude Code