Skip to content

Conversation

@jleinonen
Copy link
Collaborator

PhysicsNeMo Pull Request

Adds an example of training a temporal interpolation model withphysicsnemo.models.afno.ModAFNO to examples/weather/temporal_interpolation/. See the README.md located there for details.

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

Copy link
Collaborator

@CharlelieLrt CharlelieLrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass on the PR.
Overall looks fine, except a few things that are a little unclear here and there.
My biggest concern is the Trainer. I know this class is used in many of our examples, but here I don't really see the point. I would support having a dedicated abstraction for a trainer only if you expect users to import it into their own codebase and use it in their own external application.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR adds a complete temporal interpolation example to examples/weather/temporal_interpolation/, demonstrating how to train a ModAFNO-based model that increases weather forecast temporal resolution from 6-hour to 1-hour intervals. The implementation includes custom DALI datapipes for efficient HDF5 data loading with random timestep sampling, a latitude-weighted geometric loss function, a modular Trainer class with MLflow/Wandb integration, and comprehensive configuration files. The example is designed for ERA5 hourly atmospheric data (73 channels) and supports distributed training with checkpoint resumption. While the code is well-structured and follows PhysicsNeMo patterns, several issues require attention before merge.

Critical Issues

Training Script (train.py)

The return type annotations are malformed and could mislead developers. Line 56 declares -> tuple[InterpClimateDatapipe, InterpClimateDatapipe] but the function returns three values including num_aux_channels. Line 219 has double brackets: -> tuple[[tuple[torch.Tensor, torch.Tensor], torch.Tensor]] instead of -> tuple[tuple[torch.Tensor, torch.Tensor], torch.Tensor].

Lines 240-242 access batch["latlon"], batch["geopotential"], and batch["land_sea_mask"] without verifying these invariants were enabled in the config. If a user disables these features but the code still tries to access them, it will raise a KeyError at runtime rather than at config validation time.

Line 204 uses .pop("model_name") which destructively modifies the config dictionary and will raise a KeyError if the key is missing. Consider using .get() with a default or explicit validation.

Trainer Class (utils/trainer.py)

Lines 129-132 contain a critical bug in epoch initialization. When load_epoch=None (no checkpoint loaded), the code still increments self.start_epoch = epoch + 1, causing training to incorrectly start from epoch 2 instead of epoch 1. This happens because epoch is initialized to 1 on line 127 regardless of whether a checkpoint was loaded.

The validation loss computation on lines 287-310 uses inconsistent averaging compared to training. Training loss divides by total samples seen (self.samples / self.epoch), while validation averages per batch without accounting for potentially different batch sizes. This could lead to misleading validation metrics if the final validation batch is smaller than others.

Lines 147-149 compute batches = samples // batch_size without validating batch_size > 0, risking division by zero. Additionally, there's no handling for the case where samples % batch_size != 0, meaning the last partial batch is silently dropped.

Data Pipeline (datapipe/climate_interp.py)

Line 77 uses np.random.shuffle(self.indices) without seeding, breaking reproducibility across epochs and workers. The commented-out code on line 75 shows the correct seeded approach that should be used instead.

Lines 181-182 leave debug print statements in production code, which will spam logs during distributed training. These should either be removed or converted to proper logger calls with appropriate rank guards.

Configuration Files

Both train_interp.yaml and train_interp_lite.yaml contain hardcoded absolute paths (lines 39-42 in train_interp.yaml) like /data/era5-73varQ-hourly. The invariant files reference a different dataset subdirectory (/data/era5-wind_gust/) than the main data, which may cause confusion or runtime errors if the directory structure doesn't match expectations.

The batch_size_train: 1 setting in train_interp.yaml (line 42) will severely underutilize GPU memory and compute, resulting in very slow training. This should be increased or documented as a deliberate choice.

Recommendations

  1. Fix return type annotations in train.py lines 56 and 219 to accurately reflect the actual return values.

  2. Add defensive checks in train.py around lines 240-242 to verify invariants are enabled before accessing them, or restructure the code to conditionally build the invariants dict.

  3. Correct epoch initialization logic in trainer.py lines 129-132 to properly handle the load_epoch=None case.

  4. Restore seeded shuffling in climate_interp.py line 77 by uncommenting line 75 and removing line 77.

  5. Remove debug print statements from climate_interp.py and replace with proper logging.

  6. Make paths configurable by using placeholders or environment variables instead of hardcoded absolute paths in both config files.

  7. Add input validation for batch_size, dimensions, and other critical parameters to fail fast with clear error messages.

  8. Consider addressing the modularization feedback from previous reviews. The Trainer abstraction may be appropriate here given its reusability, but ensure the core training logic remains understandable by including better docstrings and possibly a training flow diagram in the README.

Confidence: 3/5 - The implementation is generally solid and follows PhysicsNeMo patterns, but contains several bugs and design issues that could cause runtime failures or confusing behavior. The modularization concerns raised in previous reviews are valid but subjective. Most issues are straightforward to fix, though the epoch initialization bug and validation loss computation require careful attention to avoid breaking checkpoint resumption and metric tracking.

10 files reviewed, 7 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The developer has addressed many of the previous feedback items by: (1) adding a validation script (validate.py) with area-weighted error histograms and baseline comparison; (2) fixing the missing requirements.txt with MLflow and W&B dependencies; (3) expanding the README with validation instructions and more technical background; and (4) implementing validation-specific utilities. However, several critical issues remain: the validation command example in the README is missing a closing quote (line 137), the train.py script still has unguarded .pop("model_name") and nested config access that will raise KeyError if keys are missing, the validation script has hardcoded grid assumptions and potential division-by-zero, and the datapipe/climate_interp.py can generate duplicate interpolation indices when interp_idx == self.stride.

Important Files Changed

Filename Score Overview
examples/weather/temporal_interpolation/README.md 4/5 Comprehensive documentation added with model theory, setup, training, and validation instructions; syntax error on line 137 (missing closing quote in validation command).
examples/weather/temporal_interpolation/train.py 3/5 Main training script added with Hydra orchestration; unguarded .pop("model_name") on line 215 and nested config access on line 401 will raise KeyError if keys missing.
examples/weather/temporal_interpolation/validate.py 3/5 Validation script with histogram-based error metrics; hardcoded 721-lat grid breaks for other resolutions, potential division by zero on line 131, lacks bounds checking on line 95.
examples/weather/temporal_interpolation/datapipe/climate_interp.py 4/5 Specialized datapipe for interpolation training; interp_idx can equal self.stride creating duplicates, unused Tensor import, MRO may not work as intended.
examples/weather/temporal_interpolation/utils/trainer.py 3/5 Reusable trainer class with checkpoint resume and WandB integration; epoch numbering issue if no checkpoint found, validation loss accumulation doesn't match training's per-sample weighting.
examples/weather/temporal_interpolation/utils/loss.py 4/5 Latitude-weighted MSE loss for spherical geometry; potential NaN if all weights zero, otherwise well-implemented with pole handling.
examples/weather/temporal_interpolation/utils/distribute.py 4.5/5 DDP initialization helper; unusual but harmless CUDA stream pattern, otherwise clean and correct.
examples/weather/temporal_interpolation/config/train_interp.yaml 4/5 Production training config with hardcoded absolute paths that users must modify; otherwise well-documented and reasonable.
examples/weather/temporal_interpolation/config/train_interp_lite.yaml 4/5 Lightweight test config with hardcoded paths; clearly marked as test-only, appropriate for smoke testing.
examples/weather/temporal_interpolation/data/data.json 5/5 Static metadata for 73-channel ERA5 dataset; well-formed JSON with no issues.
examples/weather/temporal_interpolation/requirements.txt 4/5 Dependency declaration for MLflow and W&B; missing trailing newline will cause POSIX warnings.

Confidence score: 3/5

  • This PR requires careful review due to multiple unhandled edge cases and potential runtime crashes in critical training and validation paths.
  • Score reflects unguarded dictionary access in train.py (KeyError risks on lines 215, 401), hardcoded grid assumptions in validate.py that break portability, potential duplicate indices in the datapipe, missing closing quote in README validation command, and validation loss accumulation inconsistency. These are not just untested edge cases—they are likely failure modes in common usage scenarios (missing config keys, different data resolutions, checkpoint resume).
  • Pay close attention to train.py (lines 215, 401), validate.py (lines 95, 131, 200-202), datapipe/climate_interp.py (line 90), and README.md (line 137). The validation script and datapipe also need bounds checking and grid size validation.

11 files reviewed, 7 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR.

The developer has added a train_interp_lite.yaml configuration file to the temporal interpolation example at examples/weather/temporal_interpolation/config/train_interp_lite.yaml. Based on the file metadata showing a confidence score of 0 and empty critical sections, this appears to be a new file that was not fully analyzed in previous reviews. The "lite" suffix suggests this is a simplified or resource-constrained variant of the main train_interp.yaml configuration, likely intended for testing or demonstration on smaller infrastructure. This configuration would allow users to experiment with the temporal interpolation training pipeline without requiring full-scale compute resources or complete ERA5 datasets. The addition follows the repository pattern seen in other examples where multiple configuration variants (full-scale vs. lightweight) are provided to accommodate different user environments and use cases.


Important Files Changed

Filename Score Overview
examples/weather/temporal_interpolation/config/train_interp_lite.yaml 3/5 New lightweight configuration variant added for temporal interpolation training, details not yet fully analyzed

Confidence score: 3/5

  • This change introduces a new configuration file that likely reduces resource requirements for the temporal interpolation example, making it more accessible for users with limited infrastructure.
  • Score reflects the lack of detailed analysis in the file metadata (empty critical sections and 0 initial confidence score suggest incomplete review data), preventing assessment of potential hardcoded paths, resource settings, or compatibility issues that may exist.
  • The train_interp_lite.yaml file requires careful review to ensure it provides a genuinely usable lightweight training path and does not contain the same hardcoded absolute path issues identified in train_interp.yaml (lines 34-48 of that file).

11 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

@CharlelieLrt CharlelieLrt changed the base branch from main to 1.3.0-rc October 30, 2025 22:29
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR adds a comprehensive example for training temporal interpolation models using physicsnemo.models.afno.ModAFNO, including training/validation scripts, custom datapipes, and configuration files.

Key Changes:

  • New example in examples/weather/temporal_interpolation/ with complete training pipeline
  • Custom InterpClimateDatapipe extending PhysicsNeMo's climate datapipe for interpolation tasks
  • Trainer class implementing the training loop with checkpoint management and distributed training support
  • Validation script for computing error histograms across interpolation timesteps
  • Geometric L2 loss with latitude weighting for proper error calculation on spherical grids
  • Full documentation in README with setup, training, and validation instructions

Implementation Quality:
The code is generally well-structured and follows good practices. However, there are several logic issues that need attention, particularly around error handling for missing config keys, epoch initialization, and data reproducibility.

Confidence Score: 2/5

  • This PR has multiple logic issues that could cause runtime errors or incorrect behavior
  • Several critical issues exist: (1) epoch initialization bug when load_epoch=None causes training to start from epoch 2, (2) potential KeyError exceptions in train.py when config keys are missing, (3) reproducibility issue with unseeded shuffle in datapipe, (4) inconsistent loss averaging between training and validation, (5) potential division by zero in trainer.py. While these are fixable, they affect core functionality.
  • Pay close attention to utils/trainer.py (epoch initialization, division by zero, loss averaging), train.py (KeyError handling), and datapipe/climate_interp.py (reproducibility)

Important Files Changed

File Analysis

Filename Score Overview
examples/weather/temporal_interpolation/train.py 3/5 Main training script with setup functions for datapipes, model, and optimizer; has several potential KeyError issues when accessing config keys without fallbacks
examples/weather/temporal_interpolation/validate.py 3/5 Validation script for computing error histograms; has potential division by zero and boundary issues with step calculation
examples/weather/temporal_interpolation/utils/trainer.py 2/5 Training loop implementation; has critical issues with epoch initialization when load_epoch=None, division by zero risk, and inconsistent loss averaging between training/validation
examples/weather/temporal_interpolation/datapipe/climate_interp.py 2/5 Custom datapipe for interpolation; has reproducibility issue with unseeded shuffle and potential duplicate step generation when interp_idx == self.stride

Sequence Diagram

sequenceDiagram
    participant User
    participant Main as train.py::main()
    participant Setup as setup_trainer()
    participant DP as InterpClimateDatapipe
    participant Model as ModAFNO
    participant Trainer
    participant Loss as GeometricL2Loss

    User->>Main: Start training
    Main->>Setup: Initialize training components
    Setup->>DP: setup_datapipes()
    DP-->>Setup: train & valid datapipes
    Setup->>Model: setup_model()
    Model-->>Setup: ModAFNO instance
    Setup->>Setup: distribute_model()
    Setup->>Setup: setup_optimizer()
    Setup->>Loss: GeometricL2Loss()
    Loss-->>Setup: Loss function
    Setup->>Trainer: Trainer(model, datapipes, optimizer, loss)
    Trainer-->>Setup: trainer instance
    Setup-->>Main: trainer
    
    Main->>Trainer: fit()
    loop For each epoch
        loop For each batch
            Trainer->>DP: next(train_iterator)
            DP-->>Trainer: batch data
            Trainer->>Trainer: input_output_from_batch_data()
            Trainer->>Model: forward(*invar)
            Model-->>Trainer: predictions
            Trainer->>Loss: loss(pred, true)
            Loss-->>Trainer: loss value
            Trainer->>Trainer: optimizer.step()
        end
        Trainer->>Trainer: validate_on_epoch()
        loop Validation batches
            Trainer->>DP: valid_datapipe
            DP-->>Trainer: valid batch
            Trainer->>Model: eval_step()
            Model-->>Trainer: predictions
            Trainer->>Loss: loss(pred, true)
        end
        Trainer->>Trainer: save_checkpoint()
        Trainer->>Trainer: lr_scheduler.step()
    end
    Trainer-->>Main: Training complete
    Main-->>User: Done
Loading

11 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

steps = np.arange(self.stride + 1)
else:
steps = np.array([0, self.stride, interp_idx])
state_seq = self._load_sequence(year_idx, in_idx, steps)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a mismatch in the number of arguments passed to self._load_sequence, it expects 3 (self, year_idx, sample_idx), but 4 are given here. It does not seem to accept a steps argument

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The self._load_sequence that actually gets called is here and it does accept a steps argument:

def _load_sequence(
self, year_idx: int, idx: int, steps: np.ndarray, num_retries: int = 10
) -> np.ndarray:

This is different from the call signature in ClimateDaliExternalSource in PhysicsNeMo and I understand it's probably confusing but it doesn't break anything in the example (I have run the training example end-to-end and it runs fine).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand it's probably confusing

Yes, that's so confusing that it's unmaintainable. I gave a try to refactor these sources in climate_interp.py but the class inheritance is so convoluted that I couldn't even get started. Also, the fact that all classes in climate_interp.py have no explicit arguments, but *args and **kwargs that are implicitly passed to the parent class makes it even more difficult to figure out where to start. This pattern is ok when the inheritance is straightforward, but it should be avoided with such a complex inheritance tree.

Copy link
Collaborator

@CharlelieLrt CharlelieLrt Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InterpHDF5DaliExternalSource

such a complex inheritance tree

Actually that's not even a tree, but almost cyclic. This needs to be cleaned up a bit, I think. Would there be a way to merge the two classes that you defined in the climate_interp.py into one? For me it's not clear why one should inherit from the other.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I recall the idea was to leave a way to add support for other file formats. For instance to use NetCDF4 files one could define something like

class InterpNetCDF4DaliExternalSource(
    ClimateNetCDF4DaliExternalSource, InterpDaliExternalSource
):

But yes this makes the inheritance structure very complicated. I think it's safe to merge the two classes from climate_interp.py if we don't care too much about extending to file formats beyond HDF5.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jleinonen ok, I merged the two data source classes in climate_interp.py. Could you give it a quick try to make sure it's still works as intended?

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Adds comprehensive temporal interpolation model example using ModAFNO to increase temporal resolution of AI-based weather forecasts from 6-hour to 1-hour intervals.

Key Changes:

  • Complete training pipeline with custom datapipes for HDF5 climate data at hourly resolution
  • ModAFNO-based interpolation model with conditioning on interpolation timestep
  • Custom training loop with checkpoint management, distributed training support, and MLFlow/WandB logging
  • Validation script for computing error histograms across interpolation steps
  • Latitude-weighted L2 loss function with proper geometric weighting
  • Comprehensive documentation and configuration files for both production and testing

Notable Implementation Details:

  • Deterministic seeded sampling ensures reproducibility across distributed workers
  • IO retry logic handles occasional filesystem errors in HDF5 reading
  • Static capture wrappers optimize training and evaluation loops
  • Recent commits added fixes for step clamping and removed unused imports

Minor Issue:

  • train.py:215 uses pop("model_name") without default, which will fail if the key is missing from config (though all provided configs include it)

Confidence Score: 4/5

  • This PR is safe to merge with only one minor syntax issue that should be addressed
  • The implementation is comprehensive and well-structured with proper error handling, documentation, and testing support. Recent commits have already addressed several issues (step clamping, unused imports). The only remaining issue is a missing default value in pop("model_name") which could cause a KeyError if the config is incomplete, though all provided configs include this field. The code demonstrates good practices with proper license headers, type hints, docstrings, and distributed training support.
  • examples/weather/temporal_interpolation/train.py needs the model_name KeyError fix on line 215

Important Files Changed

File Analysis

Filename Score Overview
examples/weather/temporal_interpolation/train.py 4/5 Adds main training script for temporal interpolation model with ModAFNO; includes comprehensive setup for datapipes, models, optimizers, and logging with MLFlow/WandB support
examples/weather/temporal_interpolation/validate.py 5/5 Validation script for computing interpolation error histograms; includes fixes for step clamping and unused imports from recent commits
examples/weather/temporal_interpolation/utils/trainer.py 4/5 Custom training loop implementation with checkpoint management, validation, and WandB logging; has some edge cases with epoch tracking
examples/weather/temporal_interpolation/utils/loss.py 5/5 Implements latitude-weighted L2 loss with proper geometric weighting; well-structured with appropriate pole handling
examples/weather/temporal_interpolation/datapipe/climate_interp.py 4/5 Custom datapipe for interpolation training with specialized DALI external sources; handles HDF5 climate data with retry logic for IO errors

Sequence Diagram

sequenceDiagram
    participant User
    participant train.py
    participant Trainer
    participant DataPipe
    participant Model as ModAFNO
    participant Loss as GeometricL2Loss
    
    User->>train.py: Execute with config
    train.py->>train.py: setup_datapipes()
    train.py->>DataPipe: InterpClimateDatapipe.init()
    DataPipe-->>train.py: train_datapipe, valid_datapipe
    
    train.py->>train.py: setup_model()
    train.py->>Model: ModAFNO.init()
    Model-->>train.py: model instance
    
    train.py->>train.py: setup_optimizer()
    train.py->>Trainer: Trainer.init()
    
    Trainer->>Trainer: load_checkpoint() if load_epoch set
    
    User->>Trainer: fit()
    
    loop For each epoch
        loop For each batch
            Trainer->>DataPipe: next(train_iterator)
            DataPipe-->>Trainer: batch data
            Trainer->>Trainer: input_output_from_batch_data()
            Trainer->>Model: forward(invar)
            Model-->>Trainer: prediction
            Trainer->>Loss: loss(pred, true)
            Loss-->>Trainer: loss value
            Trainer->>Trainer: optimizer.step()
        end
        
        Trainer->>Trainer: validate_on_epoch()
        loop For each validation batch
            Trainer->>DataPipe: iterate valid_datapipe
            DataPipe-->>Trainer: validation batch
            Trainer->>Model: eval_step(invar)
            Model-->>Trainer: prediction
            Trainer->>Loss: loss(pred, true)
            Loss-->>Trainer: loss value
        end
        
        Trainer->>Trainer: lr_scheduler.step()
        Trainer->>Trainer: save_checkpoint() if checkpoint_epoch
    end
    
    Trainer-->>User: Training complete
Loading

12 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Nov 13, 2025

Greptile Overview

Greptile Summary

This PR adds a comprehensive example for training temporal interpolation models using ModAFNO to increase the temporal resolution of weather forecasts from 6-hour to 1-hour intervals. The implementation includes training/validation scripts, custom datapipes, configuration files, and thorough documentation.

Key additions:

  • Complete training pipeline with distributed training support via DDP
  • Custom InterpClimateDatapipe extending PhysicsNeMo's climate datapipe
  • Latitude-weighted geometric L2 loss for better accuracy near poles
  • Validation script for computing error histograms by interpolation step
  • Configuration files for both full-scale and lightweight test runs
  • Comprehensive README with setup instructions and technical background

Issues identified:

  • Multiple logic errors in utils/trainer.py related to checkpoint loading, validation loss calculation, and potential division by zero
  • Critical reproducibility issue in datapipe/climate_interp.py where unseeded shuffle breaks deterministic behavior across workers
  • Potential edge case where interp_idx == stride creates duplicate timesteps in training data
  • Several KeyError risks in train.py from direct dictionary access without defaults
  • Minor issues with hardcoded assumptions in validation script

Confidence Score: 3/5

  • This PR has several logical errors that could affect training correctness, reproducibility, and robustness, particularly in the trainer and datapipe implementations.
  • Score reflects multiple logic errors flagged in previous comments that need resolution: checkpoint loading increments epoch incorrectly causing training to start from epoch 2, validation loss calculation is per-batch rather than per-sample (inconsistent with training), unseeded shuffle in datapipe breaks reproducibility, potential division by zero if batch_size is 0, and several KeyError risks. While the overall architecture is sound and documentation is excellent, these issues could lead to unexpected behavior in production use.
  • examples/weather/temporal_interpolation/utils/trainer.py and examples/weather/temporal_interpolation/datapipe/climate_interp.py require the most attention due to logic errors affecting training correctness and reproducibility.

Important Files Changed

File Analysis

Filename Score Overview
examples/weather/temporal_interpolation/train.py 4/5 Main training script that sets up datapipes, model, and trainer. Some minor issues with KeyError handling on dict access remain.
examples/weather/temporal_interpolation/validate.py 4/5 Validation script for computing interpolation error histograms. Recent commit added clamping for step boundary issue. Minor issues remain with hardcoded latitude assumptions.
examples/weather/temporal_interpolation/datapipe/climate_interp.py 3/5 Custom datapipe for interpolation training. Critical reproducibility issue with unseeded shuffle, and potential issue with duplicate timesteps when interp_idx equals stride.
examples/weather/temporal_interpolation/utils/trainer.py 3/5 Training loop implementation. Multiple logic issues identified: checkpoint loading increments epoch incorrectly, validation loss calculation inconsistent with training, and potential division by zero.
examples/weather/temporal_interpolation/utils/loss.py 4/5 Latitude-weighted L2 loss implementation. Potential NaN issue if normalization produces zero weights, though unlikely with cosine latitude weighting.

Sequence Diagram

sequenceDiagram
    participant User
    participant train.py
    participant Trainer
    participant InterpClimateDatapipe
    participant InterpHDF5Source
    participant ModAFNO
    participant GeometricL2Loss
    
    User->>train.py: Execute training
    train.py->>train.py: setup_datapipes()
    train.py->>InterpClimateDatapipe: Initialize train & valid pipes
    InterpClimateDatapipe->>InterpHDF5Source: Create data source
    
    train.py->>train.py: setup_model()
    train.py->>ModAFNO: Initialize model
    
    train.py->>train.py: setup_optimizer()
    
    train.py->>Trainer: Initialize trainer
    Trainer->>Trainer: load_checkpoint() if specified
    
    loop For each epoch
        Trainer->>Trainer: fit()
        loop For each batch
            Trainer->>InterpClimateDatapipe: Get next batch
            InterpClimateDatapipe->>InterpHDF5Source: Load sequence [t0, t_end, t_interp]
            InterpHDF5Source-->>InterpClimateDatapipe: Return state_seq + timestamps
            InterpClimateDatapipe-->>Trainer: Return batch dict
            
            Trainer->>Trainer: input_output_from_batch_data()
            Trainer->>ModAFNO: Forward pass (x_t0, x_t_end, dt)
            ModAFNO-->>Trainer: Predicted x_t_interp
            
            Trainer->>GeometricL2Loss: Compute loss
            GeometricL2Loss-->>Trainer: Weighted MSE loss
            
            Trainer->>Trainer: Backward pass + optimize
        end
        
        Trainer->>Trainer: validate_on_epoch()
        loop For each validation batch
            Trainer->>InterpClimateDatapipe: Get valid batch
            Trainer->>ModAFNO: eval_step()
            Trainer->>GeometricL2Loss: Compute validation loss
        end
        
        Trainer->>Trainer: save_checkpoint()
    end
    
    Trainer-->>User: Training complete
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

14 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

14 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants