diff --git a/.cursor/rules/model-rule-000-organization-of-new-classes.mdc b/.cursor/rules/model-rule-000-organization-of-new-classes.mdc new file mode 100644 index 0000000000..0ab5049b94 --- /dev/null +++ b/.cursor/rules/model-rule-000-organization-of-new-classes.mdc @@ -0,0 +1,82 @@ +--- +description: Defines where to place model classes - reusable layers go in physicsnemo/nn, complete models go in physicsnemo/models, with exceptions for example-specific code. +alwaysApply: false +--- + +When generating or refactoring model code, rule MOD-000 must be followed. +Explicitly reference the rule ID MOD-000 and quote relevant sections when +explaining placement decisions. For example: "Following rule MOD-000, which +states that reusable layers should go in physicsnemo/nn, this +reusable layer should be placed in physicsnemo/nn/..." + +## MOD-000: Models organization and where to place new model classes + +**Description:** + +There are two types of models in PhysicsNeMo: + +- Reusable layers that are the building blocks of more complex architectures. + Those should go into `physicsnemo/nn`. Those include for instance + `FullyConnected`, various variants of attention layers, `UNetBlock` (a block + of a U-Net), etc. + All layers that are directly exposed to the user should be imported in + `physicsnemo/nn/__init__.py`, such that they can be used as follows: + ```python + from physicsnemo.nn import MyLayer + ``` +- More complete models, composed of multiple layers and/or other sub-models. + Those should go into `physicsnemo/models`. All models that are directly + exposed to the user should be imported in `physicsnemo/models/__init__.py`, + such that they can be used as follows: + ```python + from physicsnemo.models import MyModel + ``` + +The only exception to this rule is for models or layers that are highly specific to a +single example. In this case, it may be acceptable to place them in a module +specific to the example code, such as for example +`examples//utils/nn.py`. + +**Rationale:** +Ensures consistency and clarity in the organization of models in the +repository, in particular a clear separation between reusable layers and more +complete models that are applicable to a specific domain or specific data +modality. + +**Example:** + +```python +# Good: Reusable layer in physicsnemo/nn/attention.py +class MultiHeadAttention(Module): + """A reusable attention layer that can be used in various architectures.""" + pass + +# Good: Complete model in physicsnemo/models/transformer.py +class TransformerModel(Module): + """A complete transformer model composed of attention and feedforward layers.""" + def __init__(self): + super().__init__() + self.attention = MultiHeadAttention(...) + self.ffn = FeedForward(...) 
+
+# Good: Example-specific utility in examples/weather/utils/nn.py
+class WeatherSpecificLayer(Module):
+    """Layer highly specific to the weather forecasting example."""
+    pass
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: Complete model placed in physicsnemo/nn/ instead of physicsnemo/models/
+# File: physicsnemo/nn/transformer.py
+class TransformerModel(Module):
+    """Should be in physicsnemo/models/ not physicsnemo/nn/"""
+    pass
+
+# WRONG: Reusable layer placed in physicsnemo/models/ instead of physicsnemo/nn/
+# File: physicsnemo/models/attention.py
+class MultiHeadAttention(Module):
+    """Should be in physicsnemo/nn/ not physicsnemo/models/"""
+    pass
+```
diff --git a/.cursor/rules/model-rule-001-class-inheritance-for-new-models.mdc b/.cursor/rules/model-rule-001-class-inheritance-for-new-models.mdc
new file mode 100644
index 0000000000..567d8af729
--- /dev/null
+++ b/.cursor/rules/model-rule-001-class-inheritance-for-new-models.mdc
@@ -0,0 +1,45 @@
+---
+description: All model classes must inherit from physicsnemo.Module (not torch.nn.Module directly) to ensure proper serialization, versioning, and registry functionality.
+alwaysApply: false
+---
+
+When creating or modifying model classes, rule MOD-001 must be strictly followed. Explicitly reference "Following rule MOD-001" when explaining inheritance decisions and quote the requirement that all models inherit from `physicsnemo.Module`.
+
+## MOD-001: Use proper class inheritance for all models
+
+**Description:**
+All model classes must inherit from `physicsnemo.Module`. Direct subclasses of
+`torch.nn.Module` are not allowed. Direct subclasses of `physicsnemo.Module`
+are allowed (note that `physicsnemo.Module` is a subclass of `torch.nn.Module`).
+Ensure proper initialization of parent classes using `super().__init__()`. Pass
+the `meta` argument to the `super().__init__()` call if appropriate; otherwise
+set it manually with `self.meta = meta`.
+
+**Rationale:**
+Ensures invariants and functionality of the `physicsnemo.Module` class for all
+models. In particular, instances of `physicsnemo.Module` benefit from features
+that are not available in `torch.nn.Module` instances. Those include serialization
+for checkpointing and loading modules and submodules, a versioning system to
+handle backward compatibility, as well as the ability to be registered in the
+`physicsnemo.registry` for easy instantiation and use in any codebase.
+
+**Example:**
+
+```python
+from torch import nn
+
+from physicsnemo import Module
+
+class MyModel(Module):
+    def __init__(self, input_dim: int, output_dim: int):
+        super().__init__(meta=MyModelMetaData())
+        self.linear = nn.Linear(input_dim, output_dim)
+```
+
+**Anti-pattern:**
+
+```python
+from torch import nn
+
+class MyModel(nn.Module):
+    def __init__(self, input_dim: int, output_dim: int):
+        self.linear = nn.Linear(input_dim, output_dim)
+```
diff --git a/.cursor/rules/model-rule-002-models-lifecycle.mdc b/.cursor/rules/model-rule-002-models-lifecycle.mdc
new file mode 100644
index 0000000000..dd9c298248
--- /dev/null
+++ b/.cursor/rules/model-rule-002-models-lifecycle.mdc
@@ -0,0 +1,126 @@
+---
+description: Defines the 4-stage lifecycle for model classes - experimental, production, pre-deprecation, and deprecation - ensuring structured development and backward compatibility.
+alwaysApply: false
+---
+
+When creating or moving model classes, rule MOD-002 must be followed.
+Explicitly reference "Following rule MOD-002" and state which lifecycle stage
+applies. For example: "Following rule MOD-002, which states that new model
+classes should start in the experimental stage, and that new model classes
+should go in physicsnemo/experimental/models/, this new model should start in
+physicsnemo/experimental/models/..."
+
+## MOD-002: Model classes lifecycle
+
+**Description:**
+All model classes must follow the following lifecycle:
+
+- Stage 1: Creation. This is the stage where the model class is created. For
+  the vast majority of models, new classes are created either in
+  `physicsnemo/experimental/nn` for reusable layers, or in
+  `physicsnemo/experimental/models` for more complete models. The `experimental`
+  folder is used to store models that are still under development (beta or
+  alpha releases); during this stage, backward compatibility is not guaranteed.
+  One exception is when the developer is highly confident that the model
+  is sufficiently mature and applicable to many domains or use cases. In this
+  case the model class can be created in the `physicsnemo/nn` or `physicsnemo/models`
+  folders directly, and backward compatibility is guaranteed. Another exception
+  is when the model class is highly specific to a single example. In this case,
+  it may be acceptable to place it in a module specific to the example code,
+  such as for example `examples//utils/nn.py`.
+
+- Stage 2: Production. After staying in stage 1 for a sufficient amount of time
+  (typically at least 1 release cycle), the model class is promoted to stage 2.
+  It is then moved to the `physicsnemo/nn` or `physicsnemo/models` folders,
+  based on the rule `MOD-000`. During this stage, backward compatibility is
+  guaranteed. Note that a model is not allowed to be promoted to stage 2
+  without the CI tests required by rule MOD-008.
+
+- Stage 3: Pre-deprecation. When a model class in `physicsnemo/nn` or
+  `physicsnemo/models` enters stage 3, the developer should start planning its
+  deprecation. This is done by adding a warning message to the model class,
+  indicating that the model class is deprecated and will be removed in a future
+  release. The warning message should be a clear and concise message that
+  explains why the model class is being deprecated and what the user should do
+  instead. The deprecation message should both be added to the docstring and
+  be raised at runtime. The developer is free to choose the mechanism used to
+  raise the deprecation warning. A model class cannot be deprecated without
+  staying in stage 3 "pre-deprecation" for at least 1 release cycle.
+
+- Stage 4: Deprecation. After staying in stage 3 "pre-deprecation" for at least 1
+  release cycle, the model class is deprecated. It can be deleted from the
+  codebase.
+
+**Rationale:**
+This lifecycle ensures a structured approach to model development and maintenance.
+The experimental stage allows rapid iteration without backward compatibility
+constraints, enabling developers to refine APIs based on user feedback. The
+production stage provides stability for users who depend on these models. The
+pre-deprecation and deprecation stages ensure users have sufficient time to
+migrate to newer alternatives, preventing breaking changes that could disrupt
+their workflows. This graduated approach balances innovation with stability,
+a critical requirement for a scientific computing framework.
+
+**Example:**
+
+```python
+# Good: Stage 1 - New experimental model
+# File: physicsnemo/experimental/models/new_diffusion.py
+class DiffusionModel(Module):
+    """New diffusion model under active development. API may change."""
+    pass
+
+# Good: Stage 2 - Promoted to production after 1 release cycle
+# File: physicsnemo/models/diffusion.py (moved from experimental/)
+class DiffusionModel(Module):
+    """Stable diffusion model with backward compatibility guarantees."""
+    pass
+
+# Good: Stage 3 - Pre-deprecation with warning
+# File: physicsnemo/models/old_diffusion.py
+class OldDiffusionModel(Module):
+    """
+    Legacy diffusion model.
+
+    .. deprecated:: 0.5.0
+        ``OldDiffusionModel`` is deprecated and will be removed in version 0.7.0.
+        Use :class:`~physicsnemo.models.NewDiffusionModel` instead.
+    """
+    def __init__(self):
+        import warnings
+        warnings.warn(
+            "OldDiffusionModel is deprecated. Use NewDiffusionModel instead.",
+            DeprecationWarning,
+            stacklevel=2
+        )
+        super().__init__()
+
+# Good: Stage 4 - Model removed after deprecation period
+# (File deleted from codebase)
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: New model directly in production folder without experimental phase
+# File: physicsnemo/models/brand_new_model.py (should be in experimental/ first)
+class BrandNewModel(Module):
+    """Skipped experimental stage - risky for stability"""
+    pass
+
+# WRONG: Breaking changes in production without deprecation cycle
+# File: physicsnemo/models/diffusion.py
+class DiffusionModel(Module):
+    def __init__(self, new_required_param):  # Breaking change!
+        # Changed API without deprecation warning - breaks user code
+        pass
+
+# WRONG: Deprecation without sufficient warning period
+# (Model deprecated and removed in same release)
+
+# WRONG: No deprecation warning in code
+# File: physicsnemo/models/old_model.py
+class OldModel(Module):
+    """Will be removed next release."""  # Docstring mentions it but no runtime warning
+    def __init__(self):
+        # Missing: warnings.warn(..., DeprecationWarning)
+        super().__init__()
+```
diff --git a/.cursor/rules/model-rule-003-model-classes-documentation.mdc b/.cursor/rules/model-rule-003-model-classes-documentation.mdc
new file mode 100644
index 0000000000..861fd6fa9c
--- /dev/null
+++ b/.cursor/rules/model-rule-003-model-classes-documentation.mdc
@@ -0,0 +1,301 @@
+---
+description: Comprehensive docstring requirements for model classes including NumPy style, LaTeX math notation for tensor shapes, mandatory sections (Parameters, Forward, Outputs), and CI-tested Examples.
+alwaysApply: false
+---
+
+When writing or modifying docstrings, rule MOD-003 must be followed. Explicitly reference "Following rule MOD-003" and quote specific documentation requirements when explaining docstring decisions. For example: "Following rule MOD-003, which states that all docstrings should be written in the NumPy style, and adopt formatting to be compatible with our Sphinx restructured text (RST) documentation, ..."
+
+## MOD-003: Model classes documentation
+
+**Description:**
+
+Every new model or modification of any model code should be documented with a
+comprehensive docstring. Each method of the model class should be documented
+with a docstring as well. The forward method should be documented in the
+docstring of the model class, instead of in the docstring of the forward
+method itself. A docstring on the forward method is still allowed, but it
+should be concise and to the point. To document the forward method, use the
+sections `Forward` and `Outputs`. In addition, all docstrings should be
+written in the NumPy style, and adopt formatting to be compatible with our
+Sphinx restructured text (RST) documentation.
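+
+As a minimal sketch of this required layout (the class and dimension names
+here are illustrative only, not prescribed by this rule):
+
+```python
+class MyLayer(Module):
+    r"""
+    One-line summary of what the layer does.
+
+    Parameters
+    ----------
+    input_dim : int
+        Dimension of input features.
+
+    Forward
+    -------
+    x : torch.Tensor
+        Input tensor of shape :math:`(B, D_{in})`.
+
+    Outputs
+    -------
+    torch.Tensor
+        Output tensor of shape :math:`(B, D_{out})`.
+    """
+```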
+
+The docstrings should follow the following requirements:
+
+- Each docstring should be prefixed with `r"""`.
+
+- The class docstring should at least contain three sections: `Parameters`,
+  `Forward`, and `Outputs`. Other sections such as `Notes` or `Examples`, or
+  directives such as `.. important::` or `.. code-block:: python`, are
+  possible. Sections other than these are not recognized by our Sphinx
+  documentation and are prohibited.
+
+- All methods should be documented with a docstring, with at least a `Parameters`
+  section and a `Returns` section. Other sections such as `Notes` or `Examples`,
+  or directives such as `.. important::` or `.. code-block:: python`, are
+  possible. Sections other than these are not recognized by our Sphinx
+  documentation and are prohibited.
+
+- All tensors should be documented with their shape, using LaTeX math notation such
+  as :math:`(N, C, H_{in}, W_{in})` (there is flexibility for naming the
+  dimensions, but the math format should be enforced). Our documentation is
+  rendered using LaTeX, and supports a rich set of LaTeX commands, so
+  it is recommended to use LaTeX commands whenever possible for mathematical
+  variables in the docstrings. The mathematical notations should be to some degree
+  consistent with the actual variable names in the code (even though
+  that is not always possible, to avoid too complex formatting).
+
+- For arguments or variables that are callback functions (e.g. `Callable`), the
+  docstring should include a clearly separated `.. code-block::` that specifies
+  the required signature and return type of the callback function. This is not
+  only true for callback functions, but for any parameter or argument that has
+  a complex type specification or API requirements. The explanation code
+  block should be placed in the top or bottom section of the docstring, but
+  not in the `Parameters`, `Forward`, or `Outputs` sections, for readability
+  and clarity.
+
+- Inline code should be formatted with double backticks, such as ``my_variable``.
+  Single backticks are not allowed as they don't render properly in our Sphinx
+  documentation.
+
+- All parameters should be documented with their type and default values on a
+  single line.
+
+- When possible, docstrings should use links to other docstrings, such as
+  :class:`~physicsnemo.models.some_model.SomeModel`, or
+  :func:`~physicsnemo.utils.common_function`, or
+  :meth:`~physicsnemo.models.some_model.SomeModel.some_method`.
+
+- When referencing external resources, such as papers, websites, or other
+  documentation, docstrings should use links to the external resource in the
+  format `` `some link text <URL>`_ ``.
+
+- Docstrings are strongly encouraged to have an `Examples` section that
+  demonstrates basic construction and usage of the model. These example sections
+  serve as both documentation and tests, as our CI system automatically tests
+  these code sections for correctness when present. Examples should be
+  executable Python code showing typical use cases, including model
+  instantiation, input preparation, and forward pass execution.
+
+**Rationale:**
+Comprehensive and well-formatted documentation is essential for scientific
+software. It enables users to understand model capabilities, expected inputs,
+and outputs without inspecting source code. LaTeX math notation and proper
+formatting ensure documentation renders correctly in Sphinx, creating
+professional, publication-quality documentation.
Consistent documentation +standards facilitate automatic documentation generation, improve code +discoverability, and help AI agents understand code context. For a framework +used in scientific research, clear documentation of tensor shapes, mathematical +formulations, and API contracts is critical for reproducibility and correct usage. + +**Example:** + +```python +from typing import Callable, Optional +from physicsnemo.models import Module +import torch + +class SimpleEncoder(Module): + r""" + # Rule: Docstring starts with r (for raw string) followed by three double quotes for proper LaTeX rendering + A simple encoder network that transforms input features to a latent representation. + + This model applies a sequence of linear transformations with activation functions + to encode input data into a lower-dimensional latent space. The architecture is + based on the approach described in `Autoencoder Networks `_. + # Rule: External references use proper link format + + The model supports custom preprocessing via a callback function that is applied + to the input before encoding. + + .. code-block:: python + + # Rule: Callback functions or complex API requirements documented in a concise code-block + def preprocess_fn(x: torch.Tensor) -> torch.Tensor: + ... + return y + + where ``x`` is the input tensor of shape :math:`(B, D_{in})` and ``y`` is the output tensor of shape :math:`(B, D_{out})`. + + Parameters + ---------- + # Rule: Parameters section is mandatory + input_dim : int + # Rule: Type and description on single line + Dimension of input features. + latent_dim : int + Dimension of the latent representation. + hidden_dim : int, optional, default=128 + # Rule: Default values are documented on same line + Dimension of the hidden layer. + activation : str, optional, default="relu" + Activation function to use. See :func:`~torch.nn.functional.relu` for details. + # Rule: Cross-references use proper Sphinx syntax + preprocess_fn : Callable[[torch.Tensor], torch.Tensor], optional, default=None + Optional preprocessing function applied to input. See the code block above + for the required signature. + # Rule: Callback functions documented with reference to code-block + + Forward + # Rule: Forward section documents the forward method in class docstring + ------- + x : torch.Tensor + Input tensor of shape :math:`(B, D_{in})` where :math:`B` is batch size + # Rule: Tensor shapes use LaTeX math notation with :math: + and :math:`D_{in}` is input dimension. + return_hidden : bool, optional, default=False + If ``True``, also returns hidden layer activations. + # Rule: Inline code uses double backticks + + Outputs + # Rule: Outputs section is mandatory + ------- + torch.Tensor or tuple + If ``return_hidden`` is ``False``, returns latent representation of shape + :math:`(B, D_{latent})`. If ``True``, returns tuple of + (latent, hidden) where hidden has shape :math:`(B, D_{hidden})`. + + Examples + # Rule: Examples section is allowed and helpful + -------- + >>> model = SimpleEncoder(input_dim=784, latent_dim=64) + >>> x = torch.randn(32, 784) + >>> latent = model(x) + >>> latent.shape + torch.Size([32, 64]) + + Notes + # Rule: Notes section is allowed for additional context + ----- + This encoder can be used as part of a larger autoencoder architecture + by combining it with a decoder network such as + :class:`~physicsnemo.models.decoder.SimpleDecoder`. 
+ # Rule: Cross-references to other classes use proper Sphinx syntax + """ + + def __init__( + self, + input_dim: int, + latent_dim: int, + hidden_dim: int = 128, + activation: str = "relu", + preprocess_fn: Optional[Callable[[torch.Tensor], torch.Tensor]] = None + ): + super().__init__(meta=SimpleEncoderMetaData()) + self.fc1 = torch.nn.Linear(input_dim, hidden_dim) + self.fc2 = torch.nn.Linear(hidden_dim, latent_dim) + self.preprocess_fn = preprocess_fn + + def forward(self, x: torch.Tensor, return_hidden: bool = False): + """Concise forward docstring referencing class docstring for details.""" + # Rule: Forward method can have concise docstring, main docs in class + if self.preprocess_fn is not None: + x = self.preprocess_fn(x) + hidden = torch.relu(self.fc1(x)) + latent = self.fc2(hidden) + if return_hidden: + return latent, hidden + return latent + + def compute_reconstruction_loss( + self, + latent: torch.Tensor, + target: torch.Tensor, + ) -> torch.Tensor: + r""" + Compute mean squared error between latent representation and target. + # Rule: Methods have their own docstrings with Parameters and Returns sections + + Parameters + ---------- + # Rule: Parameters section is mandatory for methods + latent : torch.Tensor + Latent representation of shape :math:`(B, D_{latent})`. + # Rule: Tensor shapes documented with :math: + target : torch.Tensor + Target tensor of shape :math:`(B, D_{latent})`. + + Returns + ------- + # Rule: Returns section is mandatory for methods (not "Outputs" - that's for forward) + torch.Tensor + Scalar loss value. + """ + return torch.nn.functional.mse_loss(latent, target) +``` + +**Anti-pattern:** + +```python +from physicsnemo.models import Module +import torch + +class BadEncoder(Module): + ''' + # WRONG: Should use r (for raw string) followed by three double quotes for docstrings not three single quotes + A simple encoder network + # WRONG: missing Parameters, Forward, or Outputs sections in the docstring + # WRONG: callback function preprocess_fn is not documented at all + ''' + def __init__(self, input_dim, latent_dim, hidden_dim=128, preprocess_fn=None): + # WRONG: No type hints in signature + # WRONG: preprocess_fn callback parameter is missing from docstring entirely + super().__init__() + self.fc1 = torch.nn.Linear(input_dim, hidden_dim) + self.fc2 = torch.nn.Linear(hidden_dim, latent_dim) + self.preprocess_fn = preprocess_fn + + def forward(self, x, return_hidden=False): + # WRONG: No type hints + """ + Forward pass of the encoder. + # WRONG: Should be brief, main docs for the forward method should be in the class docstring + + Args: + x: input tensor with shape (B, D_in) + # WRONG: Using "Args" instead of NumPy-style "Parameters" + # WRONG: Not using :math: with backticks for shapes and other mathematical notations + return_hidden (bool): whether to return hidden state + # WRONG: Mixing documentation styles + + Returns: + encoded representation + # WRONG: Using "Returns" instead of "Outputs". The "Returns" section is for methods other than the forward method. The forward method should have an "Outputs" section instead. 
+ # WRONG: No shape information + """ + # WRONG: Entire forward signature and behavior should be in class docstring + if self.preprocess_fn is not None: + x = self.preprocess_fn(x) + hidden = torch.relu(self.fc1(x)) + latent = self.fc2(hidden) + if return_hidden: + return latent, hidden + return latent + + def compute_loss(self, x, y): + # WRONG: No type hints + """ + Compute the loss + # WRONG: No proper docstring structure + + Description: + # WRONG: "Description" is not a recognized section, should be in the main description + This method computes some loss value + + Arguments: + # WRONG: "Arguments" is not recognized, should be "Parameters" + x: first input (B, D) + # WRONG: Not using :math: for tensor shapes + y: second input + # WRONG: No shape information + + Output: + # WRONG: "Output" (singular) is not recognized, should be "Returns" for methods + loss value + # WRONG: No type or shape information + """ + return torch.nn.functional.mse_loss(x, y) + + def helper_method(self, x): + # WRONG: No docstring at all for method, docstrings for methods are mandatory + # WRONG: missing Parameters and Returns sections in the docstring + # WRONG: No type hints + return x * 2 +``` diff --git a/.cursor/rules/model-rule-004-organization-of-utility-functions-for-models.mdc b/.cursor/rules/model-rule-004-organization-of-utility-functions-for-models.mdc new file mode 100644 index 0000000000..5fb33ff462 --- /dev/null +++ b/.cursor/rules/model-rule-004-organization-of-utility-functions-for-models.mdc @@ -0,0 +1,66 @@ +--- +description: All utility functions specific to a model should be in the same module file as the model itself, not in separate utility files, to maintain self-contained modules. +alwaysApply: false +--- + +When organizing model code, rule MOD-004 must be followed. Explicitly reference "Following rule MOD-004" when deciding where to place utility functions. For example: "Following rule MOD-004, which states that all utility functions for a model class should be contained in the same module file as the model class itself, this helper function should be in my_model.py alongside the model class..." + +## MOD-004: Self-contained model modules + +**Description:** + +All utility functions for a model class should be contained in the same module +file as the model class itself. For a model called `MyModelName` in +`my_model_name.py`, all utility functions specific to that model should also be +in `my_model_name.py`. Utility functions should never be placed in separate +files like `my_model_name_utils.py` or `my_model_name/utils.py`. + +The only exception to this rule is when a utility function is used across +multiple models. In that case, the shared utility should be placed in an +appropriate shared module and imported in `my_model_name.py`. + +**Rationale:** + +Self-contained modules are easier to understand, maintain, and navigate. Having +all model-specific code in one place reduces cognitive load and makes it clear +which utilities are model-specific versus shared. This also simplifies code +reviews and reduces the likelihood of orphaned utility files when models are +refactored or removed. 
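+
+For the shared-utility exception described above, a minimal sketch is shown
+below (the shared module path `physicsnemo/nn/masking.py` is assumed here for
+illustration only, not prescribed by this rule):
+
+```python
+# File: physicsnemo/nn/masking.py (hypothetical shared module)
+import torch
+
+def compute_causal_mask(seq_length: int) -> torch.Tensor:
+    """Causal attention mask reused by several unrelated models."""
+    mask = torch.triu(torch.ones(seq_length, seq_length), diagonal=1)
+    return mask.masked_fill(mask == 1, float("-inf"))
+
+# File: physicsnemo/models/my_transformer.py
+# Importing a genuinely shared utility is the one permitted exception.
+from physicsnemo.nn.masking import compute_causal_mask
+```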
+ +**Example:** + +```python +# Good: Utility function in the same file as the model +# File: physicsnemo/models/my_transformer.py + +def _compute_attention_mask(seq_length: int) -> torch.Tensor: + """Helper function specific to MyTransformer.""" + mask = torch.triu(torch.ones(seq_length, seq_length), diagonal=1) + return mask.masked_fill(mask == 1, float('-inf')) + +class MyTransformer(Module): + """A transformer model.""" + def forward(self, x: torch.Tensor) -> torch.Tensor: + mask = _compute_attention_mask(x.shape[1]) + return self._apply_attention(x, mask) +``` + +**Anti-pattern:** + +```python +# WRONG: Utility function in separate file +# File: physicsnemo/models/my_transformer_utils.py +def _compute_attention_mask(seq_length: int) -> torch.Tensor: + """Should be in my_transformer.py, not in a separate utils file.""" + mask = torch.triu(torch.ones(seq_length, seq_length), diagonal=1) + return mask.masked_fill(mask == 1, float('-inf')) + +# File: physicsnemo/models/my_transformer.py +from physicsnemo.models.my_transformer_utils import _compute_attention_mask # WRONG + +class MyTransformer(Module): + """A transformer model.""" + def forward(self, x: torch.Tensor) -> torch.Tensor: + mask = _compute_attention_mask(x.shape[1]) + return self._apply_attention(x, mask) +``` diff --git a/.cursor/rules/model-rule-005-tensor-arguments-shape-validation.mdc b/.cursor/rules/model-rule-005-tensor-arguments-shape-validation.mdc new file mode 100644 index 0000000000..3bb307e1ff --- /dev/null +++ b/.cursor/rules/model-rule-005-tensor-arguments-shape-validation.mdc @@ -0,0 +1,112 @@ +--- +description: All forward and public methods must validate tensor shapes at the beginning, wrapped in torch.compiler.is_compiling() guard, with standardized error messages. +alwaysApply: false +--- + +When implementing forward or public methods, rule MOD-005 must be followed. Explicitly reference "Following rule MOD-005" when adding validation code. For example: "Following rule MOD-005, which states that all forward methods and other public methods that accept tensor arguments must validate tensor shapes at the beginning of the method, ..." + +## MOD-005: Tensor shape validation + +**Description:** + +All forward methods and other public methods that accept tensor arguments must +validate tensor shapes at the beginning of the method. This rule applies to: +- Individual tensor arguments +- Containers of tensors (lists, tuples, dictionaries) + +For containers, validate their length, required keys, and the shapes of +contained tensors. Validation statements should be concise (ideally one check +per argument). Error messages must follow the standardized format: +`"Expected tensor of shape (B, D) but got tensor of shape {actual_shape}"`. + +To avoid interactions with `torch.compile`, all validation must be wrapped in a +conditional check using `torch.compiler.is_compiling()`. Follow the "fail-fast" +approach by validating inputs before any computation. + +**Rationale:** + +Early shape validation catches errors at the API boundary with clear, actionable +error messages, making debugging significantly easier. Without validation, shape +mismatches result in cryptic errors deep in the computation graph. The +`torch.compile` guard ensures that validation overhead is eliminated in +production compiled code while preserving debug-time safety. 
+ +**Example:** + +```python +def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor: + """Forward pass with shape validation.""" + ### Input validation + # Skip validation when running under torch.compile for performance + if not torch.compiler.is_compiling(): + # Extract expected dimensions + B, C, H, W = x.shape if x.ndim == 4 else (None, None, None, None) + + # Validate x shape + if x.ndim != 4: + raise ValueError( + f"Expected 4D input tensor (B, C, H, W), got {x.ndim}D tensor with shape {tuple(x.shape)}" + ) + + if C != self.in_channels: + raise ValueError( + f"Expected {self.in_channels} input channels, got {C} channels" + ) + + # Validate optional mask + if mask is not None: + if mask.shape != (B, H, W): + raise ValueError( + f"Expected mask shape ({B}, {H}, {W}), got {tuple(mask.shape)}" + ) + + # Actual computation happens after validation + return self._process(x, mask) + +def process_list(self, tensors: List[torch.Tensor]) -> torch.Tensor: + """Process a list of tensors with validation.""" + ### Input validation + if not torch.compiler.is_compiling(): + if len(tensors) == 0: + raise ValueError("Expected non-empty list of tensors") + + # Validate all tensors have consistent shapes + ref_shape = tensors[0].shape + for i, t in enumerate(tensors[1:], start=1): + if t.shape != ref_shape: + raise ValueError( + f"All tensors must have the same shape. " + f"Tensor 0 has shape {tuple(ref_shape)}, " + f"but tensor {i} has shape {tuple(t.shape)}" + ) + + return torch.stack(tensors) +``` + +**Anti-pattern:** + +```python +# WRONG: No validation at all +def forward(self, x: torch.Tensor) -> torch.Tensor: + return self.layer(x) # Will fail with cryptic error if shape is wrong + +# WRONG: Validation not guarded by torch.compiler.is_compiling() +def forward(self, x: torch.Tensor) -> torch.Tensor: + if x.ndim != 4: # Breaks torch.compile + raise ValueError(f"Expected 4D tensor, got {x.ndim}D") + return self.layer(x) + +# WRONG: Validation after computation has started +def forward(self, x: torch.Tensor) -> torch.Tensor: + h = self.layer1(x) # Computation started + if x.shape[1] != self.in_channels: # Too late! + raise ValueError(f"Wrong number of channels") + return self.layer2(h) + +# WRONG: Non-standard error message format +def forward(self, x: torch.Tensor) -> torch.Tensor: + if not torch.compiler.is_compiling(): + if x.ndim != 4: + raise ValueError("Input must be 4D") # Missing actual shape info + return self.layer(x) +``` diff --git a/.cursor/rules/model-rule-006-tensor-type-and-shape-annotations-with-jaxtyping.mdc b/.cursor/rules/model-rule-006-tensor-type-and-shape-annotations-with-jaxtyping.mdc new file mode 100644 index 0000000000..fb10f03ec5 --- /dev/null +++ b/.cursor/rules/model-rule-006-tensor-type-and-shape-annotations-with-jaxtyping.mdc @@ -0,0 +1,74 @@ +--- +description: All tensor arguments in model methods must have jaxtyping type annotations with shape specifications for runtime-checkable shape information. +alwaysApply: false +--- + +When adding type hints to model methods, rule MOD-006 must be followed. Explicitly reference "Following rule MOD-006" when using jaxtyping annotations. For example: "Following rule MOD-006, which states that all tensor arguments and variables in model `__init__`, `forward`, and other public methods must have type annotations using `jaxtyping`, ..." 
+
+## MOD-006: Jaxtyping annotations
+
+**Description:**
+
+All tensor arguments and variables in model `__init__`, `forward`, and other
+public methods must have type annotations using `jaxtyping`. This provides
+runtime-checkable shape information in type hints.
+
+Use the format `Float[torch.Tensor, "shape_spec"]`, where `shape_spec` describes
+tensor dimensions using space-separated dimension names (e.g., `"batch channels height width"`
+or `"b c h w"`).
+
+**Rationale:**
+
+Jaxtyping annotations provide explicit, machine-readable documentation of
+expected tensor shapes. This enables better IDE support, catches shape errors
+earlier, and makes code more self-documenting. The annotations serve as both
+documentation and optional runtime checks when jaxtyping's validation is
+enabled.
+
+**Example:**
+
+```python
+from jaxtyping import Float
+import torch
+
+from physicsnemo import Module
+
+class MyConvNet(Module):
+    def __init__(self, in_channels: int, out_channels: int):
+        super().__init__()
+        self.conv = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3)
+
+    def forward(
+        self,
+        x: Float[torch.Tensor, "batch in_channels height width"]
+    ) -> Float[torch.Tensor, "batch out_channels height width"]:
+        """Process input with convolution."""
+        return self.conv(x)
+
+def process_attention(
+    query: Float[torch.Tensor, "batch seq_len d_model"],
+    key: Float[torch.Tensor, "batch seq_len d_model"],
+    value: Float[torch.Tensor, "batch seq_len d_model"]
+) -> Float[torch.Tensor, "batch seq_len d_model"]:
+    """Compute attention with clear shape annotations."""
+    pass
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: No jaxtyping annotations
+def forward(self, x: torch.Tensor) -> torch.Tensor:
+    return self.layer(x)
+
+# WRONG: Using plain comments instead of jaxtyping
+def forward(self, x: torch.Tensor) -> torch.Tensor:
+    # x: (batch, channels, height, width)  # Use jaxtyping instead
+    return self.layer(x)
+
+# WRONG: Incomplete annotations (missing jaxtyping for tensor arguments)
+def forward(
+    self,
+    x: Float[torch.Tensor, "b c h w"],
+    mask: torch.Tensor  # Missing jaxtyping annotation
+) -> Float[torch.Tensor, "b c h w"]:
+    return self.layer(x, mask)
+```
diff --git a/.cursor/rules/model-rule-007-backward-compatibility.mdc b/.cursor/rules/model-rule-007-backward-compatibility.mdc
new file mode 100644
index 0000000000..41f049d800
--- /dev/null
+++ b/.cursor/rules/model-rule-007-backward-compatibility.mdc
@@ -0,0 +1,119 @@
+---
+description: Forbids changing model signatures without backward compatibility via version bumps, __supported_model_checkpoint_version__, and _backward_compat_arg_mapper implementations.
+alwaysApply: false
+---
+
+When modifying existing production models, rule MOD-007 must be strictly followed. Explicitly reference "Following rule MOD-007" when explaining version changes. For example: "Following rule MOD-007, which states that for any model in `physicsnemo/nn` or `physicsnemo/models`, it is strictly forbidden to change the signature of `__init__`, any public method, or any public attribute without maintaining backward compatibility, incrementing `__model_checkpoint_version__` and adding `_backward_compat_arg_mapper` to handle parameter renames..."
+
+## MOD-007: Backward compatibility
+
+**Description:**
+
+For any model in `physicsnemo/nn` or `physicsnemo/models`, it is strictly
+forbidden to change the signature of `__init__`, any public method, or any
+public attribute without maintaining backward compatibility.
This includes: +- Adding new required parameters +- Removing parameters +- Renaming parameters +- Changing parameter types +- Changing return types + +If a signature change is absolutely necessary, the developer must: +1. Add a backward compatibility mapping in the model class +2. Increment the model version number +3. Maintain support for the old API for at least 2 release cycles +4. Add deprecation warnings for the old API + +**Rationale:** + +PhysicsNeMo is used in production environments and research code where +unexpected API changes can break critical workflows. Maintaining backward +compatibility ensures that users can upgrade to new versions without their code +breaking. Version numbers and compatibility mappings provide a clear migration +path when changes are necessary. + +**Example:** + +```python +from typing import Any, Dict, Optional + +# Good: Adding optional parameter with default value (backward compatible) +class MyModel(Module): + __model_checkpoint_version__ = "2.0" + __supported_model_checkpoint_version__ = { + "1.0": "Loading checkpoint from version 1.0 (current is 2.0). Still supported." + } + + def __init__( + self, + input_dim: int, + output_dim: int, + dropout: float = 0.0, # New parameter with default + new_feature: bool = False # New parameter with default + ): + super().__init__(meta=MyModelMetaData()) + # ... implementation + +# Good: Proper backward compatibility when parameter must be renamed +class MyModel(Module): + __model_checkpoint_version__ = "2.0" + __supported_model_checkpoint_version__ = { + "1.0": ( + "Loading MyModel checkpoint from version 1.0 (current is 2.0). " + "Parameter 'hidden_dim' has been renamed to 'hidden_size'. " + "Consider re-saving to upgrade to version 2.0." + ) + } + + @classmethod + def _backward_compat_arg_mapper( + cls, version: str, args: Dict[str, Any] + ) -> Dict[str, Any]: + """Map arguments from older versions to current version format.""" + # Call parent class method first + args = super()._backward_compat_arg_mapper(version, args) + + if version == "1.0": + # Map old parameter name to new name + if "hidden_dim" in args: + args["hidden_size"] = args.pop("hidden_dim") + + # Remove deprecated parameters that are no longer used + if "legacy_param" in args: + _ = args.pop("legacy_param") + + return args + + def __init__( + self, + input_dim: int, + output_dim: int, + hidden_size: int = 128, # New name (was 'hidden_dim' in v1.0) + ): + super().__init__(meta=MyModelMetaData()) + self.hidden_size = hidden_size + # ... implementation +``` + +**Anti-pattern:** + +```python +# WRONG: Changing parameter name without backward compatibility +class MyModel(Module): + __model_checkpoint_version__ = "2.0" + # Missing: __supported_model_checkpoint_version__ and _backward_compat_arg_mapper + + def __init__(self, input_dim: int, hidden_size: int): # Renamed from hidden_dim + super().__init__(meta=MyModelMetaData()) + # WRONG: Old checkpoints with 'hidden_dim' will fail to load! 
+
+# WRONG: Not calling super() in _backward_compat_arg_mapper
+class MyModel(Module):
+    @classmethod
+    def _backward_compat_arg_mapper(cls, version: str, args: Dict[str, Any]) -> Dict[str, Any]:
+        # WRONG: Missing super()._backward_compat_arg_mapper(version, args)
+        if version == "1.0":
+            if "hidden_dim" in args:
+                args["hidden_size"] = args.pop("hidden_dim")
+        return args
+```
diff --git a/.cursor/rules/model-rule-008-ci-tests-for-model-classes.mdc b/.cursor/rules/model-rule-008-ci-tests-for-model-classes.mdc
new file mode 100644
index 0000000000..d5c9f8edfb
--- /dev/null
+++ b/.cursor/rules/model-rule-008-ci-tests-for-model-classes.mdc
@@ -0,0 +1,145 @@
+---
+description: Every model requires comprehensive CI tests - constructor/attributes, non-regression with reference data, and checkpoint loading - following the test_layers_unet_block.py template.
+alwaysApply: false
+---
+
+When creating new models, refactoring existing models, or adding tests for models, rule MOD-008 must be followed. Explicitly reference "Following rule MOD-008" when implementing tests. For example: "Following rule MOD-008, which states that every model in a module file `my_model_name.py` in `physicsnemo/nn` or `physicsnemo/models` must have corresponding tests in `test/models/test_.py`, ..."
+
+## MOD-008: Minimal CI testing requirements
+
+**Description:**
+
+Every model in a module file `my_model_name.py` in `physicsnemo/nn` or
+`physicsnemo/models` must have corresponding tests in
+`test/models/test_.py`. Tests should roughly follow a similar
+template, and three types of tests are required:
+
+1. **Constructor and attribute tests**: Verify model instantiation and all
+   public attributes (excluding buffers and parameters).
+
+2. **Non-regression test with reference data**: Instantiate a model, run a
+   forward pass, and compare outputs against reference data saved in a `.pth`
+   file.
+
+3. **Non-regression test from checkpoint**: Load a model from a checkpoint file
+   (`.mdlus`) and verify outputs match reference data.
+
+Additional requirements:
+- All tests must use `pytest` parameterization syntax
+- At least 2 configurations must be tested: one with all default arguments, one
+  with non-default arguments. More variations specific to some relevant
+  use-cases are also encouraged.
+- Test tensors must have realistic shapes (e.g. no singleton dimensions) and
+  should be as meaningful and representative of actual use cases as possible.
+- All public methods must have the same non-regression tests as the forward
+  method
+- Simply checking output shapes is NOT sufficient - actual values must be
+  compared
+- **Critical:** Per MOD-002, it is forbidden to move a model out of the
+  experimental stage/directory without these tests
+
+**Rationale:**
+
+Comprehensive tests ensure model correctness and prevent regressions as code
+evolves. Non-regression tests with reference data catch subtle numerical changes
+that could break reproducibility. Checkpoint tests verify serialization and
+deserialization work correctly. Parameterized tests ensure models work across
+different configurations. These tests are required before models can graduate
+from experimental to production status.
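+
+Reference data for these tests is generated once and committed to the
+repository. A sketch of such a generation script is shown below (the paths
+and the `save()` checkpoint method are assumptions for illustration; `save()`
+is inferred as the counterpart of the `from_checkpoint()` call used in the
+example below):
+
+```python
+import torch
+from physicsnemo.models import MyModel
+
+# Build the model with deterministic, seeded parameters so the reference
+# outputs are reproducible (mirrors the _instantiate_model helper below).
+model = MyModel(input_dim=64, output_dim=32)
+gen = torch.Generator(device="cpu")
+gen.manual_seed(0)
+with torch.no_grad():
+    for param in model.parameters():
+        param.copy_(torch.randn(param.shape, generator=gen, dtype=param.dtype))
+
+# Save a reference input/output pair for the non-regression test.
+x = torch.randn(4, 64)
+with torch.no_grad():
+    out = model(x)
+torch.save({"x": x, "out": out}, "test/models/data/my_model_default_v1.0.pth")
+
+# Save the checkpoint used by the from-checkpoint test (assumes a save()
+# method producing the .mdlus file loaded by Module.from_checkpoint()).
+model.save("test/models/data/my_model_default_v1.0.mdlus")
+```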
+
+**Example:**
+
+```python
+# Good: Following the test_layers_unet_block.py template
+import pytest
+import torch
+
+import physicsnemo
+from physicsnemo.models import MyModel
+
+def _instantiate_model(cls, seed: int = 0, **kwargs):
+    """Helper to create model with reproducible parameters."""
+    model = cls(**kwargs)
+    gen = torch.Generator(device="cpu")
+    gen.manual_seed(seed)
+    with torch.no_grad():
+        for param in model.parameters():
+            param.copy_(torch.randn(param.shape, generator=gen, dtype=param.dtype))
+    return model
+
+@pytest.mark.parametrize("device", ["cuda:0", "cpu"])
+@pytest.mark.parametrize(
+    "config",
+    ["default", "custom"],
+    ids=["with_defaults", "with_custom_args"]
+)
+def test_my_model_non_regression(device, config):
+    """Test model forward pass against reference output."""
+    # Setup model configuration
+    if config == "default":
+        model = _instantiate_model(MyModel, input_dim=64, output_dim=32)
+    else:
+        model = _instantiate_model(
+            MyModel,
+            input_dim=64,
+            output_dim=32,
+            hidden_dim=256,
+            dropout=0.1
+        )
+
+    model = model.to(device)
+
+    # Test constructor and attributes
+    assert model.input_dim == 64
+    assert model.output_dim == 32
+    if config == "custom":
+        assert model.hidden_dim == 256
+        assert model.dropout == 0.1
+
+    # Load reference data (meaningful shapes, no singleton dimensions)
+    data = torch.load(f"test/models/data/my_model_{config}_v1.0.pth")
+    x = data["x"].to(device)  # Shape: (4, 64), not (1, 64)
+    out_ref = data["out"].to(device)
+
+    # Run forward and compare
+    out = model(x)
+    assert torch.allclose(out, out_ref, atol=1e-5, rtol=1e-5)
+
+@pytest.mark.parametrize("device", ["cuda:0", "cpu"])
+def test_my_model_from_checkpoint(device):
+    """Test loading model from checkpoint and verify outputs."""
+    model = physicsnemo.Module.from_checkpoint(
+        "test/models/data/my_model_default_v1.0.mdlus"
+    ).to(device)
+
+    # Test attributes
+    assert model.input_dim == 64
+    assert model.output_dim == 32
+
+    # Load reference and verify
+    data = torch.load("test/models/data/my_model_default_v1.0.pth")
+    x = data["x"].to(device)
+    out_ref = data["out"].to(device)
+    out = model(x)
+    assert torch.allclose(out, out_ref, atol=1e-5, rtol=1e-5)
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: Only testing output shapes
+def test_my_model_bad(device):
+    model = MyModel(input_dim=64, output_dim=32).to(device)
+    x = torch.randn(4, 64).to(device)
+    out = model(x)
+    assert out.shape == (4, 32)  # NOT SUFFICIENT!
+
+# WRONG: Using singleton dimensions in test data
+def test_my_model_bad(device):
+    x = torch.randn(1, 1, 64)  # WRONG: Trivial shapes hide bugs
+
+# WRONG: No parameterization
+def test_my_model_bad():
+    model = MyModel(input_dim=64, output_dim=32)  # Only tests defaults
+
+# WRONG: No checkpoint loading test
+# (Missing test_my_model_from_checkpoint entirely)
+```
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index e193e8eccc..63b2542f0a 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -13,6 +13,7 @@
 - [ ] The documentation is up to date with these changes.
 - [ ] The [CHANGELOG.md](https://github.com/NVIDIA/physicsnemo/blob/main/CHANGELOG.md) is up to date with these changes.
 - [ ] An [issue](https://github.com/NVIDIA/physicsnemo/issues) is linked to this pull request.
+- [ ] If I am implementing a new model or modifying any existing model, I have followed the [Models Implementation Coding Standards](https://github.com/NVIDIA/physicsnemo/wiki/Coding-standards-for-models-implementation).
## Dependencies diff --git a/CODING_STANDARDS/MODELS_IMPLEMENTATION.md b/CODING_STANDARDS/MODELS_IMPLEMENTATION.md new file mode 100644 index 0000000000..d275bb10cf --- /dev/null +++ b/CODING_STANDARDS/MODELS_IMPLEMENTATION.md @@ -0,0 +1,1155 @@ + + + + + + + + + + + +# MODELS_IMPLEMENTATION - Coding Standards + +## Overview + +This document defines the coding standards and best practices for implementing +model classes in the PhysicsNeMo repository. These rules are designed to ensure +consistency, maintainability, and high code quality across all model +implementations. + +**Important:** These rules are enforced as strictly as possible. Deviations +from these standards should only be made when absolutely necessary and must be +documented with clear justification in code comments and approved during code +review. + +## Document Organization + +This document is structured in two main sections: + +1. **Rule Index**: A quick-reference table listing all rules with their IDs, + one-line summaries, and the context in which they apply. Use this section + to quickly identify relevant rules when implementing or reviewing code. + +2. **Detailed Rules**: Comprehensive descriptions of each rule, including: + - Clear descriptions of what the rule requires + - Rationale explaining why the rule exists + - Examples demonstrating correct implementation + - Anti-patterns showing common mistakes to avoid + +## How to Use This Document + +- **When creating new models**: Review all rules before starting implementation, + paying special attention to rules MOD-000 through MOD-003. +- **When reviewing code**: Use the Rule Index to quickly verify compliance with + all applicable rules. +- **When refactoring**: Ensure refactored code maintains or improves compliance + with these standards. +- **For AI agents that generate code**: This document is formatted for easy parsing. Each rule has + a unique ID and structured sections (Description, Rationale, Example, + Anti-pattern) that can be extracted and used as context. When generating code + based on a rule, an AI agent should explicitly quote the rule ID that it is + following, and explicitly quote the relevant extract from the rule that it is + using as context. For example, "Following rule MOD-000, the new model class + should be ..." +- **For AI agents that review code**: When reviewing code, the AI agent should + explicitly identify which rules are violated by the code, and provide a clear + explanation of why the code violates the rule. The AI agent should explicitly + quote the rule ID that the code is violating, and explicitly quote the relevant + extract from the rule that it is using as context. For example, "Code violates + rule MOD-000, because the new model class is not..." 
+ +## Rule Index + +| Rule ID | Summary | Apply When | +|---------|---------|------------| +| [`MOD-000`](#mod-000-models-organization-and-where-to-place-new-model-classes) | Models organization and where to place new model classes | Creating or refactoring new model classes | +| [`MOD-001`](#mod-001-use-proper-class-inheritance-for-all-models) | Use proper class inheritance for all models | Creating or refactoring new model classes | +| [`MOD-002`](#mod-002-model-classes-lifecycle) | Model classes lifecycle | Creating or moving existing model classes | +| [`MOD-003`](#mod-003-model-classes-documentation) | Model classes documentation | Creating or editing any docstring in a model class | +| [`MOD-004`](#mod-004-self-contained-model-modules) | Keep utility functions in the same module as the model | Organizing or refactoring model code | +| [`MOD-005`](#mod-005-tensor-shape-validation) | Validate tensor shapes in forward and public methods | Implementing or modifying model forward or public methods | +| [`MOD-006`](#mod-006-jaxtyping-annotations) | Use jaxtyping for tensor type annotations | Adding or editing any new public method of a model class | +| [`MOD-007`](#mod-007-backward-compatibility) | Maintain backward compatibility for model signatures | Modifying existing production models | +| [`MOD-008`](#mod-008-minimal-ci-testing-requirements) | Provide comprehensive CI tests for all models | Moving models out of experimental or adding new models | + +--- + +## Detailed Rules + +### MOD-000: Models organization and where to place new model classes + +**Description:** + +There are two types of models in PhysicsNeMo: + +- Reusable layers that are the building blocks of more complex architectures. + Those should go into `physicsnemo/nn`. Those include for instance + `FullyConnected`, various variants of attention layers, `UNetBlock` (a block + of a U-Net), etc. + All layers that are directly exposed to the user should be imported in + `physicsnemo/nn/__init__.py`, such that they can be used as follows: + ```python + from physicsnemo.nn import MyLayer + ``` +- More complete models, composed of multiple layers and/or other sub-models. + Those should go into `physicsnemo/models`. All models that are directly + exposed to the user should be imported in `physicsnemo/models/__init__.py`, + such that they can be used as follows: + ```python + from physicsnemo.models import MyModel + ``` + +The only exception to this rule is for models or layers that are highly specific to a +single example. In this case, it may be acceptable to place them in a module +specific to the example code, such as for example +`examples//utils/nn.py`. + +**Rationale:** +Ensures consistency and clarity in the organization of models in the +repository, in particular a clear separation between reusable layers and more +complete models that are applicable to a specific domain or specific data +modality. + +**Example:** + +```python +# Good: Reusable layer in physicsnemo/nn/attention.py +class MultiHeadAttention(Module): + """A reusable attention layer that can be used in various architectures.""" + pass + +# Good: Complete model in physicsnemo/models/transformer.py +class TransformerModel(Module): + """A complete transformer model composed of attention and feedforward layers.""" + def __init__(self): + super().__init__() + self.attention = MultiHeadAttention(...) + self.ffn = FeedForward(...) 
+
+# Good: Example-specific utility in examples/weather/utils/nn.py
+class WeatherSpecificLayer(Module):
+    """Layer highly specific to the weather forecasting example."""
+    pass
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: Complete model placed in physicsnemo/nn/ instead of physicsnemo/models/
+# File: physicsnemo/nn/transformer.py
+class TransformerModel(Module):
+    """Should be in physicsnemo/models/ not physicsnemo/nn/"""
+    pass
+
+# WRONG: Reusable layer placed in physicsnemo/models/ instead of physicsnemo/nn/
+# File: physicsnemo/models/attention.py
+class MultiHeadAttention(Module):
+    """Should be in physicsnemo/nn/ not physicsnemo/models/"""
+    pass
+```
+
+---
+
+### MOD-001: Use proper class inheritance for all models
+
+**Description:**
+All model classes must inherit from `physicsnemo.Module`. Direct subclasses of
+`torch.nn.Module` are not allowed. Direct subclasses of `physicsnemo.Module`
+are allowed (note that `physicsnemo.Module` is a subclass of `torch.nn.Module`).
+Ensure proper initialization of parent classes using `super().__init__()`. Pass
+the `meta` argument to the `super().__init__()` call if appropriate; otherwise
+set it manually with `self.meta = meta`.
+
+**Rationale:**
+Ensures invariants and functionality of the `physicsnemo.Module` class for all
+models. In particular, instances of `physicsnemo.Module` benefit from features
+that are not available in `torch.nn.Module` instances. Those include serialization
+for checkpointing and loading modules and submodules, a versioning system to
+handle backward compatibility, as well as the ability to be registered in the
+`physicsnemo.registry` for easy instantiation and use in any codebase.
+
+**Example:**
+
+```python
+from torch import nn
+
+from physicsnemo import Module
+
+class MyModel(Module):
+    def __init__(self, input_dim: int, output_dim: int):
+        super().__init__(meta=MyModelMetaData())
+        self.linear = nn.Linear(input_dim, output_dim)
+```
+
+**Anti-pattern:**
+
+```python
+from torch import nn
+
+class MyModel(nn.Module):
+    def __init__(self, input_dim: int, output_dim: int):
+        self.linear = nn.Linear(input_dim, output_dim)
+```
+
+---
+
+### MOD-002: Model classes lifecycle
+
+**Description:**
+All model classes must follow the following lifecycle:
+
+- Stage 1: Creation. This is the stage where the model class is created. For
+  the vast majority of models, new classes are created either in
+  `physicsnemo/experimental/nn` for reusable layers, or in
+  `physicsnemo/experimental/models` for more complete models. The `experimental`
+  folder is used to store models that are still under development (beta or
+  alpha releases); during this stage, backward compatibility is not guaranteed.
+  One exception is when the developer is highly confident that the model
+  is sufficiently mature and applicable to many domains or use cases. In this
+  case the model class can be created in the `physicsnemo/nn` or `physicsnemo/models`
+  folders directly, and backward compatibility is guaranteed. Another exception
+  is when the model class is highly specific to a single example. In this case,
+  it may be acceptable to place it in a module specific to the example code,
+  such as for example `examples//utils/nn.py`.
+
+- Stage 2: Production. After staying in stage 1 for a sufficient amount of time
+  (typically at least 1 release cycle), the model class is promoted to stage 2.
+  It is then moved to the `physicsnemo/nn` or `physicsnemo/models` folders,
+  based on the rule `MOD-000`. During this stage, backward compatibility is
+  guaranteed. Note that a model is not allowed to be promoted to stage 2
+  without the CI tests required by rule MOD-008.
+
+- Stage 3: Pre-deprecation. When a model class in `physicsnemo/nn` or
+  `physicsnemo/models` enters stage 3, the developer should start planning its
+  deprecation. This is done by adding a warning message to the model class,
+  indicating that the model class is deprecated and will be removed in a future
+  release. The warning message should be a clear and concise message that
+  explains why the model class is being deprecated and what the user should do
+  instead. The deprecation message should both be added to the docstring and
+  be raised at runtime. The developer is free to choose the mechanism used to
+  raise the deprecation warning. A model class cannot be deprecated without
+  staying in stage 3 "pre-deprecation" for at least 1 release cycle.
+
+- Stage 4: Deprecation. After staying in stage 3 "pre-deprecation" for at least 1
+  release cycle, the model class is deprecated. It can be deleted from the
+  codebase.
+
+**Rationale:**
+This lifecycle ensures a structured approach to model development and maintenance.
+The experimental stage allows rapid iteration without backward compatibility
+constraints, enabling developers to refine APIs based on user feedback. The
+production stage provides stability for users who depend on these models. The
+pre-deprecation and deprecation stages ensure users have sufficient time to
+migrate to newer alternatives, preventing breaking changes that could disrupt
+their workflows. This graduated approach balances innovation with stability,
+a critical requirement for a scientific computing framework.
+
+**Example:**
+
+```python
+# Good: Stage 1 - New experimental model
+# File: physicsnemo/experimental/models/new_diffusion.py
+class DiffusionModel(Module):
+    """New diffusion model under active development. API may change."""
+    pass
+
+# Good: Stage 2 - Promoted to production after 1 release cycle
+# File: physicsnemo/models/diffusion.py (moved from experimental/)
+class DiffusionModel(Module):
+    """Stable diffusion model with backward compatibility guarantees."""
+    pass
+
+# Good: Stage 3 - Pre-deprecation with warning
+# File: physicsnemo/models/old_diffusion.py
+class OldDiffusionModel(Module):
+    """
+    Legacy diffusion model.
+
+    .. deprecated:: 0.5.0
+        ``OldDiffusionModel`` is deprecated and will be removed in version 0.7.0.
+        Use :class:`~physicsnemo.models.NewDiffusionModel` instead.
+    """
+    def __init__(self):
+        import warnings
+        warnings.warn(
+            "OldDiffusionModel is deprecated. Use NewDiffusionModel instead.",
+            DeprecationWarning,
+            stacklevel=2
+        )
+        super().__init__()
+
+# Good: Stage 4 - Model removed after deprecation period
+# (File deleted from codebase)
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: New model directly in production folder without experimental phase
+# File: physicsnemo/models/brand_new_model.py (should be in experimental/ first)
+class BrandNewModel(Module):
+    """Skipped experimental stage - risky for stability"""
+    pass
+
+# WRONG: Breaking changes in production without deprecation cycle
+# File: physicsnemo/models/diffusion.py
+class DiffusionModel(Module):
+    def __init__(self, new_required_param):  # Breaking change!
+        # Changed API without deprecation warning - breaks user code
+        pass
+
+# WRONG: Deprecation without sufficient warning period
+# (Model deprecated and removed in same release)
+
+# WRONG: No deprecation warning in code
+# File: physicsnemo/models/old_model.py
+class OldModel(Module):
+    """Will be removed next release."""  # Docstring mentions it but no runtime warning
+    def __init__(self):
+        # Missing: warnings.warn(..., DeprecationWarning)
+        super().__init__()
+```
+
+---
+
+### MOD-003: Model classes documentation
+
+**Description:**
+
+Every new model, and any modification to existing model code, must be
+documented with a comprehensive docstring. Each method of the model class
+should be documented with a docstring as well. The forward method should be
+documented in the docstring of the model class, instead of in the docstring of
+the forward method itself. A docstring for the forward method is still
+possible, but it should be concise and to the point. To document the forward
+method, use the sections `Forward` and `Outputs`. In addition, all docstrings
+should be written in the NumPy style, with formatting compatible with our
+Sphinx reStructuredText (RST) documentation.
+Docstrings must satisfy the following requirements:
+
+- Each docstring should be prefixed with `r"""`.
+
+- The class docstring should contain at least three sections: `Parameters`,
+  `Forward`, and `Outputs`. Additional sections such as `Notes` and `Examples`,
+  as well as directives such as `.. important::` or `.. code-block:: python`,
+  are allowed. Any other sections are not recognized by our Sphinx
+  documentation and are prohibited.
+
+- All methods should be documented with a docstring, with at least a `Parameters`
+  section and a `Returns` section. Additional sections such as `Notes` and
+  `Examples`, as well as directives such as `.. important::` or
+  `.. code-block:: python`, are allowed. Any other sections are not recognized
+  by our Sphinx documentation and are prohibited.
+
+- All tensors should be documented with their shape, using LaTeX math notation such
+  as :math:`(N, C, H_{in}, W_{in})` (there is flexibility in naming the
+  dimensions, but the math format must be enforced). Our documentation is
+  rendered using LaTeX and supports a rich set of LaTeX commands, so
+  it is recommended to use LaTeX commands whenever possible for mathematical
+  variables in docstrings. The mathematical notation should be reasonably
+  consistent with the actual variable names in the code (even though that is
+  not always possible without overly complex formatting).
+
+- For arguments or variables that are callback functions (e.g. `Callable`), the
+  docstring should include a clearly separated `.. code-block::` that specifies
+  the required signature and return type of the callback function. This applies
+  not only to callback functions, but to any parameter or argument that has a
+  complex type specification or API requirements. The explanatory code block
+  should be placed in the top or bottom section of the docstring, not in the
+  `Parameters`, `Forward`, or `Outputs` sections, for readability and clarity.
+
+- Inline code should be formatted with double backticks, such as ``my_variable``.
+  Single backticks are not allowed as they don't render properly in our Sphinx
+  documentation.
+
+- All parameters should be documented with their type and default values on a
+  single line.
+
+- When possible, docstrings should use links to other docstrings, such as
+  :class:`~physicsnemo.models.some_model.SomeModel`,
+  :func:`~physicsnemo.utils.common_function`, or
+  :meth:`~physicsnemo.models.some_model.SomeModel.some_method`.
+
+- When referencing external resources, such as papers, websites, or other
+  documentation, docstrings should use links to the external resource in the
+  RST format `` `some link text <URL>`_ ``, where `URL` is the address of the
+  resource.
+
+- Docstrings are strongly encouraged to have an `Examples` section that
+  demonstrates basic construction and usage of the model. These example sections
+  serve as both documentation and tests, as our CI system automatically tests
+  these code sections for correctness when present. Examples should be
+  executable Python code showing typical use cases, including model
+  instantiation, input preparation, and forward pass execution.
+
+**Rationale:**
+Comprehensive and well-formatted documentation is essential for scientific
+software. It enables users to understand model capabilities, expected inputs,
+and outputs without inspecting source code. LaTeX math notation and proper
+formatting ensure documentation renders correctly in Sphinx, creating
+professional, publication-quality documentation. Consistent documentation
+standards facilitate automatic documentation generation, improve code
+discoverability, and help AI agents understand code context. For a framework
+used in scientific research, clear documentation of tensor shapes, mathematical
+formulations, and API contracts is critical for reproducibility and correct usage.
+
+**Example:**
+
+```python
+from typing import Callable, Optional
+
+import torch
+
+from physicsnemo import Module
+
+class SimpleEncoder(Module):
+    r"""
+    # Rule: Docstring starts with r (for raw string) followed by three double quotes for proper LaTeX rendering
+    A simple encoder network that transforms input features to a latent representation.
+
+    This model applies a sequence of linear transformations with activation functions
+    to encode input data into a lower-dimensional latent space. The architecture is
+    based on the approach described in `Autoencoder Networks <URL>`_.
+    # Rule: External references use proper link format
+
+    The model supports custom preprocessing via a callback function that is applied
+    to the input before encoding.
+
+    .. code-block:: python
+
+        # Rule: Callback functions or complex API requirements documented in a concise code-block
+        def preprocess_fn(x: torch.Tensor) -> torch.Tensor:
+            ...
+            return y
+
+    where ``x`` is the input tensor of shape :math:`(B, D_{in})` and ``y`` is the output tensor of shape :math:`(B, D_{out})`.
+
+    Parameters
+    ----------
+    # Rule: Parameters section is mandatory
+    input_dim : int
+        # Rule: Type and description on single line
+        Dimension of input features.
+    latent_dim : int
+        Dimension of the latent representation.
+    hidden_dim : int, optional, default=128
+        # Rule: Default values are documented on same line
+        Dimension of the hidden layer.
+    activation : str, optional, default="relu"
+        Activation function to use. See :func:`~torch.nn.functional.relu` for details.
+        # Rule: Cross-references use proper Sphinx syntax
+    preprocess_fn : Callable[[torch.Tensor], torch.Tensor], optional, default=None
+        Optional preprocessing function applied to input. See the code block above
+        for the required signature.
+        # Rule: Callback functions documented with reference to code-block
+
+    Forward
+    -------
+    # Rule: Forward section documents the forward method in class docstring
+    x : torch.Tensor
+        Input tensor of shape :math:`(B, D_{in})` where :math:`B` is batch size
+        # Rule: Tensor shapes use LaTeX math notation with :math:
+        and :math:`D_{in}` is input dimension.
+    return_hidden : bool, optional, default=False
+        If ``True``, also returns hidden layer activations.
+        # Rule: Inline code uses double backticks
+
+    Outputs
+    -------
+    # Rule: Outputs section is mandatory
+    torch.Tensor or tuple
+        If ``return_hidden`` is ``False``, returns latent representation of shape
+        :math:`(B, D_{latent})`. If ``True``, returns tuple of
+        (latent, hidden) where hidden has shape :math:`(B, D_{hidden})`.
+
+    Examples
+    --------
+    # Rule: Examples section is allowed and helpful
+    >>> model = SimpleEncoder(input_dim=784, latent_dim=64)
+    >>> x = torch.randn(32, 784)
+    >>> latent = model(x)
+    >>> latent.shape
+    torch.Size([32, 64])
+
+    Notes
+    -----
+    # Rule: Notes section is allowed for additional context
+    This encoder can be used as part of a larger autoencoder architecture
+    by combining it with a decoder network such as
+    :class:`~physicsnemo.models.decoder.SimpleDecoder`.
+    # Rule: Cross-references to other classes use proper Sphinx syntax
+    """
+
+    def __init__(
+        self,
+        input_dim: int,
+        latent_dim: int,
+        hidden_dim: int = 128,
+        activation: str = "relu",
+        preprocess_fn: Optional[Callable[[torch.Tensor], torch.Tensor]] = None
+    ):
+        super().__init__(meta=SimpleEncoderMetaData())
+        self.fc1 = torch.nn.Linear(input_dim, hidden_dim)
+        self.fc2 = torch.nn.Linear(hidden_dim, latent_dim)
+        self.preprocess_fn = preprocess_fn
+
+    def forward(self, x: torch.Tensor, return_hidden: bool = False):
+        """Concise forward docstring referencing class docstring for details."""
+        # Rule: Forward method can have concise docstring, main docs in class
+        if self.preprocess_fn is not None:
+            x = self.preprocess_fn(x)
+        hidden = torch.relu(self.fc1(x))
+        latent = self.fc2(hidden)
+        if return_hidden:
+            return latent, hidden
+        return latent
+
+    def compute_reconstruction_loss(
+        self,
+        latent: torch.Tensor,
+        target: torch.Tensor,
+    ) -> torch.Tensor:
+        r"""
+        Compute mean squared error between latent representation and target.
+        # Rule: Methods have their own docstrings with Parameters and Returns sections
+
+        Parameters
+        ----------
+        # Rule: Parameters section is mandatory for methods
+        latent : torch.Tensor
+            Latent representation of shape :math:`(B, D_{latent})`.
+            # Rule: Tensor shapes documented with :math:
+        target : torch.Tensor
+            Target tensor of shape :math:`(B, D_{latent})`.
+
+        Returns
+        -------
+        # Rule: Returns section is mandatory for methods (not "Outputs" - that's for forward)
+        torch.Tensor
+            Scalar loss value.
+ """ + return torch.nn.functional.mse_loss(latent, target) +``` + +**Anti-pattern:** + +```python +from physicsnemo.models import Module +import torch + +class BadEncoder(Module): + ''' + # WRONG: Should use r (for raw string) followed by three double quotes for docstrings not three single quotes + A simple encoder network + # WRONG: missing Parameters, Forward, or Outputs sections in the docstring + # WRONG: callback function preprocess_fn is not documented at all + ''' + def __init__(self, input_dim, latent_dim, hidden_dim=128, preprocess_fn=None): + # WRONG: No type hints in signature + # WRONG: preprocess_fn callback parameter is missing from docstring entirely + super().__init__() + self.fc1 = torch.nn.Linear(input_dim, hidden_dim) + self.fc2 = torch.nn.Linear(hidden_dim, latent_dim) + self.preprocess_fn = preprocess_fn + + def forward(self, x, return_hidden=False): + # WRONG: No type hints + """ + Forward pass of the encoder. + # WRONG: Should be brief, main docs for the forward method should be in the class docstring + + Args: + x: input tensor with shape (B, D_in) + # WRONG: Using "Args" instead of NumPy-style "Parameters" + # WRONG: Not using :math: with backticks for shapes and other mathematical notations + return_hidden (bool): whether to return hidden state + # WRONG: Mixing documentation styles + + Returns: + encoded representation + # WRONG: Using "Returns" instead of "Outputs". The "Returns" section is for methods other than the forward method. The forward method should have an "Outputs" section instead. + # WRONG: No shape information + """ + # WRONG: Entire forward signature and behavior should be in class docstring + if self.preprocess_fn is not None: + x = self.preprocess_fn(x) + hidden = torch.relu(self.fc1(x)) + latent = self.fc2(hidden) + if return_hidden: + return latent, hidden + return latent + + def compute_loss(self, x, y): + # WRONG: No type hints + """ + Compute the loss + # WRONG: No proper docstring structure + + Description: + # WRONG: "Description" is not a recognized section, should be in the main description + This method computes some loss value + + Arguments: + # WRONG: "Arguments" is not recognized, should be "Parameters" + x: first input (B, D) + # WRONG: Not using :math: for tensor shapes + y: second input + # WRONG: No shape information + + Output: + # WRONG: "Output" (singular) is not recognized, should be "Returns" for methods + loss value + # WRONG: No type or shape information + """ + return torch.nn.functional.mse_loss(x, y) + + def helper_method(self, x): + # WRONG: No docstring at all for method, docstrings for methods are mandatory + # WRONG: missing Parameters and Returns sections in the docstring + # WRONG: No type hints + return x * 2 +``` + +--- + +### MOD-004: Self-contained model modules + +**Description:** + +All utility functions for a model class should be contained in the same module +file as the model class itself. For a model called `MyModelName` in +`my_model_name.py`, all utility functions specific to that model should also be +in `my_model_name.py`. Utility functions should never be placed in separate +files like `my_model_name_utils.py` or `my_model_name/utils.py`. + +The only exception to this rule is when a utility function is used across +multiple models. In that case, the shared utility should be placed in an +appropriate shared module and imported in `my_model_name.py`. + +**Rationale:** + +Self-contained modules are easier to understand, maintain, and navigate. 
+all model-specific code in one place reduces cognitive load and makes it clear
+which utilities are model-specific versus shared. This also simplifies code
+reviews and reduces the likelihood of orphaned utility files when models are
+refactored or removed.
+
+**Example:**
+
+```python
+# Good: Utility function in the same file as the model
+# File: physicsnemo/models/my_transformer.py
+import torch
+
+from physicsnemo import Module
+
+def _compute_attention_mask(seq_length: int) -> torch.Tensor:
+    """Helper function specific to MyTransformer."""
+    mask = torch.triu(torch.ones(seq_length, seq_length), diagonal=1)
+    return mask.masked_fill(mask == 1, float('-inf'))
+
+class MyTransformer(Module):
+    """A transformer model."""
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        mask = _compute_attention_mask(x.shape[1])
+        return self._apply_attention(x, mask)
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: Utility function in separate file
+# File: physicsnemo/models/my_transformer_utils.py
+def _compute_attention_mask(seq_length: int) -> torch.Tensor:
+    """Should be in my_transformer.py, not in a separate utils file."""
+    mask = torch.triu(torch.ones(seq_length, seq_length), diagonal=1)
+    return mask.masked_fill(mask == 1, float('-inf'))
+
+# File: physicsnemo/models/my_transformer.py
+from physicsnemo.models.my_transformer_utils import _compute_attention_mask  # WRONG
+
+class MyTransformer(Module):
+    """A transformer model."""
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        mask = _compute_attention_mask(x.shape[1])
+        return self._apply_attention(x, mask)
+```
+
+---
+
+### MOD-005: Tensor shape validation
+
+**Description:**
+
+All forward methods and other public methods that accept tensor arguments must
+validate tensor shapes at the beginning of the method. This rule applies to:
+- Individual tensor arguments
+- Containers of tensors (lists, tuples, dictionaries)
+
+For containers, validate their length, required keys, and the shapes of
+contained tensors. Validation statements should be concise (ideally one check
+per argument). Error messages must follow the standardized format:
+`"Expected tensor of shape (B, D) but got tensor of shape {actual_shape}"`.
+
+To avoid interactions with `torch.compile`, all validation must be wrapped in a
+conditional check using `torch.compiler.is_compiling()`. Follow the "fail-fast"
+approach by validating inputs before any computation.
+
+**Rationale:**
+
+Early shape validation catches errors at the API boundary with clear, actionable
+error messages, making debugging significantly easier. Without validation, shape
+mismatches result in cryptic errors deep in the computation graph. The
+`torch.compile` guard ensures that validation overhead is eliminated in
+production compiled code while preserving debug-time safety.
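+
+For instance, per-argument checks can often be kept to a single line with a
+small module-level helper. The sketch below is illustrative only:
+`_validate_shape` is a hypothetical helper name, not an existing PhysicsNeMo
+utility, and `None` entries stand for dimensions of any size.
+
+```python
+from typing import Optional, Tuple
+
+import torch
+
+def _validate_shape(
+    t: torch.Tensor, expected: Tuple[Optional[int], ...], name: str
+) -> None:
+    """Raise ValueError unless ``t`` matches ``expected`` (None = any size)."""
+    ok = t.ndim == len(expected) and all(
+        e is None or s == e for s, e in zip(t.shape, expected)
+    )
+    if not ok:
+        raise ValueError(
+            f"Expected {name} of shape {expected} but got tensor of shape "
+            f"{tuple(t.shape)}"
+        )
+
+# Inside a forward method, guarded as required by this rule:
+# if not torch.compiler.is_compiling():
+#     _validate_shape(x, (None, self.in_channels, None, None), "x")
+```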
+
+**Example:**
+
+```python
+from typing import List, Optional
+
+import torch
+
+def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
+    """Forward pass with shape validation."""
+    ### Input validation
+    # Skip validation when running under torch.compile for performance
+    if not torch.compiler.is_compiling():
+        # Extract expected dimensions
+        B, C, H, W = x.shape if x.ndim == 4 else (None, None, None, None)
+
+        # Validate x shape
+        if x.ndim != 4:
+            raise ValueError(
+                f"Expected 4D input tensor (B, C, H, W), got {x.ndim}D tensor with shape {tuple(x.shape)}"
+            )
+
+        if C != self.in_channels:
+            raise ValueError(
+                f"Expected {self.in_channels} input channels, got {C} channels"
+            )
+
+        # Validate optional mask
+        if mask is not None:
+            if mask.shape != (B, H, W):
+                raise ValueError(
+                    f"Expected mask shape ({B}, {H}, {W}), got {tuple(mask.shape)}"
+                )
+
+    # Actual computation happens after validation
+    return self._process(x, mask)
+
+def process_list(self, tensors: List[torch.Tensor]) -> torch.Tensor:
+    """Process a list of tensors with validation."""
+    ### Input validation
+    if not torch.compiler.is_compiling():
+        if len(tensors) == 0:
+            raise ValueError("Expected non-empty list of tensors")
+
+        # Validate all tensors have consistent shapes
+        ref_shape = tensors[0].shape
+        for i, t in enumerate(tensors[1:], start=1):
+            if t.shape != ref_shape:
+                raise ValueError(
+                    f"All tensors must have the same shape. "
+                    f"Tensor 0 has shape {tuple(ref_shape)}, "
+                    f"but tensor {i} has shape {tuple(t.shape)}"
+                )
+
+    return torch.stack(tensors)
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: No validation at all
+def forward(self, x: torch.Tensor) -> torch.Tensor:
+    return self.layer(x)  # Will fail with cryptic error if shape is wrong
+
+# WRONG: Validation not guarded by torch.compiler.is_compiling()
+def forward(self, x: torch.Tensor) -> torch.Tensor:
+    if x.ndim != 4:  # Breaks torch.compile
+        raise ValueError(f"Expected 4D tensor, got {x.ndim}D")
+    return self.layer(x)
+
+# WRONG: Validation after computation has started
+def forward(self, x: torch.Tensor) -> torch.Tensor:
+    h = self.layer1(x)  # Computation started
+    if x.shape[1] != self.in_channels:  # Too late!
+        raise ValueError("Wrong number of channels")
+    return self.layer2(h)
+
+# WRONG: Non-standard error message format
+def forward(self, x: torch.Tensor) -> torch.Tensor:
+    if not torch.compiler.is_compiling():
+        if x.ndim != 4:
+            raise ValueError("Input must be 4D")  # Missing actual shape info
+    return self.layer(x)
+```
+
+---
+
+### MOD-006: Jaxtyping annotations
+
+**Description:**
+
+All tensor arguments and variables in model `__init__`, `forward`, and other
+public methods must have type annotations using `jaxtyping`. This provides
+runtime-checkable shape information in type hints.
+
+Use the format `Float[torch.Tensor, "shape_spec"]`, where `shape_spec` describes
+tensor dimensions using space-separated dimension names (e.g., `"batch channels height width"`
+or `"b c h w"`).
+
+**Rationale:**
+
+Jaxtyping annotations provide explicit, machine-readable documentation of
+expected tensor shapes. This enables better IDE support, catches shape errors
+earlier, and makes code more self-documenting. The annotations serve as both
+documentation and optional runtime checks when jaxtyping's validation is
+enabled.
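+
+As a sketch of how such runtime checking could be enabled (this assumes
+jaxtyping's `jaxtyped` decorator combined with a third-party runtime type
+checker such as `beartype`; it is an optional pattern, not a PhysicsNeMo
+requirement):
+
+```python
+import torch
+from beartype import beartype
+from jaxtyping import Float, jaxtyped
+
+@jaxtyped(typechecker=beartype)  # validates shapes and dtypes at call time
+def scale(
+    x: Float[torch.Tensor, "batch channels"],
+    gain: float,
+) -> Float[torch.Tensor, "batch channels"]:
+    return x * gain
+
+scale(torch.randn(8, 3), 2.0)       # passes the runtime check
+# scale(torch.randn(8, 3, 4), 2.0)  # would raise a type-checking error
+```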
+
+**Example:**
+
+```python
+import torch
+from jaxtyping import Float
+
+from physicsnemo import Module
+
+class MyConvNet(Module):
+    def __init__(self, in_channels: int, out_channels: int):
+        super().__init__()
+        # padding=1 preserves height/width, matching the annotated output shape
+        self.conv = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
+
+    def forward(
+        self,
+        x: Float[torch.Tensor, "batch in_channels height width"]
+    ) -> Float[torch.Tensor, "batch out_channels height width"]:
+        """Process input with convolution."""
+        return self.conv(x)
+
+def process_attention(
+    query: Float[torch.Tensor, "batch seq_len d_model"],
+    key: Float[torch.Tensor, "batch seq_len d_model"],
+    value: Float[torch.Tensor, "batch seq_len d_model"]
+) -> Float[torch.Tensor, "batch seq_len d_model"]:
+    """Compute attention with clear shape annotations."""
+    scores = query @ key.transpose(-2, -1) / key.shape[-1] ** 0.5
+    return torch.softmax(scores, dim=-1) @ value
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: No jaxtyping annotations
+def forward(self, x: torch.Tensor) -> torch.Tensor:
+    return self.layer(x)
+
+# WRONG: Using plain comments instead of jaxtyping
+def forward(self, x: torch.Tensor) -> torch.Tensor:
+    # x: (batch, channels, height, width)  # Use jaxtyping instead
+    return self.layer(x)
+
+# WRONG: Incomplete annotations (missing jaxtyping for tensor arguments)
+def forward(
+    self,
+    x: Float[torch.Tensor, "b c h w"],
+    mask: torch.Tensor  # Missing jaxtyping annotation
+) -> Float[torch.Tensor, "b c h w"]:
+    return self.layer(x, mask)
+```
+
+---
+
+### MOD-007: Backward compatibility
+
+**Description:**
+
+For any model in `physicsnemo/nn` or `physicsnemo/models`, it is strictly
+forbidden to change the signature of `__init__`, any public method, or any
+public attribute without maintaining backward compatibility. This includes:
+- Adding new required parameters
+- Removing parameters
+- Renaming parameters
+- Changing parameter types
+- Changing return types
+
+If a signature change is absolutely necessary, the developer must:
+1. Add a backward compatibility mapping in the model class
+2. Increment the model version number
+3. Maintain support for the old API for at least 2 release cycles
+4. Add deprecation warnings for the old API
+
+**Rationale:**
+
+PhysicsNeMo is used in production environments and research code where
+unexpected API changes can break critical workflows. Maintaining backward
+compatibility ensures that users can upgrade to new versions without their code
+breaking. Version numbers and compatibility mappings provide a clear migration
+path when changes are necessary.
+
+**Example:**
+
+```python
+from typing import Any, Dict
+
+from physicsnemo import Module
+
+# Good: Adding optional parameter with default value (backward compatible)
+class MyModel(Module):
+    __model_checkpoint_version__ = "2.0"
+    __supported_model_checkpoint_version__ = {
+        "1.0": "Loading checkpoint from version 1.0 (current is 2.0). Still supported."
+    }
+
+    def __init__(
+        self,
+        input_dim: int,
+        output_dim: int,
+        dropout: float = 0.0,  # New parameter with default
+        new_feature: bool = False  # New parameter with default
+    ):
+        super().__init__(meta=MyModelMetaData())
+        # ... implementation
+
+# Good: Proper backward compatibility when parameter must be renamed
+class MyModel(Module):
+    __model_checkpoint_version__ = "2.0"
+    __supported_model_checkpoint_version__ = {
+        "1.0": (
+            "Loading MyModel checkpoint from version 1.0 (current is 2.0). "
+            "Parameter 'hidden_dim' has been renamed to 'hidden_size'. "
+            "Consider re-saving to upgrade to version 2.0."
+        )
+    }
+
+    @classmethod
+    def _backward_compat_arg_mapper(
+        cls, version: str, args: Dict[str, Any]
+    ) -> Dict[str, Any]:
+        """Map arguments from older versions to current version format."""
+        # Call parent class method first
+        args = super()._backward_compat_arg_mapper(version, args)
+
+        if version == "1.0":
+            # Map old parameter name to new name
+            if "hidden_dim" in args:
+                args["hidden_size"] = args.pop("hidden_dim")
+
+            # Remove deprecated parameters that are no longer used
+            if "legacy_param" in args:
+                _ = args.pop("legacy_param")
+
+        return args
+
+    def __init__(
+        self,
+        input_dim: int,
+        output_dim: int,
+        hidden_size: int = 128,  # New name (was 'hidden_dim' in v1.0)
+    ):
+        super().__init__(meta=MyModelMetaData())
+        self.hidden_size = hidden_size
+        # ... implementation
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: Changing parameter name without backward compatibility
+class MyModel(Module):
+    __model_checkpoint_version__ = "2.0"
+    # Missing: __supported_model_checkpoint_version__ and _backward_compat_arg_mapper
+
+    def __init__(self, input_dim: int, hidden_size: int):  # Renamed from hidden_dim
+        super().__init__(meta=MyModelMetaData())
+        # WRONG: Old checkpoints with 'hidden_dim' will fail to load!
+
+# WRONG: Adding required parameter without default
+class MyModel(Module):
+    __model_checkpoint_version__ = "2.0"
+
+    def __init__(self, input_dim: int, output_dim: int, new_param: int):  # No default!
+        super().__init__(meta=MyModelMetaData())
+        # WRONG: Old checkpoints without 'new_param' will fail to load!
+
+# WRONG: Not incrementing version when making breaking changes
+class MyModel(Module):
+    __model_checkpoint_version__ = "1.0"  # Should be "2.0"!
+
+    @classmethod
+    def _backward_compat_arg_mapper(cls, version: str, args: Dict[str, Any]) -> Dict[str, Any]:
+        # WRONG: Making breaking changes but not updating version number
+        if "hidden_dim" in args:
+            args["hidden_size"] = args.pop("hidden_dim")
+        return args
+
+# WRONG: Not calling super() in _backward_compat_arg_mapper
+class MyModel(Module):
+    @classmethod
+    def _backward_compat_arg_mapper(cls, version: str, args: Dict[str, Any]) -> Dict[str, Any]:
+        # WRONG: Missing super()._backward_compat_arg_mapper(version, args)
+        if version == "1.0":
+            if "hidden_dim" in args:
+                args["hidden_size"] = args.pop("hidden_dim")
+        return args
+
+# WRONG: Changing return type without compatibility
+class MyModel(Module):
+    __model_checkpoint_version__ = "2.0"
+
+    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+        # WRONG: v1.0 returned single tensor, v2.0 returns tuple - breaks user code!
+        return output, loss
+```
+
+---
+
+### MOD-008: Minimal CI testing requirements
+
+**Description:**
+
+Every model in a module file `my_model_name.py` in `physicsnemo/nn` or
+`physicsnemo/models` must have corresponding tests in
+`test/models/test_.py`. Tests should roughly follow a similar
+template, and three types of tests are required:
+
+1. **Constructor and attribute tests**: Verify model instantiation and all
+   public attributes (excluding buffers and parameters).
+
+2. **Non-regression test with reference data**: Instantiate a model, run a
+   forward pass, and compare outputs against reference data saved in a `.pth`
+   file.
+
+3. **Non-regression test from checkpoint**: Load a model from a checkpoint file
+   (`.mdlus`) and verify outputs match reference data. One way to generate
+   these reference artifacts is sketched below.
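+
+The reference artifacts used by test types 2 and 3 can be generated once with a
+short script. The following is only a sketch under stated assumptions: the
+`save` method for writing `.mdlus` checkpoints and the exact file locations are
+assumptions to be adapted to the actual test layout, and `MyModel` is a
+hypothetical model.
+
+```python
+import torch
+
+from physicsnemo.models import MyModel  # hypothetical model
+
+model = MyModel(input_dim=64, output_dim=32)
+x = torch.randn(4, 64)  # realistic shape, no singleton dimensions
+with torch.no_grad():
+    out = model(x)
+
+# Reference inputs/outputs for the non-regression test (type 2)
+torch.save({"x": x, "out": out}, "test/models/data/my_model_default_v1.0.pth")
+
+# Checkpoint for the from-checkpoint test (type 3)
+model.save("test/models/data/my_model_default_v1.0.mdlus")
+```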
+
+Additional requirements:
+- All tests must use `pytest` parameterization syntax.
+- At least 2 configurations must be tested: one with all default arguments, one
+  with non-default arguments. More variations specific to relevant use cases
+  are also encouraged.
+- Test tensors must have realistic shapes (e.g. no singleton dimensions) and
+  should be as meaningful and representative of actual use cases as possible.
+- All public methods must have the same non-regression tests as the forward
+  method.
+- Simply checking output shapes is NOT sufficient - actual values must be
+  compared.
+- **Critical:** Per MOD-002, it is forbidden to move a model out of the
+  experimental stage/directory without these tests.
+
+**Rationale:**
+
+Comprehensive tests ensure model correctness and prevent regressions as code
+evolves. Non-regression tests with reference data catch subtle numerical changes
+that could break reproducibility. Checkpoint tests verify serialization and
+deserialization work correctly. Parameterized tests ensure models work across
+different configurations. These tests are required before models can graduate
+from experimental to production status.
+
+**Example:**
+
+```python
+# Good: Following the test_layers_unet_block.py template
+import pytest
+import torch
+
+import physicsnemo
+from physicsnemo.models import MyModel
+
+def _instantiate_model(cls, seed: int = 0, **kwargs):
+    """Helper to create model with reproducible parameters."""
+    model = cls(**kwargs)
+    gen = torch.Generator(device="cpu")
+    gen.manual_seed(seed)
+    with torch.no_grad():
+        for param in model.parameters():
+            param.copy_(torch.randn(param.shape, generator=gen, dtype=param.dtype))
+    return model
+
+@pytest.mark.parametrize("device", ["cuda:0", "cpu"])
+@pytest.mark.parametrize(
+    "config",
+    ["default", "custom"],
+    ids=["with_defaults", "with_custom_args"]
+)
+def test_my_model_non_regression(device, config):
+    """Test model forward pass against reference output."""
+    # Setup model configuration
+    if config == "default":
+        model = _instantiate_model(MyModel, input_dim=64, output_dim=32)
+    else:
+        model = _instantiate_model(
+            MyModel,
+            input_dim=64,
+            output_dim=32,
+            hidden_dim=256,
+            dropout=0.1
+        )
+
+    model = model.to(device)
+
+    # Test constructor and attributes
+    assert model.input_dim == 64
+    assert model.output_dim == 32
+    if config == "custom":
+        assert model.hidden_dim == 256
+        assert model.dropout == 0.1
+
+    # Load reference data (meaningful shapes, no singleton dimensions)
+    data = torch.load(f"test/models/data/my_model_{config}_v1.0.pth")
+    x = data["x"].to(device)  # Shape: (4, 64), not (1, 64)
+    out_ref = data["out"].to(device)
+
+    # Run forward and compare
+    out = model(x)
+    assert torch.allclose(out, out_ref, atol=1e-5, rtol=1e-5)
+
+@pytest.mark.parametrize("device", ["cuda:0", "cpu"])
+def test_my_model_from_checkpoint(device):
+    """Test loading model from checkpoint and verify outputs."""
+    model = physicsnemo.Module.from_checkpoint(
+        "test/models/data/my_model_default_v1.0.mdlus"
+    ).to(device)
+
+    # Test attributes
+    assert model.input_dim == 64
+    assert model.output_dim == 32
+
+    # Load reference and verify
+    data = torch.load("test/models/data/my_model_default_v1.0.pth")
+    x = data["x"].to(device)
+    out_ref = data["out"].to(device)
+    out = model(x)
+    assert torch.allclose(out, out_ref, atol=1e-5, rtol=1e-5)
+
+@pytest.mark.parametrize("device", ["cuda:0", "cpu"])
+def test_my_model_public_method_non_regression(device):
+    """Test public method compute_loss against reference."""
+    model = _instantiate_model(MyModel, input_dim=64, output_dim=32).to(device)
+
+    data = torch.load("test/models/data/my_model_loss_v1.0.pth")
+    pred = data["pred"].to(device)
+    target = data["target"].to(device)
+    loss_ref = data["loss"].to(device)
+
+    loss = model.compute_loss(pred, target)
+    assert torch.allclose(loss, loss_ref, atol=1e-6, rtol=1e-6)
+```
+
+**Anti-pattern:**
+
+```python
+# WRONG: Only testing output shapes
+def test_my_model_bad(device):
+    model = MyModel(input_dim=64, output_dim=32).to(device)
+    x = torch.randn(4, 64).to(device)
+    out = model(x)
+    assert out.shape == (4, 32)  # NOT SUFFICIENT!
+
+# WRONG: Using singleton dimensions in test data
+def test_my_model_bad(device):
+    x = torch.randn(1, 1, 64)  # WRONG: Trivial shapes hide bugs
+
+# WRONG: No parameterization
+def test_my_model_bad():
+    model = MyModel(input_dim=64, output_dim=32)  # Only tests defaults
+
+# WRONG: No checkpoint loading test
+# (Missing test_my_model_from_checkpoint entirely)
+
+# WRONG: Public methods not tested
+class MyModel(Module):
+    def compute_loss(self, pred, target):  # No test for this method
+        return F.mse_loss(pred, target)
+```
+
+---
+
+## Compliance
+
+When implementing models, ensure all rules are followed. Code reviews should
+verify compliance with each rule and enforce the rules as strictly as possible.
+For exceptions to these rules, document the reasoning in code comments and
+obtain approval during code review.
diff --git a/greptile.json b/greptile.json
index 80e1bcc4ea..b69f22795c 100644
--- a/greptile.json
+++ b/greptile.json
@@ -51,8 +51,8 @@
     "files": [
       {
         "scope": [],
-        "path": "",
-        "description": ""
+        "path": "CODING_STANDARDS/MODELS_IMPLEMENTATION.md",
+        "description": "List of coding standards and rules to follow when editing any model code (new or existing). Typically applies to any model class definition in a Python module file."
      }
    ]
  }