Conversation

@sridhs21 commented Aug 29, 2025

This PR enhances the training pipeline to address the vanishing-gradient problems that were blocking a successful mixed precision training implementation. The primary focus is adding automatic mixed precision (AMP) support, along with the architectural and pipeline improvements needed to keep gradients stable. The changes include:

  • Automatic Mixed Precision (AMP) training support:

    • Adds --use-amp and --amp-dtype flags to enable mixed precision training with both float16 and bfloat16 support. This can significantly reduce memory usage and training time on compatible GPUs while maintaining numerical stability through proper gradient scaling; a sketch of the resulting training step follows this list.
  • Enhanced U-Net architecture with residual connections:

    • Replaces the basic U-Net with a residual block-based design that includes batch normalization and skip connections. This improves gradient flow and training stability, and the added dropout layers curb overfitting, which had been a recurring problem during the AMP work; a sketch of such a block follows this list.
  • Improved training pipeline:

    • Implements patch-based training through XPointPatchDataset for better data augmentation, adds feature normalization for training stability, includes gradient clipping to prevent exploding gradients, and adds early stopping with patience to prevent overfitting. The training data is also undersampled: patches containing no X-points are discarded until they match the number of patches that do contain X-points, producing a balanced dataset (a sketch of this balancing step follows the list).
  • Enhanced optimization:

    • Switches from Adam to AdamW optimizer with weight decay for better generalization, adds cosine annealing learning rate scheduling, and improves checkpoint functionality to save/load all training state, including the AMP scaler (see the sketch after this list).
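The sketch below illustrates the kind of training step the AMP flags enable. It is a minimal illustration under stated assumptions, not the PR's exact code: `model`, `loader`, `criterion`, and `optimizer` are placeholders. It also shows how the gradient clipping mentioned above interacts with the scaler, since gradients must be unscaled before clipping.

```python
import torch

# Minimal AMP training step (illustrative; `model`, `loader`, `criterion`,
# and `optimizer` are placeholders, not the PR's actual objects).
amp_dtype = torch.bfloat16  # or torch.float16, per the --amp-dtype flag
# Loss scaling is only needed for float16; bfloat16 has enough dynamic range.
scaler = torch.cuda.amp.GradScaler(enabled=(amp_dtype == torch.float16))

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and loss in reduced precision.
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```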
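For the residual U-Net, a block along these lines (conv-BN pairs, dropout, and an additive skip connection with a 1x1 projection when channel counts differ) is the common pattern; this is a sketch of the idea, not the PR's exact architecture.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv-BN layers with dropout and an additive skip connection.
    (Sketch only; the PR's block may differ in layer order and sizes.)"""
    def __init__(self, in_ch: int, out_ch: int, dropout: float = 0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Dropout2d(dropout),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Project the identity path with a 1x1 conv when shapes differ.
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))
```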
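The undersampling described above could look like the following; `balance_patches` is a hypothetical helper, and XPointPatchDataset may implement the same idea differently.

```python
import random

def balance_patches(patches, has_xpoint, seed=0):
    """Undersample the majority class: keep every patch containing an
    X-point plus an equal-sized random subset of the empty patches.
    (Hypothetical helper; XPointPatchDataset may do this internally.)"""
    pos = [p for p, flag in zip(patches, has_xpoint) if flag]
    neg = [p for p, flag in zip(patches, has_xpoint) if not flag]
    random.Random(seed).shuffle(neg)  # deterministic subset for reproducibility
    return pos + neg[: len(pos)]
```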
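Finally, the optimizer, scheduler, and checkpoint changes might be wired up as below; the hyperparameter values and checkpoint path are placeholders, not the PR's actual settings.

```python
import torch

# Placeholder hyperparameters; the PR's actual values may differ.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

# Save everything needed to resume training, including the AMP scaler state.
torch.save(
    {
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "scaler": scaler.state_dict(),
    },
    "checkpoint.pt",
)
```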

I have also updated the README to document the new command-line flags.

@sridhs21 closed this Aug 29, 2025
@sridhs21 reopened this Aug 29, 2025
@cwsmith left a comment


Looks good. Thank you. I appreciate the update to the README and the verbose PR description.

Would you please remove the whitespace-only changes?

Update: as discussed in the meeting, if changing the whitespace breaks Python, then please ignore the whitespace request.
