Conversation

@sridhs21 commented Aug 29, 2025

This PR enhances the training pipeline to address the vanishing-gradient problems that were blocking a successful mixed precision training implementation. The primary focus is adding automatic mixed precision (AMP) support, along with the architectural and pipeline improvements needed to keep gradients stable. The changes include:

  • Automatic Mixed Precision (AMP) training support:

    • Adds --use-amp and --amp-dtype flags to enable mixed precision training with both float16 and bfloat16 support. This can significantly reduce memory usage and training time on compatible GPUs while maintaining numerical stability through proper gradient scaling; a sketch of the resulting training step follows this list.
  • Enhanced U-Net architecture with residual connections:

    • Replaces the basic U-Net with a residual block-based design that includes batch normalization and skip connections. This improves gradient flow and training stability, and the added dropout layers curb overfitting, which had been a recurring problem during the AMP work; a sketch of such a block follows this list.
  • Improved training pipeline:

    • Implements patch-based training through XPointPatchDataset for better data augmentation, adds feature normalization for training stability, includes gradient clipping to prevent exploding gradients, and adds early stopping with patience to prevent overfitting. The training data is also undersampled: patches containing no X-points are discarded until they match the number of patches that do contain X-points, producing a balanced dataset (a sketch of this balancing step follows the list).
  • Enhanced optimization:

    • Switches from Adam to AdamW optimizer with weight decay for better generalization, adds cosine annealing learning rate scheduling, and improves checkpoint functionality to save/load all training state, including the AMP scaler (see the sketch after this list).
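The sketch below illustrates the kind of training step the AMP flags enable. It is a minimal illustration under stated assumptions, not the PR's exact code: `model`, `loader`, `criterion`, and `optimizer` are placeholders. It also shows how the gradient clipping mentioned above interacts with the scaler, since gradients must be unscaled before clipping.

```python
import torch

# Minimal AMP training step (illustrative; `model`, `loader`, `criterion`,
# and `optimizer` are placeholders, not the PR's actual objects).
amp_dtype = torch.bfloat16  # or torch.float16, per the --amp-dtype flag
# Loss scaling is only needed for float16; bfloat16 has enough dynamic range.
scaler = torch.cuda.amp.GradScaler(enabled=(amp_dtype == torch.float16))

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and loss in reduced precision.
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```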
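For the residual U-Net, a block along these lines (conv-BN pairs, dropout, and an additive skip connection with a 1x1 projection when channel counts differ) is the common pattern; this is a sketch of the idea, not the PR's exact architecture.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv-BN layers with dropout and an additive skip connection.
    (Sketch only; the PR's block may differ in layer order and sizes.)"""
    def __init__(self, in_ch: int, out_ch: int, dropout: float = 0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Dropout2d(dropout),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Project the identity path with a 1x1 conv when shapes differ.
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))
```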
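The undersampling described above could look like the following; `balance_patches` is a hypothetical helper, and XPointPatchDataset may implement the same idea differently.

```python
import random

def balance_patches(patches, has_xpoint, seed=0):
    """Undersample the majority class: keep every patch containing an
    X-point plus an equal-sized random subset of the empty patches.
    (Hypothetical helper; XPointPatchDataset may do this internally.)"""
    pos = [p for p, flag in zip(patches, has_xpoint) if flag]
    neg = [p for p, flag in zip(patches, has_xpoint) if not flag]
    random.Random(seed).shuffle(neg)  # deterministic subset for reproducibility
    return pos + neg[: len(pos)]
```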
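Finally, the optimizer, scheduler, and checkpoint changes might be wired up as below; the hyperparameter values and checkpoint path are placeholders, not the PR's actual settings.

```python
import torch

# Placeholder hyperparameters; the PR's actual values may differ.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

# Save everything needed to resume training, including the AMP scaler state.
torch.save(
    {
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "scaler": scaler.state_dict(),
    },
    "checkpoint.pt",
)
```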

I have also updated the README to document the new command-line flags.

@sridhs21 closed this Aug 29, 2025
@sridhs21 reopened this Aug 29, 2025
@cwsmith left a comment


Looks good. Thank you. I appreciate the update to the README and the verbose PR description.

Would you please remove the whitespace-only changes?

Update: as discussed in the meeting, if changing the whitespace breaks Python, then please ignore the whitespace request.
