@speriaswamy-amd
Contributor

Summary:
Backports RCCL heatmap and result validation features from main to release/cvs-0.1.0.

Changes:

  • Add RCCL test result validation and aggregation with metadata collection
  • Add heatmap comparison against reference/golden data with percentage-based visualization
  • Handle frontend and backend NIC configuration
  • Handle missing or NaN values and capitalization edge cases
  • Copy results to management node before losing slurm allocation
  • Add bandwidth dip detection with configurable thresholds for large message sizes
  • Add build_rccl_heatmap_table() and build_rccl_heatmap_metadata_table() functions
  • Minor f-string quote fixes
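
As a rough illustration of the percentage-based comparison and the bandwidth dip detection listed above, here is a minimal sketch. It is not the code from this PR: the function names `percent_of_reference` and `find_bandwidth_dips`, the `(message_size_bytes, bandwidth)` input shape, and the default thresholds are all assumptions made for the example.

```python
import math


def percent_of_reference(measured, reference):
    """Return measured bandwidth as a percentage of the reference value.

    Returns None when either value is missing (None or NaN) or when the
    reference is zero, so a heatmap cell can be rendered as "no data".
    """
    for value in (measured, reference):
        if value is None or math.isnan(value):
            return None
    if reference == 0:
        return None
    return 100.0 * measured / reference


def find_bandwidth_dips(results, min_size_bytes=1 << 20, max_drop_pct=10.0):
    """Flag large message sizes whose bandwidth falls below the running peak.

    `results` is a list of (message_size_bytes, bandwidth) pairs sorted by
    message size. Both thresholds are hypothetical defaults: only sizes of
    at least `min_size_bytes` are checked, and a dip is reported when the
    bandwidth is more than `max_drop_pct` percent below the highest
    bandwidth seen at any smaller size.
    """
    dips = []
    peak = None
    for size, bandwidth in results:
        # Skip missing samples rather than treating them as zero bandwidth.
        if bandwidth is None or math.isnan(bandwidth):
            continue
        if peak is not None and peak > 0 and size >= min_size_bytes:
            drop_pct = 100.0 * (peak - bandwidth) / peak
            if drop_pct > max_drop_pct:
                dips.append((size, round(drop_pct, 1)))
        peak = bandwidth if peak is None else max(peak, bandwidth)
    return dips


if __name__ == "__main__":
    samples = [
        (1 << 19, 50.0),
        (1 << 20, 100.0),
        (1 << 21, 80.0),          # 20% below the running peak of 100.0
        (1 << 22, float("nan")),  # missing sample, ignored
        (1 << 23, 120.0),
    ]
    print(find_bandwidth_dips(samples))
    print(percent_of_reference(90.0, 100.0))
```

Comparing against the running peak instead of the immediately preceding sample keeps one missing or noisy point from hiding a later dip. The validation in this PR may use a different rule.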

New files:

  • models/rccl.py
  • models/__init__.py
  • tests/rccl/rccl_heatmap_cvs.py
  • .gitignore

@speriaswamy-amd
Contributor Author

Closing this PR because it was not accepted as a candidate for the current release. The changes are planned to be part of the next release.
