We're observing numerical discrepancies when running PyFR with CUDA-aware MPI compared to standard MPI. These differences persist across different machines and configurations, raising concerns about the consistency of numerical results when using CUDA-aware MPI.
Expected Behavior:
Numerical results should be consistent between CUDA-aware MPI and standard MPI, with differences limited to floating-point round-off error (e.g., last few digits).
Actual Behavior:
- Differences are observed beyond typical floating-point round-off tolerance.
- Discrepancies are more pronounced in multi-GPU runs.
- The differences persist across multiple machines with different hardware configurations.
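To put a number on "beyond typical round-off", here is a minimal sketch of how we compare two runs. It assumes both runs write PyFR solution files (.pyfrs, which are HDF5 containers) with identical dataset layouts; the file names and the tolerance are placeholders, not part of PyFR itself.

```python
import h5py
import numpy as np

def compare_solutions(path_a, path_b, rtol=1e-12):
    """Report the largest absolute/relative difference over all float datasets."""
    worst = (0.0, 0.0, None)
    with h5py.File(path_a, 'r') as fa, h5py.File(path_b, 'r') as fb:
        def visit(name, obj):
            nonlocal worst
            if isinstance(obj, h5py.Dataset) and np.issubdtype(obj.dtype, np.floating):
                a, b = obj[...], fb[name][...]
                abs_err = np.max(np.abs(a - b))
                rel_err = np.max(np.abs(a - b) / (np.abs(a) + 1e-300))
                if abs_err > worst[0]:
                    worst = (abs_err, rel_err, name)
        fa.visititems(visit)
    abs_err, rel_err, name = worst
    print(f'max abs diff {abs_err:.3e} (rel {rel_err:.3e}) in dataset {name!r}')
    # Relative differences near machine epsilon (~1e-16 for float64) are consistent
    # with round-off; values orders of magnitude larger are not.
    return rel_err <= rtol

# Placeholder file names for a CUDA-aware and a standard-MPI run of the same case
compare_solutions('soln_cuda_aware.pyfrs', 'soln_standard.pyfrs')
```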
Suggested Next Steps:
- Investigate whether different code paths are taken in PyFR when using CUDA-aware MPI vs. standard MPI.
- Check if asynchronous communication or stream handling differs between the two MPI modes.
- Confirm whether the issue is related to the order of operations or to numerical reductions performed across GPUs (a minimal illustration follows this list).
- Provide guidance on recommended MPI configurations for reproducible numerical results with CUDA.
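To illustrate the reduction-ordering point, below is a small, PyFR-independent sketch (plain NumPy, no MPI required) showing that floating-point addition is not associative: combining per-rank partial sums in a different order changes the last digits of the result. The 8-way split simulating ranks/GPUs is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(1_000_000)

# Pretend the values are distributed over 8 ranks/GPUs
parts = np.array_split(values, 8)
partials = np.array([p.sum() for p in parts])

# Combine the per-rank partial sums in three different orders
tree_order = partials.reshape(4, 2).sum(axis=1).reshape(2, 2).sum(axis=1).sum()
linear_order = 0.0
for p in partials:          # ranks visited 0, 1, 2, ... in turn
    linear_order += p
reversed_order = 0.0
for p in partials[::-1]:    # ranks visited in the opposite order
    reversed_order += p

print(f'linear   : {linear_order:.17g}')
print(f'reversed : {reversed_order:.17g}')
print(f'tree     : {tree_order:.17g}')
# The three totals typically differ in the last few digits; they may occasionally
# coincide, but the point is that the result depends on the combination order.
```

A single reduction only perturbs the last few digits, but if the CUDA-aware and standard MPI paths combine contributions in different orders, such perturbations are reintroduced every time step and can plausibly grow into the larger discrepancies reported above.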