Cray MPICH Issue - OFI poll failed #7061
saisandeepdammati started this conversation in General
Is this with the HPE Slingshot network?
Hi, yes, the machine uses the HPE Slingshot network.
Howdy!
I am running a job using a parallel, adaptive-mesh-refinement-based CFD code on a Cray machine, Carpenter at ERDC. The code is written in C++ and is compiled with cray-mpich (version 8.1.26).
The job uses 4224 MPI processes on 22 nodes with 192 cores per node. It runs for some time (15-20 minutes) and then crashes with a signal 9 error and the following MPICH error (the complete error file is attached as a text file):
I have looked at all the MPI_Send calls in the code and they look sensible; however, the runs still crash with this error. Is this a familiar issue? Could you please provide a workaround or a fix?
Thanks in advance.
cray_mpich_error_carpenter.txt