Description
When the uffd memory cache monitor is in use, delivering a signal whose handler returns, such as SIGPROF, to the thread running ofi_uffd_handler appears to cause that thread to exit early.
In practice, this issue has caused NCCL programs run under signal-based profilers to deadlock or hang when running with FI_MR_CACHE_MONITOR=userfaultfd across multiple nodes.
This is likely due to the handling of system calls such as poll and read, which can fail with EINTR when a signal interrupts them and so cause the handler loop to break early:
ret = poll(fds, 2, -1);
if (ret < 0 || fds[1].revents)
	break;
...
ret = read(uffd.fd, &msg, sizeof(msg));
if (ret != sizeof(msg)) {
	pthread_mutex_unlock(&mm_lock);
	pthread_rwlock_unlock(&mm_list_rwlock);
	if (errno != EAGAIN)
		break;
These calls might need expanding to handle the EINTR case.
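
One possible shape for that EINTR handling is sketched below. This is only an illustration, not a proposed patch: the helper names poll_retry_eintr and read_retry_eintr are hypothetical and do not exist in libfabric. Each helper reissues the call when it fails with EINTR and passes every other result through unchanged.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a libfabric function): call poll() again
 * whenever a signal handler interrupts it. With a finite timeout the
 * full timeout starts over on each retry; with -1 that does not matter.
 */
static int poll_retry_eintr(struct pollfd *fds, nfds_t nfds, int timeout)
{
	int ret;

	do {
		ret = poll(fds, nfds, timeout);
	} while (ret < 0 && errno == EINTR);

	return ret;
}

/*
 * Hypothetical helper (not a libfabric function): call read() again
 * whenever a signal handler interrupts it before any data is read.
 */
static ssize_t read_retry_eintr(int fd, void *buf, size_t len)
{
	ssize_t ret;

	do {
		ret = read(fd, buf, len);
	} while (ret < 0 && errno == EINTR);

	return ret;
}
```

An equivalent change inside the handler loop itself would be to continue rather than break when poll fails with EINTR, and to treat EINTR the same way as EAGAIN after a short read.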
To Reproduce
thread_profiler.tar.gz
Compile the provided thread_profiler library by running make -f thread-profiler.makefile. This creates a library that intercepts pthread_create calls to set up signal timers, which send SIGPROF to the spawned threads.
If you run a program that uses libfabric with this library preloaded via LD_PRELOAD=libthreadprofiler.so, you will see that the thread running ofi_uffd_handler exits early. This can be confirmed by inspecting the threads in GDB or by running strace on the application.
Expected behavior
The thread should handle signal interruptions gracefully and keep running.
Output
Here is the strace output after a SIGPROF signal is sent:
86483 14:52:11.041305 <... ppoll resumed>) = ? ERESTARTNOHAND (To be restarted if no handler) <0.205866>
86483 14:52:11.041328 --- SIGPROF {si_signo=SIGPROF, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_int=9046592, si_ptr=0x8a0a40} ---
86483 14:52:11.041363 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) <0.000008>
86483 14:52:11.041413 madvise(0x4000412f0000, 1966080, MADV_DONTNEED) = 0 <0.000012>
86483 14:52:11.041457 exit(0) = ?
Environment:
Observed on a Linux HPE Slingshot system with both MPICH and CrayMPICH. Confirmed to occur with libfabric versions 2.3.1 and 2.2.0.