Description
As noted in #43, @lee218llnl reported:
1. Occasionally I get hangs with STAT, particularly after running it multiple times. It appears to be in lmon__fe.cxx on line 4601 in a pthread_cond_timedwait. I don't know if this is an actual effect or just a correlation, but it seems that if I subsequently attach TV to the job and then detach TV, I am able to attach again with STAT.
2. I have also seen hang-like behavior (looping) in cobo, in cobo_connect_hostname. This also appears to happen if I aggressively attach/detach/attach STAT multiple times.
I suspect issue 1 is due to FIFO handling within jsrun, but I need a simple reproducer to prove or disprove this.
I suspect issue 2 is due to a problem with the colocation service within jsrun, but again I need a simple reproducer to prove or disprove this.