Description
We have a tape-backed filesystem with retrieve-on-read, meaning that processes can go into uninterruptible sleep (D) when trying to read a file, as they wait for it to be retrieved from tape. When this happens, reading /proc/$PID/cmdline on the sleeping process hangs forever, which I believe is explained here. Consequently, the ps calls done by NHC fail, and this triggers a node health check alert. However, this is not desirable: the node is actually running fine, it's just that one or more processes are sleeping and can't report their cmdline.
I'm not entirely sure which ps invocation in NHC is triggering this, as there are several. Does NHC actually need to request the cmdline, given that reading it can hang? Alternatively, if the cmdline is helpful to most users, could there be a configuration option to turn it off?
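To illustrate what "not requesting the cmdline" could look like, here is a rough sketch (not NHC code) that lists processes using only /proc/$PID/stat, which carries the short command name and the state letter. It assumes that reading stat does not block the way reading cmdline does for a process stuck in D state; I have not verified that on every kernel. The function names `parse_stat` and `list_processes` are made up for this example.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state).

    The comm field is wrapped in parentheses and may itself contain
    spaces or ')' characters, so split on the first '(' and the last ')'.
    """
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left].strip())
    comm = text[left + 1:right]
    # The state letter (R, S, D, Z, ...) is the first field after comm.
    state = text[right + 1:].split()[0]
    return pid, comm, state


def list_processes(proc_root="/proc"):
    """Return (pid, comm, state) for every process, never touching cmdline."""
    results = []
    if not os.path.isdir(proc_root):
        return results  # not a Linux-style /proc; nothing to list
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, encoding="utf-8", errors="replace") as f:
                results.append(parse_stat(f.read()))
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the file was unreadable
    return results


if __name__ == "__main__":
    for pid, comm, state in list_processes():
        note = " (uninterruptible sleep)" if state == "D" else ""
        print(f"{pid}\t{state}\t{comm}{note}")
```

The trade-off is that comm is only the short executable name (truncated by the kernel), so any check that matches on full command-line arguments would still need the cmdline.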