We have a hidden issue in handling the fatal connection errors with epoll.
If EReadStatus CChannel::recvfrom(sockaddr* addr, CPacket& packet) const returns RST_ERROR, the epoll never gets notified about the error on that socket.
The test condition can be easily reproduced by returning RST_ERROR from the recvfrom function.
Also reverting changes of #468 (i.e. treating WSAECONNRESET as a fatal error on Windows) make the issue reproduceable on Windows (see the ConnectionTimeoutTest in the test folder).
What is happening in case recvfrom DOES NOT return an error?
The worker thread CRcvQueue::worker is in the while() loop. The socket, that attempts to connect, is in the list m_lRendezvousID. The CRcvQueue::worker thread calls void CRendezvousQueue::updateConnStatus() on every loop iteration. The function updates the socket status, in particular, when TTL is expired and the socket has to be closed.
// partial pseudocode
void* CRcvQueue::worker(void* param)
{
// (...)
while (!self->m_bClosing)
{
EReadStatus rst = self->worker_RetrieveUnit(Ref(id), Ref(unit), &sa);
if (rst == RST_ERROR)
{
// !!! This break will skip updateConnStatus on the socket
// and epoll is never notified
break;
}
self->m_pRendezvousQueue->updateConnStatus(rst, cst, unit->m_Packet);
}
return;
}
One of the following calls to void CRendezvousQueue::updateConnStatus() triggers an opened socket to be closed by timeout.
After this the epoll gets notified about the error on that socket. And then all the threads are closed as expected.
What is happening in case recvfrom DOES return an error?
If the recvfrom return an error, the worker thread CRcvQueue::worker loop breaks and thread function returns. This way the next call to the updateConnStatus() is skipped and never happens. Therefore the epoll is never notified about the error on this socket. srt_epoll_wait can only return on a wait timeout, instead of handling the error event on the socket.
This is probably a minor issue, because in general we still have the epoll wait timeout, that handles these conditions.
We have a hidden issue in handling the fatal connection errors with epoll.
If
EReadStatus CChannel::recvfrom(sockaddr* addr, CPacket& packet) constreturnsRST_ERROR, the epoll never gets notified about the error on that socket.The test condition can be easily reproduced by returning
RST_ERRORfrom therecvfromfunction.Also reverting changes of #468 (i.e. treating
WSAECONNRESETas a fatal error on Windows) make the issue reproduceable on Windows (see theConnectionTimeoutTestin thetestfolder).What is happening in case
recvfromDOES NOT return an error?The worker thread
CRcvQueue::workeris in the while() loop. The socket, that attempts to connect, is in the listm_lRendezvousID. TheCRcvQueue::workerthread callsvoid CRendezvousQueue::updateConnStatus()on every loop iteration. The function updates the socket status, in particular, when TTL is expired and the socket has to be closed.One of the following calls to
void CRendezvousQueue::updateConnStatus()triggers an opened socket to be closed by timeout.After this the
epollgets notified about the error on that socket. And then all the threads are closed as expected.What is happening in case
recvfromDOES return an error?If the
recvfromreturn an error, the worker threadCRcvQueue::workerloop breaks and thread function returns. This way the next call to theupdateConnStatus()is skipped and never happens. Therefore the epoll is never notified about the error on this socket.srt_epoll_waitcan only return on a wait timeout, instead of handling the error event on the socket.This is probably a minor issue, because in general we still have the
epollwait timeout, that handles these conditions.