Fix FES becoming unreachable after months of runtime #790
Draft
When FES runs for months with few connections, LastUDPPacketReceived becomes stale after clients disconnect. Previously, newCookieCallback only reset it when it was 0 (first connection ever), so subsequent connection attempts after long idle periods would immediately trigger the CheckUDPComm alert because (now - staleTimestamp) > DelayBeforeUPDAlert.

Fix by:
1. Always resetting LastUDPPacketReceived in newCookieCallback when a new connection attempt arrives, not just on the first-ever connection.
2. Updating LastUDPPacketReceived when QUIC datagrams are received, so QUIC-only traffic also prevents the false alert.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot (AI) changed the title from "[WIP] Fix FES unreliability after prolonged uptime" to "Fix FES CheckUDPComm false alert after long idle periods" on Feb 17, 2026.
…clients

When CheckUDPComm triggers, FES sends FS_UDP_ALIVE(false) to WS so it stops routing new clients to the broken FES. When UDP communication resumes, FES sends FS_UDP_ALIVE(true) to allow routing again.

Previously, CheckUDPComm was purely informational and did not affect client routing. The WS would keep sending clients to a FES with broken UDP, causing the shard to be effectively unreachable.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot (AI) changed the title from "Fix FES CheckUDPComm false alert after long idle periods" to "Fix FES becoming unreachable due to CheckUDPComm not affecting client routing" on Feb 17, 2026.
Previously, LastUDPPacketReceived was updated on every iteration of the receive loop, including socket errors. This masked real socket failures because the timestamp appeared fresh, preventing CheckUDPComm from triggering and the FS_UDP_ALIVE(false) notification from being sent to the WS.

If recvfrom() enters a persistent error state (e.g., due to a Linux kernel socket issue after months of runtime), the receive thread spins on errors but the timestamp keeps getting updated, so the WS continues routing clients to the broken FES, making the shard unreachable.

Moving the timestamp update inside the try block ensures it is only set on successful packet receipt. Socket errors will cause the timestamp to go stale, allowing CheckUDPComm to detect the failure and trigger the FS_UDP_ALIVE notification to stop client routing.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
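The timestamp change can be illustrated with a minimal sketch. This is not the FES code: FakeSocket, ReceiveTask and receiveOnce are hypothetical names, and the sketch assumes the receive call reports failure by throwing, as the mention of a try block suggests.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the UDP socket: throws when the receive fails.
struct FakeSocket {
    bool healthy = true;
    int receive() {
        if (!healthy) throw std::runtime_error("recvfrom failed");
        return 42; // pretend 42 bytes were read
    }
};

// Hypothetical receive task holding the watchdog timestamp.
struct ReceiveTask {
    uint32_t LastUDPPacketReceived = 0;

    // One iteration of the receive loop; returns true on a successful receive.
    bool receiveOnce(FakeSocket& sock, uint32_t now) {
        try {
            sock.receive();
            // Only reached when receive() did not throw.
            LastUDPPacketReceived = now;
            return true;
        } catch (const std::runtime_error&) {
            // Timestamp deliberately left untouched so it can go stale.
            return false;
        }
    }
};
```

With this shape, a socket stuck in an error state leaves LastUDPPacketReceived at its last good value, which is what lets a watchdog comparing it against the current time notice the failure.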
Copilot (AI) changed the title from "Fix FES becoming unreachable due to CheckUDPComm not affecting client routing" to "Fix FES becoming unreachable after months by properly detecting UDP socket failures" on Feb 17, 2026.
… failure

The root problem is that the FES UDP socket can enter a bad kernel state after months of runtime on Linux, and there's no recovery mechanism. The recvfrom() call blocks forever with no timeout, and once the socket fails, only a full FES restart can fix it.

Fix by:
1. Setting SO_RCVTIMEO (1 second) on the UDP socket so recvfrom() periodically returns even with no traffic. This prevents the receive thread from getting permanently stuck and allows it to check for exit/rebind requests.
2. Adding a rebindSocket() mechanism that closes and recreates the UDP socket on the same port, resetting any corrupted kernel state. This is triggered when CheckUDPComm detects a persistent failure (no UDP packets received within DelayBeforeUPDAlert).
3. Changing the socket error catch in run() to continue instead of queuing a RemoveClient event. Timeout errors (EAGAIN) from SO_RCVTIMEO should not generate events, and real errors on an unconnected UDP socket have no valid address to use anyway.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

Address code review feedback:
- Extract SO_RCVTIMEO configuration into setRecvTimeout() to avoid duplicating the platform-specific code in constructor and rebindSocket()
- Check setsockopt return value and log a warning if it fails

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
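A rough POSIX-only sketch of the receive timeout follows. The helper names mirror the commit message, but the bodies are assumptions written against raw sockets rather than the project's CUdpSock wrapper; openUdpSocket and recvTimedOut are hypothetical. On Windows, SO_RCVTIMEO takes a DWORD in milliseconds instead of a timeval, which is the kind of platform-specific code the review asked to centralize.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

// Applies a receive timeout to fd; logs and returns false if setsockopt fails.
bool setRecvTimeout(int fd, long milliseconds) {
    timeval tv{};
    tv.tv_sec = milliseconds / 1000;
    tv.tv_usec = (milliseconds % 1000) * 1000;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0) {
        std::perror("setsockopt(SO_RCVTIMEO)");
        return false;
    }
    return true;
}

// Hypothetical helper: opens an unconnected UDP socket bound to an
// ephemeral loopback port. Returns -1 on failure.
int openUdpSocket() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; // let the kernel pick a port
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: true when recvfrom() came back because the timeout
// expired (EAGAIN/EWOULDBLOCK) rather than because of data or a real error.
bool recvTimedOut(int fd) {
    char buf[64];
    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, nullptr, nullptr);
    return n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK);
}
```

In a receive loop, a timed-out call would simply continue to the next iteration, giving the thread a chance to notice an exit or rebind request about once per timeout period.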
Copilot (AI) changed the title from "Fix FES becoming unreachable after months by properly detecting UDP socket failures" to "Fix FES becoming unreachable after months: socket recovery and WS routing" on Feb 17, 2026.
QUIC uses msquic, which manages its own sockets internally, so the UDP socket rebind doesn't help QUIC. The QUIC listener can stop unexpectedly (e.g., due to internal msquic error, TLS certificate issues) with no recovery mechanism.

Fix by:
1. Adding restart() method to CQuicTransceiver that calls stop() then start() on the same port, resetting the listener state.
2. Storing m_ListenPort in CQuicTransceiver for restart.
3. When CheckUDPComm triggers, also restart QUIC if it stopped.
4. Independently monitoring QUIC listener health in the update loop, throttled to check once per DelayBeforeUPDAlert period.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
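The restart-and-monitor idea can be sketched without msquic. Listener below is a hypothetical stand-in for CQuicTransceiver that only tracks a flag and the remembered port; ListenerWatchdog and simulateFailure are likewise invented for illustration, and the real update loop may be structured differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for CQuicTransceiver; the real class wraps msquic.
class Listener {
public:
    bool start(uint16_t port) {
        m_ListenPort = port; // remembered so restart() can reuse it
        m_Listening = true;
        return true;
    }
    void stop() { m_Listening = false; }
    // Stop, then start again on the stored port.
    bool restart() {
        stop();
        return start(m_ListenPort);
    }
    bool listening() const { return m_Listening; }
    uint16_t port() const { return m_ListenPort; }
    void simulateFailure() { m_Listening = false; } // test hook only
private:
    uint16_t m_ListenPort = 0;
    bool m_Listening = false;
};

// Health check throttled to run at most once per `period` seconds.
struct ListenerWatchdog {
    uint32_t lastCheck = 0;

    // Returns true if a restart was performed during this call.
    bool update(Listener& listener, uint32_t now, uint32_t period) {
        if (now - lastCheck < period) return false; // throttled
        lastCheck = now;
        if (listener.listening()) return false;     // healthy
        return listener.restart();
    }
};
```

Throttling keeps the per-tick cost of the update loop negligible while still bounding how long a stopped listener can go unnoticed to roughly one period.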
Copilot (AI) changed the title from "Fix FES becoming unreachable after months: socket recovery and WS routing" to "Fix FES becoming unreachable after months: socket recovery, QUIC restart, and WS routing" on Feb 17, 2026.
Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
…g after rebind

The send subsystem stored a raw pointer copy of DataSock at init time. When rebindSocket() deletes the old socket and creates a new one, this pointer becomes dangling, causing use-after-free when sending.

Fix by changing _DataSock from CUdpSock* to CUdpSock** (pointer to the CFEReceiveTask::DataSock member), so it always dereferences to the current socket after a rebind.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
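The pointer-to-pointer change can be shown in isolation. Sock, Receiver and Sender are hypothetical simplifications of CUdpSock, CFEReceiveTask and the send subsystem; the sketch says nothing about how the real code synchronizes the send and receive threads.

```cpp
#include <cassert>

// Hypothetical socket object; `id` just lets us tell instances apart.
struct Sock { int id; };

// Owns the socket. rebind() destroys the old object and creates a new one,
// so any copy of the old pointer value would dangle afterwards.
struct Receiver {
    Sock* DataSock = new Sock{1};
    void rebind() {
        delete DataSock;
        DataSock = new Sock{2};
    }
    ~Receiver() { delete DataSock; }
};

// Holds the address of the owner's pointer instead of a copy of its value,
// so every dereference sees whichever socket is current.
struct Sender {
    Sock** _DataSock = nullptr;
    void init(Sock** sockPtr) { _DataSock = sockPtr; }
    int currentId() const { return (*_DataSock)->id; }
};
```

The extra indirection costs one more dereference per send, and it stays valid only as long as the Receiver that owns the pointer outlives the Sender.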
Copilot (AI) changed the title from "Fix FES becoming unreachable after months: socket recovery, QUIC restart, and WS routing" to "Fix FES becoming unreachable after months: socket recovery, QUIC restart, WS routing" on Feb 18, 2026.
… at startup

The detection logic uses uint32 seconds since Unix epoch (1970):
- CTime::getSecondsSince1970() returns uint32(time(NULL))
- Overflows on February 7, 2106 (~80 years from now)
- Unsigned subtraction (now - LastUDPPacketReceived) wraps correctly for elapsed time measurement across the overflow boundary

Previously LastUDPPacketReceived was initialized to 0, making (now - 0) equal the full epoch time (~1.77 billion). The only reason this did not false-trigger was the outer guard conditions. It is now initialized to the current time in the constructor for robustness.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
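The wraparound claim is easy to check with a worked example. The elapsed helper is hypothetical; it only restates the (now - LastUDPPacketReceived) subtraction on 32-bit unsigned values.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed seconds between two uint32 timestamps. Unsigned arithmetic is
// modulo 2^32, so the result is still correct if `now` has wrapped past
// zero once since `then` was recorded.
constexpr uint32_t elapsed(uint32_t now, uint32_t then) {
    return now - then;
}

// `then` is 16 seconds before the counter wraps, `now` is 10 seconds after:
// 16 + 10 = 26 seconds have passed.
static_assert(elapsed(10u, 0xFFFFFFF0u) == 26u, "wraps correctly");

// A timestamp left at 0 makes the elapsed time equal to the whole epoch
// value (about 1.77 billion seconds in February 2026), far beyond any
// alert delay.
static_assert(elapsed(1771286400u, 0u) == 1771286400u, "zero init looks stale");
```

The second assertion is why initializing the timestamp to the current time is the more robust choice: the watchdog no longer depends on a separate guard to ignore the zero value.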
Replace the flawed approach of resetting LastUDPPacketReceived in newCookieCallback (which masked real failures when users kept trying to log in) with a proper two-variable detection:
- PendingCookieReceived: set to 'now' on first cookie arrival (only if currently 0). Cleared to 0 when a UDP/QUIC packet is received.
- Watchdog triggers when PendingCookieReceived != 0 and stale.
- LastUDPPacketReceived is now only updated by actual packet receive.

This means the alert only fires when someone is actually trying to connect AND no packets have been received for the full delay period. Repeated login attempts don't mask the failure by resetting the timer.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
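One way to read the two-variable scheme is the sketch below. CommsWatchdog and its method names are hypothetical, and treating "stale" as (now - PendingCookieReceived) > delay is an assumption; the actual comparison in the FES may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the two-variable detection described above.
struct CommsWatchdog {
    uint32_t PendingCookieReceived = 0; // 0 means no login is waiting
    uint32_t LastUDPPacketReceived = 0;

    // A login cookie arrived: start the timer only if it is not running,
    // so repeated login attempts cannot push the deadline forward.
    void onCookie(uint32_t now) {
        if (PendingCookieReceived == 0) PendingCookieReceived = now;
    }

    // A UDP or QUIC packet was actually received: comms work, clear the timer.
    void onPacket(uint32_t now) {
        LastUDPPacketReceived = now;
        PendingCookieReceived = 0;
    }

    // Failure means someone has been waiting to connect for longer than
    // `delay` seconds without any packet arriving.
    bool commsFailure(uint32_t now, uint32_t delay) const {
        return PendingCookieReceived != 0 &&
               (now - PendingCookieReceived) > delay;
    }
};
```

An idle server never raises the alert in this model because no cookie starts the timer, and a burst of retries keeps the original start time rather than resetting it.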
The FS_UDP_ALIVE mechanism creates a deadlock with self-recovery: after socket rebind/QUIC restart, PendingCookieReceived is still stale so commsFailure stays true. The WS has UDPAlive=false and won't route new clients. No clients means no packets, so PendingCookieReceived never clears and FS_UDP_ALIVE(true) is never sent. The FES is permanently blocked.

The self-recovery (rebind/restart) is the actual fix. Remove:
- FS_UDP_ALIVE message sending from FES
- _UDPAlive member from CFrontEndService
- cbFESUDPAlive callback and UDPAlive flag from both WS
- UDPAlive check in findBestFES()

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot (AI) changed the title from "Fix FES becoming unreachable after months: socket recovery, QUIC restart, WS routing" to "Fix FES becoming unreachable after months: socket recovery and comms failure detection" on Feb 18, 2026.
The original catch block queued a RemoveClient event on recvfrom() failure, intending to disconnect the client that caused the error. However, on an unconnected UDP socket (serving all clients), recvfrom() errors don't carry valid source addresses: AddrFrom retains the stale value from the previous successful receive, so the wrong client would be disconnected.

With SO_RCVTIMEO, the timeout also triggers every second, flooding the queue with spurious disconnect events.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot (AI) changed the title from "Fix FES becoming unreachable after months: socket recovery and comms failure detection" to "Fix FES becoming unreachable after months of runtime" on Feb 18, 2026.
FES UDP/QUIC endpoints become permanently unreachable after months of runtime on Linux. The only recovery was a full process restart.

Root cause

The UDP recvfrom() blocked indefinitely with no timeout. If the kernel socket state degraded over months, the receive thread got stuck. Meanwhile, LastUDPPacketReceived was updated even on socket errors, masking the failure from the CheckUDPComm watchdog. The watchdog itself was cosmetic: it never triggered recovery or affected client routing.

Comms failure detection redesign

The old detection reset LastUDPPacketReceived on every login cookie arrival, so repeated login attempts masked real failures indefinitely. Replaced with a two-variable approach:
- PendingCookieReceived is set to the current time on first cookie arrival (only if currently 0) and cleared to 0 when a UDP/QUIC packet is received. The watchdog triggers when it is non-zero and stale.
- LastUDPPacketReceived is only updated by actual packet receipt.

UDP socket recovery
- SO_RCVTIMEO (1s) so the receive thread can service rebind/exit requests instead of blocking in recvfrom() forever
- rebindSocket() closes and recreates the socket on the same port, resetting kernel state

QUIC listener recovery
- CQuicTransceiver::restart() stops and restarts the listener on the same port

Removed stale RemoveClient on socket error

The original catch block queued a RemoveClient event on recvfrom() failure. On an unconnected UDP socket this is wrong: AddrFrom retains the stale value from the previous successful receive, so the wrong client would be disconnected. With SO_RCVTIMEO it would also fire every second.

Use-after-free fix

CFeSendSub::_DataSock stored a raw CUdpSock* copy. After rebindSocket() deletes and recreates the socket, it dangled. Changed to CUdpSock** pointing through CFEReceiveTask::DataSock.

Testing

Added simulateCommsFailure console command to trigger the full recovery chain on demand.