Fix FES becoming unreachable after months of runtime by Copilot · Pull Request #790 · ryzom/ryzomcore

Copilot · 2026-02-17T12:49:32Z

FES UDP/QUIC endpoints become permanently unreachable after months of runtime on Linux. The only recovery was a full process restart.

Root cause

The UDP recvfrom() blocked indefinitely with no timeout. If the kernel socket state degraded over months, the receive thread got stuck. Meanwhile, LastUDPPacketReceived was updated even on socket errors, masking the failure from the CheckUDPComm watchdog. The watchdog itself was cosmetic — it never triggered recovery or affected client routing.

Comms failure detection redesign

The old detection reset LastUDPPacketReceived on every login cookie arrival, so repeated login attempts masked real failures indefinitely. Replaced with a two-variable approach:

// Cookie arrival — starts a timer
if (PendingCookieReceived == 0)
    PendingCookieReceived = now;

// Successful UDP/QUIC receive — proves comms work, clears pending state
LastUDPPacketReceived = now;
PendingCookieReceived = 0;

// Watchdog — triggers only when a cookie has been waiting without any packet arriving
commsFailure = PendingCookieReceived != 0 && (now - PendingCookieReceived > 600s);
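
The snippet above is pseudocode. A minimal self-contained sketch of the same two-variable logic could look like the following. The `CommsWatchdog` struct, its method names, and the fixed 600 second delay are assumptions made for illustration, not the actual FES code.

```cpp
#include <cstdint>

// Hypothetical, simplified model of the two-variable detection.
// Times are seconds since the Unix epoch, stored as uint32 like the original.
struct CommsWatchdog
{
    uint32_t LastUDPPacketReceived = 0;
    uint32_t PendingCookieReceived = 0; // 0 means "no login attempt is waiting"
    uint32_t AlertDelay = 600;          // assumed value, taken from the 600s above

    // A login cookie arrived: start the timer only if none is running,
    // so repeated login attempts cannot push the deadline back.
    void onCookie(uint32_t now)
    {
        if (PendingCookieReceived == 0)
            PendingCookieReceived = now;
    }

    // A UDP or QUIC packet was received successfully: comms work.
    void onPacket(uint32_t now)
    {
        LastUDPPacketReceived = now;
        PendingCookieReceived = 0;
    }

    // True only when a cookie has been waiting longer than AlertDelay
    // without any packet arriving in the meantime.
    bool commsFailure(uint32_t now) const
    {
        return PendingCookieReceived != 0
            && (now - PendingCookieReceived) > AlertDelay;
    }
};
```

With this shape, a second `onCookie()` call while a cookie is already pending leaves the timer untouched, which is the property the old single-variable detection lacked.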

UDP socket recovery

- Set SO_RCVTIMEO (1s) so the receive thread can service rebind/exit requests instead of blocking in recvfrom() forever
- rebindSocket() closes and recreates the socket on the same port, resetting kernel state
- Triggered automatically when the watchdog fires
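
As a rough illustration of the timeout and rebind steps, here is a POSIX-only sketch using plain file descriptors. The names `setRecvTimeout()` and `rebindSocket()` are borrowed from the PR, but the free-function signatures, `openUdpSocket()`, and the error handling are assumptions; the real code works through CUdpSock and also has a platform-specific Windows path.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>
#include <cerrno>
#include <cstdint>
#include <cstdio>

// Sets SO_RCVTIMEO so a blocking recvfrom() returns after `seconds`
// even when no datagram arrives. Logs a warning and returns false on failure.
bool setRecvTimeout(int fd, int seconds)
{
    timeval tv{};
    tv.tv_sec = seconds;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0)
    {
        std::fprintf(stderr, "warning: SO_RCVTIMEO failed (errno %d)\n", errno);
        return false;
    }
    return true;
}

// Hypothetical helper: opens a UDP socket bound to `port` on all interfaces
// (0 picks an ephemeral port). Returns the descriptor, or -1 on failure.
int openUdpSocket(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<const sockaddr *>(&addr), sizeof(addr)) != 0)
    {
        close(fd);
        return -1;
    }
    setRecvTimeout(fd, 1); // a failure here is logged but not fatal
    return fd;
}

// Closes the old socket and creates a fresh one on the same port,
// which discards whatever kernel state the old descriptor carried.
int rebindSocket(int oldFd, uint16_t port)
{
    close(oldFd);
    return openUdpSocket(port);
}
```

On Linux, a timed-out `recvfrom()` returns -1 with `errno` set to `EAGAIN` or `EWOULDBLOCK`, so the receive loop gets a chance to check for exit or rebind requests roughly once per second.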

QUIC listener recovery

- CQuicTransceiver::restart() stops and re-starts the listener on the same port
- Independent health check in the update loop, throttled to the watchdog interval
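
The throttled health check can be sketched without any msquic calls by using a stand-in listener. `FakeListener` and `ListenerHealthCheck` are hypothetical types introduced only for this example; the real CQuicTransceiver wraps msquic and is not shown here.

```cpp
#include <cstdint>

// Hypothetical stand-in for the QUIC transceiver.
struct FakeListener
{
    bool Listening = false;
    uint16_t ListenPort = 0; // remembered so restart() can reuse it
    int StartCount = 0;

    void start(uint16_t port)
    {
        ListenPort = port;
        Listening = true;
        ++StartCount;
    }

    void stop() { Listening = false; }

    // Stop, then start again on the port stored by the last start().
    void restart()
    {
        stop();
        start(ListenPort);
    }
};

// Checks listener health at most once per Interval seconds.
struct ListenerHealthCheck
{
    uint32_t Interval;  // seconds between checks (the watchdog interval)
    uint32_t LastCheck; // time of the previous check

    // Called from the update loop. Returns true if a restart was performed.
    bool update(FakeListener &listener, uint32_t now)
    {
        if (now - LastCheck < Interval)
            return false; // throttled: too soon since the last check
        LastCheck = now;
        if (listener.Listening)
            return false; // healthy, nothing to do
        listener.restart();
        return true;
    }
};
```

Throttling keeps the check cheap in a loop that runs every tick, and also avoids restarting the listener in a tight loop if it keeps failing.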

Removed stale `RemoveClient` on socket error

The original catch block queued a RemoveClient event on recvfrom() failure. On an unconnected UDP socket this is wrong — AddrFrom retains the stale value from the previous successful receive, disconnecting the wrong client. With SO_RCVTIMEO it would also fire every second.
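
One way to picture the corrected behaviour is to classify each recvfrom() result before anything touches the sender address. The enum and both functions below are hypothetical names for illustration; the actual receive loop in the PR simply continues on error.

```cpp
#include <cerrno>
#include <sys/types.h>

// Hypothetical classification of a recvfrom() result on an unconnected UDP socket.
enum class RecvOutcome { Packet, Timeout, SocketError };

// n is the recvfrom() return value, err is errno captured right after the call.
RecvOutcome classifyRecv(ssize_t n, int err)
{
    if (n >= 0)
        return RecvOutcome::Packet;  // the source address is valid only here
    if (err == EAGAIN || err == EWOULDBLOCK)
        return RecvOutcome::Timeout; // SO_RCVTIMEO expired, nothing to report
    return RecvOutcome::SocketError; // worth logging, but tied to no client
}

// Only a real packet may produce an event keyed on the sender address.
// After a failed call the address buffer still holds the previous sender.
bool mayUseSenderAddress(RecvOutcome outcome)
{
    return outcome == RecvOutcome::Packet;
}
```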

Use-after-free fix

CFeSendSub::_DataSock stored a raw CUdpSock* copy. After rebindSocket() deletes and recreates the socket, it dangled. Changed to CUdpSock** pointing through CFEReceiveTask::DataSock.
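
The pointer-to-pointer fix can be shown with a reduced model. `Sock`, `ReceiveTask`, and `SendSub` below are hypothetical stand-ins for CUdpSock, CFEReceiveTask, and CFeSendSub, and the `Id` field exists only so the example can tell the old socket from the new one.

```cpp
// Hypothetical stand-in for CUdpSock.
struct Sock
{
    int Id;
};

// Hypothetical stand-in for the receive task that owns the socket.
struct ReceiveTask
{
    Sock *DataSock;

    explicit ReceiveTask(int id) : DataSock(new Sock{id}) {}
    ~ReceiveTask() { delete DataSock; }
    ReceiveTask(const ReceiveTask &) = delete;
    ReceiveTask &operator=(const ReceiveTask &) = delete;

    // Deletes the old socket and creates a new one, as a rebind would.
    void rebindSocket()
    {
        int nextId = DataSock->Id + 1;
        delete DataSock;
        DataSock = new Sock{nextId};
    }
};

// Hypothetical stand-in for the send subsystem. It stores the address of the
// owner's pointer rather than a copy of it, so it never sees a deleted socket.
struct SendSub
{
    Sock **_DataSock;

    Sock *sock() const { return *_DataSock; }
};
```

Had `SendSub` stored a plain `Sock*` copied at init time, that copy would still point at the deleted object after `rebindSocket()`. The extra indirection costs one pointer dereference per send. It does not by itself make a rebind safe against a send running concurrently on another thread; that would need separate synchronisation, which this sketch does not cover.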

Testing

Added simulateCommsFailure console command to trigger the full recovery chain on demand.

Original prompt

This section details the original issue you should resolve

<issue_title>FES tends to become unreachable and show CheckUdpComms after running for months with few connections</issue_title>
<issue_description>FES tends to become unreachable and show CheckUdpComms after running for months with few connections. Needs to reboot to allow shard connections again.</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #789 (FES tends to become unreachable and show CheckUdpComms after running for months with few connections)


When FES runs for months with few connections, LastUDPPacketReceived becomes stale after clients disconnect. Previously, newCookieCallback only reset it when it was 0 (first connection ever), so subsequent connection attempts after long idle periods would immediately trigger the CheckUDPComm alert because (now - staleTimestamp) > DelayBeforeUPDAlert.

Fix by:
1. Always resetting LastUDPPacketReceived in newCookieCallback when a new connection attempt arrives, not just on the first-ever connection.
2. Updating LastUDPPacketReceived when QUIC datagrams are received, so QUIC-only traffic also prevents the false alert.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

…clients

When CheckUDPComm triggers, FES sends FS_UDP_ALIVE(false) to WS so it stops routing new clients to the broken FES. When UDP communication resumes, FES sends FS_UDP_ALIVE(true) to allow routing again. Previously, CheckUDPComm was purely informational and did not affect client routing. The WS would keep sending clients to a FES with broken UDP, causing the shard to be effectively unreachable.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

Previously, LastUDPPacketReceived was updated on every iteration of the receive loop, including socket errors. This masked real socket failures because the timestamp appeared fresh, preventing CheckUDPComm from triggering and the FS_UDP_ALIVE(false) notification from being sent to the WS.

If recvfrom() enters a persistent error state (e.g., due to a Linux kernel socket issue after months of runtime), the receive thread spins on errors but the timestamp keeps getting updated, so the WS continues routing clients to the broken FES, making the shard unreachable.

Moving the timestamp update inside the try block ensures it is only set on successful packet receipt. Socket errors will cause the timestamp to go stale, allowing CheckUDPComm to detect the failure and trigger the FS_UDP_ALIVE notification to stop client routing.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

… failure

The root problem is that the FES UDP socket can enter a bad kernel state after months of runtime on Linux, and there's no recovery mechanism. The recvfrom() call blocks forever with no timeout, and once the socket fails, only a full FES restart can fix it.

Fix by:
1. Setting SO_RCVTIMEO (1 second) on the UDP socket so recvfrom() periodically returns even with no traffic. This prevents the receive thread from getting permanently stuck and allows it to check for exit/rebind requests.
2. Adding a rebindSocket() mechanism that closes and recreates the UDP socket on the same port, resetting any corrupted kernel state. This is triggered when CheckUDPComm detects a persistent failure (no UDP packets received within DelayBeforeUPDAlert).
3. Changing the socket error catch in run() to continue instead of queuing a RemoveClient event. Timeout errors (EAGAIN) from SO_RCVTIMEO should not generate events, and real errors on an unconnected UDP socket have no valid address to use anyway.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

Address code review feedback:
- Extract SO_RCVTIMEO configuration into setRecvTimeout() to avoid duplicating the platform-specific code in constructor and rebindSocket()
- Check setsockopt return value and log a warning if it fails

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

QUIC uses msquic which manages its own sockets internally, so the UDP socket rebind doesn't help QUIC. The QUIC listener can stop unexpectedly (e.g., due to internal msquic error, TLS certificate issues) with no recovery mechanism.

Fix by:
1. Adding restart() method to CQuicTransceiver that calls stop() then start() on the same port, resetting the listener state.
2. Storing m_ListenPort in CQuicTransceiver for restart.
3. When CheckUDPComm triggers, also restart QUIC if it stopped.
4. Independently monitoring QUIC listener health in the update loop, throttled to check once per DelayBeforeUPDAlert period.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

…g after rebind

The send subsystem stored a raw pointer copy of DataSock at init time. When rebindSocket() deletes the old socket and creates a new one, this pointer becomes dangling, causing use-after-free when sending. Fix by changing _DataSock from CUdpSock* to CUdpSock** (pointer to the CFEReceiveTask::DataSock member), so it always dereferences to the current socket after a rebind.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

… at startup

The detection logic uses uint32 seconds since Unix epoch (1970):
- CTime::getSecondsSince1970() returns uint32(time(NULL))
- Overflows on February 7, 2106 (~80 years from now)
- Unsigned subtraction (now - LastUDPPacketReceived) wraps correctly for elapsed time measurement across the overflow boundary

Previously LastUDPPacketReceived was initialized to 0, making (now - 0) equal the full epoch time (~1.77 billion). This avoided a false trigger only because of the outer guard conditions. Now initialized to current time in the constructor for robustness.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
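
The wraparound claim in this commit message can be checked with a few lines of standard C++. `elapsedSeconds()` is a hypothetical helper written for this example, and the sample timestamps are arbitrary.

```cpp
#include <cstdint>

// Elapsed seconds between two uint32 timestamps. Unsigned subtraction is
// performed modulo 2^32, so the result stays correct when `now` has wrapped
// past the overflow boundary and `earlier` has not.
constexpr uint32_t elapsedSeconds(uint32_t now, uint32_t earlier)
{
    return now - earlier;
}

// 5 seconds before the wrap to 5 seconds after it: 10 seconds elapsed.
static_assert(elapsedSeconds(5u, 0xFFFFFFFBu) == 10u,
              "difference survives the uint32 overflow");

// A timestamp initialised to 0 makes the elapsed time equal to the full
// epoch value, far above any reasonable alert delay.
static_assert(elapsedSeconds(1771000000u, 0u) > 600u,
              "zero-initialised timestamp looks extremely stale");
```

The result is only meaningful while the true elapsed time is below 2^32 seconds (about 136 years), which is not a practical concern for a watchdog interval.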

Replace the flawed approach of resetting LastUDPPacketReceived in newCookieCallback (which masked real failures when users kept trying to log in) with a proper two-variable detection:
- PendingCookieReceived: set to 'now' on first cookie arrival (only if currently 0). Cleared to 0 when a UDP/QUIC packet is received.
- Watchdog triggers when PendingCookieReceived != 0 and stale.
- LastUDPPacketReceived is now only updated by actual packet receive.

This means the alert only fires when someone is actually trying to connect AND no packets have been received for the full delay period. Repeated login attempts don't mask the failure by resetting the timer.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

The FS_UDP_ALIVE mechanism creates a deadlock with self-recovery: after socket rebind/QUIC restart, PendingCookieReceived is still stale so commsFailure stays true. The WS has UDPAlive=false and won't route new clients. No clients means no packets, so PendingCookieReceived never clears and FS_UDP_ALIVE(true) is never sent. The FES is permanently blocked.

The self-recovery (rebind/restart) is the actual fix. Remove:
- FS_UDP_ALIVE message sending from FES
- _UDPAlive member from CFrontEndService
- cbFESUDPAlive callback and UDPAlive flag from both WS
- UDPAlive check in findBestFES()

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

The original catch block queued a RemoveClient event on recvfrom() failure, intending to disconnect the client that caused the error. However, on an unconnected UDP socket (serving all clients), recvfrom() errors don't carry valid source addresses - AddrFrom retains the stale value from the previous successful receive, so the wrong client would be disconnected. With SO_RCVTIMEO, the timeout also triggers every second, flooding the queue with spurious disconnect events.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

Initial plan

c3002ba

Copilot AI assigned Copilot and kaetemi Feb 17, 2026

Copilot started work on behalf of kaetemi February 17, 2026 12:49 View session

Copilot AI changed the title ~~[WIP] Fix FES unreliability after prolonged uptime~~ Fix FES CheckUDPComm false alert after long idle periods Feb 17, 2026

Copilot AI requested a review from kaetemi February 17, 2026 13:03

Copilot finished work on behalf of kaetemi February 17, 2026 13:03

Copilot started work on behalf of kaetemi February 17, 2026 13:24 View session

Copilot AI changed the title ~~Fix FES CheckUDPComm false alert after long idle periods~~ Fix FES becoming unreachable due to CheckUDPComm not affecting client routing Feb 17, 2026

Copilot finished work on behalf of kaetemi February 17, 2026 13:41

Copilot started work on behalf of kaetemi February 17, 2026 14:43 View session

Copilot AI changed the title ~~Fix FES becoming unreachable due to CheckUDPComm not affecting client routing~~ Fix FES becoming unreachable after months by properly detecting UDP socket failures Feb 17, 2026

Copilot finished work on behalf of kaetemi February 17, 2026 15:04

Copilot started work on behalf of kaetemi February 17, 2026 15:09 View session

Copilot AI and others added 2 commits February 17, 2026 15:18

Copilot AI changed the title ~~Fix FES becoming unreachable after months by properly detecting UDP socket failures~~ Fix FES becoming unreachable after months: socket recovery and WS routing Feb 17, 2026

Copilot finished work on behalf of kaetemi February 17, 2026 15:22

Copilot started work on behalf of kaetemi February 17, 2026 15:31 View session

Copilot AI changed the title ~~Fix FES becoming unreachable after months: socket recovery and WS routing~~ Fix FES becoming unreachable after months: socket recovery, QUIC restart, and WS routing Feb 17, 2026

Copilot finished work on behalf of kaetemi February 17, 2026 15:41

Copilot started work on behalf of kaetemi February 18, 2026 12:56 View session

Copilot AI and others added 2 commits February 18, 2026 13:01

Add simulateCommsFailure command for testing recovery mechanism

0cb6362

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>

Copilot AI changed the title ~~Fix FES becoming unreachable after months: socket recovery, QUIC restart, and WS routing~~ Fix FES becoming unreachable after months: socket recovery, QUIC restart, WS routing Feb 18, 2026

Copilot finished work on behalf of kaetemi February 18, 2026 13:07

Copilot started work on behalf of kaetemi February 18, 2026 13:37 View session

Copilot finished work on behalf of kaetemi February 18, 2026 13:46

Copilot started work on behalf of kaetemi February 18, 2026 14:03 View session

Copilot finished work on behalf of kaetemi February 18, 2026 14:10

Copilot started work on behalf of kaetemi February 18, 2026 14:26 View session

Copilot AI changed the title ~~Fix FES becoming unreachable after months: socket recovery, QUIC restart, WS routing~~ Fix FES becoming unreachable after months: socket recovery and comms failure detection Feb 18, 2026

Copilot finished work on behalf of kaetemi February 18, 2026 14:35

Copilot started work on behalf of kaetemi February 18, 2026 15:06 View session

Copilot AI changed the title ~~Fix FES becoming unreachable after months: socket recovery and comms failure detection~~ Fix FES becoming unreachable after months of runtime Feb 18, 2026

Copilot finished work on behalf of kaetemi February 18, 2026 15:14

Copilot wants to merge 13 commits into core4 from copilot/fix-fes-unreachable-issue
