Fix FES becoming unreachable after months of runtime #790

Draft
Copilot wants to merge 13 commits into core4 from copilot/fix-fes-unreachable-issue
Copilot AI commented Feb 17, 2026

FES UDP/QUIC endpoints become permanently unreachable after months of runtime on Linux. The only recovery was a full process restart.

Root cause

The UDP recvfrom() call blocked indefinitely with no timeout. If the kernel socket state degraded over months, the receive thread got stuck. Meanwhile, LastUDPPacketReceived was updated even on socket errors, masking the failure from the CheckUDPComm watchdog. The watchdog itself was cosmetic: it never triggered recovery or affected client routing.

Comms failure detection redesign

The old detection reset LastUDPPacketReceived on every login cookie arrival, so repeated login attempts masked real failures indefinitely. This is replaced with a two-variable approach:

// Cookie arrival — starts a timer
if (PendingCookieReceived == 0)
    PendingCookieReceived = now;

// Successful UDP/QUIC receive — proves comms work, clears pending state
LastUDPPacketReceived = now;
PendingCookieReceived = 0;

// Watchdog — triggers only when a cookie has been waiting without any packet arriving
commsFailure = PendingCookieReceived != 0 && (now - PendingCookieReceived > 600s);
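
As a minimal, self-contained sketch of the two-variable logic above (not the FES code itself): the struct name CommsWatchdog, its method names, and the AlertDelay field are hypothetical, and the 600-second delay is simply taken from the snippet above.

```cpp
#include <cstdint>

// Hypothetical standalone model of the two-variable detection. The two
// timestamp names mirror the PR description; everything else is assumed.
struct CommsWatchdog
{
    uint32_t LastUDPPacketReceived = 0; // seconds, set only on real receives
    uint32_t PendingCookieReceived = 0; // 0 means "no login is waiting"
    uint32_t AlertDelay = 600;          // seconds, taken from the 600s above

    // A login cookie arrived: start the timer only if it is not running yet,
    // so repeated login attempts cannot push the deadline forward.
    void onCookie(uint32_t now)
    {
        if (PendingCookieReceived == 0)
            PendingCookieReceived = now;
    }

    // A UDP or QUIC packet was received successfully: comms are proven to work.
    void onPacket(uint32_t now)
    {
        LastUDPPacketReceived = now;
        PendingCookieReceived = 0;
    }

    // True when a cookie has waited longer than AlertDelay without any packet.
    bool commsFailure(uint32_t now) const
    {
        return PendingCookieReceived != 0
            && (now - PendingCookieReceived) > AlertDelay;
    }
};
```

With this model, a second onCookie() call while a cookie is already pending leaves the timer untouched, which is the property that stops repeated logins from masking a failure.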

UDP socket recovery

  • Set SO_RCVTIMEO (1s) so the receive thread can service rebind/exit requests instead of blocking in recvfrom() forever
  • rebindSocket() closes and recreates the socket on the same port, resetting kernel state
  • The rebind is triggered automatically when the watchdog fires
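
The receive-timeout and rebind ideas in the list above can be sketched with plain POSIX sockets. This is an illustration under assumptions, not the CUdpSock implementation: the helper names (setRecvTimeoutMs, openLoopbackUdp, rebindLoopbackUdp, receiveOnce) are hypothetical, the socket is bound to loopback for simplicity, and the code is POSIX-only (on Windows, SO_RCVTIMEO takes a DWORD in milliseconds).

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

// Sets SO_RCVTIMEO so recvfrom() returns periodically. True on success.
bool setRecvTimeoutMs(int fd, long ms)
{
    timeval tv;
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
}

// Opens a UDP socket bound to 127.0.0.1 (port 0 = any free port).
// Returns the descriptor, or -1 on failure.
int openLoopbackUdp(unsigned short port, long timeoutMs)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0
        || !setRecvTimeoutMs(fd, timeoutMs))
    {
        close(fd);
        return -1;
    }
    return fd;
}

// Close and recreate on the same port: the idea behind rebindSocket().
int rebindLoopbackUdp(int oldFd, unsigned short port, long timeoutMs)
{
    close(oldFd);
    return openLoopbackUdp(port, timeoutMs);
}

enum class RecvResult { Packet, Timeout, Error };

// One receive attempt. A timeout is reported separately from a real error,
// so the caller can check exit/rebind flags without queuing any event.
RecvResult receiveOnce(int fd, char *buf, size_t len, ssize_t &received)
{
    sockaddr_in from{};
    socklen_t fromLen = sizeof(from);
    received = recvfrom(fd, buf, len, 0,
                        reinterpret_cast<sockaddr *>(&from), &fromLen);
    if (received >= 0)
        return RecvResult::Packet;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return RecvResult::Timeout;
    return RecvResult::Error;
}
```

A receive loop built on receiveOnce() would update the last-packet timestamp only on RecvResult::Packet and would use the Timeout result as its chance to honor a pending rebind or exit request.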

QUIC listener recovery

  • CQuicTransceiver::restart() stops and restarts the listener on the same port
  • An independent health check in the update loop, throttled to the watchdog interval, restarts the listener if it has stopped
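
One way to picture the throttled health check is the sketch below. It does not use msquic or CQuicTransceiver: FakeListener and ThrottledHealthCheck are hypothetical stand-ins, and the interval is whatever the caller passes in.

```cpp
#include <cstdint>

// Hypothetical stand-in for a listener object that can stop unexpectedly.
struct FakeListener
{
    bool listening = true;
    int restarts = 0;

    // Stand-in for a stop-then-start restart on the same port.
    void restart()
    {
        listening = true;
        ++restarts;
    }
};

// Checks listener health at most once per interval (in seconds).
struct ThrottledHealthCheck
{
    uint32_t interval;
    uint32_t lastCheck;

    ThrottledHealthCheck(uint32_t intervalSeconds, uint32_t startTime)
        : interval(intervalSeconds), lastCheck(startTime)
    {
    }

    // Called from the update loop. Returns true if a restart was issued.
    bool update(uint32_t now, FakeListener &listener)
    {
        if (now - lastCheck < interval)
            return false; // throttled: too soon since the previous check
        lastCheck = now;
        if (listener.listening)
            return false; // healthy, nothing to do
        listener.restart();
        return true;
    }
};
```

Throttling keeps a listener that fails to start from being restarted on every update tick.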

Removed stale RemoveClient on socket error

The original catch block queued a RemoveClient event on recvfrom() failure. On an unconnected UDP socket this is wrong: AddrFrom retains the stale value from the previous successful receive, so the wrong client would be disconnected. With SO_RCVTIMEO set, the event would also fire every second.

Use-after-free fix

CFeSendSub::_DataSock stored a raw CUdpSock* copy. After rebindSocket() deletes and recreates the socket, it dangled. Changed to CUdpSock** pointing through CFEReceiveTask::DataSock.
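
The pointer-to-pointer change can be illustrated with a small model. Sock, Receiver, and Sender below are hypothetical stand-ins for CUdpSock, CFEReceiveTask, and CFeSendSub; only the member names DataSock and _DataSock come from the description above, and thread synchronization is deliberately left out of the sketch.

```cpp
// Hypothetical stand-in for the socket class.
struct Sock
{
    int id;
};

// Owns the socket and may replace it, like the receive task on rebind.
struct Receiver
{
    Sock *DataSock;

    explicit Receiver(int id) : DataSock(new Sock{id}) {}
    ~Receiver() { delete DataSock; }
    Receiver(const Receiver &) = delete;
    Receiver &operator=(const Receiver &) = delete;

    // Deletes the old socket and creates a new one. Any raw Sock* copied
    // before this call would now dangle.
    void rebind(int newId)
    {
        delete DataSock;
        DataSock = new Sock{newId};
    }
};

// Holds a pointer to the owner's pointer, so it always sees the current socket.
struct Sender
{
    Sock **_DataSock;

    explicit Sender(Sock **sock) : _DataSock(sock) {}

    int currentId() const { return (*_DataSock)->id; }
};
```

The extra indirection only fixes the stale copy. If sending and rebinding can run on different threads, access to the shared pointer would still need its own synchronization, which this sketch does not show.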

Testing

Added simulateCommsFailure console command to trigger the full recovery chain on demand.

Original prompt

This section details the original issue you should resolve

<issue_title>FES tends to become unreachable and show CheckUdpComms after running for months with few connections</issue_title>
<issue_description>FES tends to become unreachable and show CheckUdpComms after running for months with few connections. Needs to reboot to allow shard connections again.</issue_description>

Comments on the Issue (you are @copilot in this section)

When FES runs for months with few connections, LastUDPPacketReceived
becomes stale after clients disconnect. Previously, newCookieCallback
only reset it when it was 0 (first connection ever), so subsequent
connection attempts after long idle periods would immediately trigger
the CheckUDPComm alert because (now - staleTimestamp) > DelayBeforeUPDAlert.

Fix by:
1. Always resetting LastUDPPacketReceived in newCookieCallback when a
   new connection attempt arrives, not just on the first-ever connection.
2. Updating LastUDPPacketReceived when QUIC datagrams are received, so
   QUIC-only traffic also prevents the false alert.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix FES unreliability after prolonged uptime" to "Fix FES CheckUDPComm false alert after long idle periods" on Feb 17, 2026
Copilot AI requested a review from kaetemi February 17, 2026 13:03
…clients

When CheckUDPComm triggers, FES sends FS_UDP_ALIVE(false) to WS so it
stops routing new clients to the broken FES. When UDP communication
resumes, FES sends FS_UDP_ALIVE(true) to allow routing again.

Previously, CheckUDPComm was purely informational and did not affect
client routing. The WS would keep sending clients to a FES with broken
UDP, causing the shard to be effectively unreachable.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot AI changed the title from "Fix FES CheckUDPComm false alert after long idle periods" to "Fix FES becoming unreachable due to CheckUDPComm not affecting client routing" on Feb 17, 2026
Previously, LastUDPPacketReceived was updated on every iteration of
the receive loop, including socket errors. This masked real socket
failures because the timestamp appeared fresh, preventing CheckUDPComm
from triggering and the FS_UDP_ALIVE(false) notification from being
sent to the WS.

If recvfrom() enters a persistent error state (e.g., due to a Linux
kernel socket issue after months of runtime), the receive thread spins
on errors but the timestamp keeps getting updated, so the WS continues
routing clients to the broken FES, making the shard unreachable.

Moving the timestamp update inside the try block ensures it is only
set on successful packet receipt. Socket errors will cause the
timestamp to go stale, allowing CheckUDPComm to detect the failure
and trigger the FS_UDP_ALIVE notification to stop client routing.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot AI changed the title from "Fix FES becoming unreachable due to CheckUDPComm not affecting client routing" to "Fix FES becoming unreachable after months by properly detecting UDP socket failures" on Feb 17, 2026
Copilot AI and others added 2 commits February 17, 2026 15:18
… failure

The root problem is that the FES UDP socket can enter a bad kernel
state after months of runtime on Linux, and there's no recovery
mechanism. The recvfrom() call blocks forever with no timeout, and
once the socket fails, only a full FES restart can fix it.

Fix by:
1. Setting SO_RCVTIMEO (1 second) on the UDP socket so recvfrom()
   periodically returns even with no traffic. This prevents the
   receive thread from getting permanently stuck and allows it to
   check for exit/rebind requests.

2. Adding a rebindSocket() mechanism that closes and recreates the
   UDP socket on the same port, resetting any corrupted kernel state.
   This is triggered when CheckUDPComm detects a persistent failure
   (no UDP packets received within DelayBeforeUPDAlert).

3. Changing the socket error catch in run() to continue instead of
   queuing a RemoveClient event. Timeout errors (EAGAIN) from
   SO_RCVTIMEO should not generate events, and real errors on an
   unconnected UDP socket have no valid address to use anyway.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Address code review feedback:
- Extract SO_RCVTIMEO configuration into setRecvTimeout() to avoid
  duplicating the platform-specific code in constructor and rebindSocket()
- Check setsockopt return value and log a warning if it fails

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot AI changed the title from "Fix FES becoming unreachable after months by properly detecting UDP socket failures" to "Fix FES becoming unreachable after months: socket recovery and WS routing" on Feb 17, 2026
QUIC uses msquic which manages its own sockets internally, so the
UDP socket rebind doesn't help QUIC. The QUIC listener can stop
unexpectedly (e.g., due to internal msquic error, TLS certificate
issues) with no recovery mechanism.

Fix by:
1. Adding restart() method to CQuicTransceiver that calls stop()
   then start() on the same port, resetting the listener state.
2. Storing m_ListenPort in CQuicTransceiver for restart.
3. When CheckUDPComm triggers, also restart QUIC if it stopped.
4. Independently monitoring QUIC listener health in the update
   loop, throttled to check once per DelayBeforeUPDAlert period.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot AI changed the title from "Fix FES becoming unreachable after months: socket recovery and WS routing" to "Fix FES becoming unreachable after months: socket recovery, QUIC restart, and WS routing" on Feb 17, 2026
Copilot AI and others added 2 commits February 18, 2026 13:01
Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
…g after rebind

The send subsystem stored a raw pointer copy of DataSock at init time.
When rebindSocket() deletes the old socket and creates a new one,
this pointer becomes dangling, causing use-after-free when sending.

Fix by changing _DataSock from CUdpSock* to CUdpSock** (pointer to
the CFEReceiveTask::DataSock member), so it always dereferences to
the current socket after a rebind.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot AI changed the title from "Fix FES becoming unreachable after months: socket recovery, QUIC restart, and WS routing" to "Fix FES becoming unreachable after months: socket recovery, QUIC restart, WS routing" on Feb 18, 2026
… at startup

The detection logic uses uint32 seconds since Unix epoch (1970):
- CTime::getSecondsSince1970() returns uint32(time(NULL))
- Overflows on February 7, 2106 (~80 years from now)
- Unsigned subtraction (now - LastUDPPacketReceived) wraps correctly
  for elapsed time measurement across the overflow boundary

Previously, LastUDPPacketReceived was initialized to 0, making
(now - 0) equal the full epoch time (~1.77 billion). A false trigger
was avoided only because of the outer guard conditions. It is now
initialized to the current time in the constructor for robustness.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
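
The wraparound claim in the commit message above can be checked with a few lines of unsigned arithmetic. The helper name elapsedSeconds is hypothetical; the only assumption is that timestamps are uint32 seconds, as the commit message states.

```cpp
#include <cstdint>

// Elapsed seconds between two uint32 timestamps. Unsigned subtraction is
// performed modulo 2^32, so the result stays correct when 'now' has wrapped
// past zero while 'then' was taken before the wrap.
uint32_t elapsedSeconds(uint32_t now, uint32_t then)
{
    return static_cast<uint32_t>(now - then);
}
```

For example, with then = 0xFFFFFF00 (256 seconds before the wrap) and now = 100, the function returns 356. With then = 0 and a present-day now, it returns the full epoch time, which is why a zero-initialized timestamp looks extremely stale.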
Replace the flawed approach of resetting LastUDPPacketReceived in
newCookieCallback (which masked real failures when users kept trying
to log in) with a proper two-variable detection:

- PendingCookieReceived: set to 'now' on first cookie arrival (only
  if currently 0). Cleared to 0 when a UDP/QUIC packet is received.
- Watchdog triggers when PendingCookieReceived != 0 and stale.
- LastUDPPacketReceived is now only updated by actual packet receive.

This means the alert only fires when someone is actually trying to
connect AND no packets have been received for the full delay period.
Repeated login attempts don't mask the failure by resetting the timer.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
The FS_UDP_ALIVE mechanism creates a deadlock with self-recovery:
after socket rebind/QUIC restart, PendingCookieReceived is still
stale so commsFailure stays true. The WS has UDPAlive=false and
won't route new clients. No clients means no packets, so
PendingCookieReceived never clears and FS_UDP_ALIVE(true) is
never sent. The FES is permanently blocked.

The self-recovery (rebind/restart) is the actual fix. Remove:
- FS_UDP_ALIVE message sending from FES
- _UDPAlive member from CFrontEndService
- cbFESUDPAlive callback and UDPAlive flag from both WS
- UDPAlive check in findBestFES()

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot AI changed the title from "Fix FES becoming unreachable after months: socket recovery, QUIC restart, WS routing" to "Fix FES becoming unreachable after months: socket recovery and comms failure detection" on Feb 18, 2026
The original catch block queued a RemoveClient event on recvfrom()
failure, intending to disconnect the client that caused the error.
However, on an unconnected UDP socket (serving all clients),
recvfrom() errors don't carry valid source addresses - AddrFrom
retains the stale value from the previous successful receive, so
the wrong client would be disconnected. With SO_RCVTIMEO, the
timeout also triggers every second, flooding the queue with
spurious disconnect events.

Co-authored-by: kaetemi <1581053+kaetemi@users.noreply.github.com>
Copilot AI changed the title from "Fix FES becoming unreachable after months: socket recovery and comms failure detection" to "Fix FES becoming unreachable after months of runtime" on Feb 18, 2026
Development

Successfully merging this pull request may close these issues.

FES tends to become unreachable and show CheckUdpComms after running for months with few connections