Resolve 30-second throughput ramp-up issue for TSO-enabled
TCP connections by fixing multiple congestion control bugs
and implementing TSO-aware initial window and RTO recovery
parameters.
Problem:
Applications using TSO experienced ~30 seconds of near-zero
throughput before achieving line-rate. Investigation revealed
multiple interrelated bugs in XLIO's congestion control
implementation:
1. ssthresh Reset Bug: ssthresh was unconditionally reset to
10*MSS (14,600 bytes) during SYN-ACK processing in
tcp_in.c, forcing TCP into congestion avoidance mode
immediately instead of slow start.
2. Slow Start Algorithm Bug: Both LWIP and CUBIC modules
grew cwnd by one MSS per ACK (cwnd += mss) instead of by the
bytes acknowledged (cwnd += acked), so slow start lost the
exponential growth that RFC 5681 intends. This is
particularly devastating for TSO, where one ACK can
acknowledge 64KB (about 44 segments) while cwnd grew by only
1 MSS (1460 bytes).
3. RTO Recovery Bug: On retransmission timeout, cwnd was
reset to 1 MSS per RFC 5681, which is too conservative for
modern TSO hardware that sends 256KB super-packets, causing
20+ second recovery times.
Root Cause Analysis:
XLIO's CUBIC implementation is based on FreeBSD CUBIC
(2007-2010), which contained the slow start bug. Modern CUBIC
(Linux kernel 2024, RFC 9438) uses standard TCP slow start
with exponential growth. Verification against Linux kernel
source (net/ipv4/tcp_cubic.c) confirmed that XLIO's behavior
was incorrect and outdated.
Solution:
1. Centralized Initial Window Management:
Created tcp_set_initial_cwnd_ssthresh() to consistently
set TSO-aware parameters across all connection
initialization paths:
For TSO-enabled connections:
- cwnd = TSO_max_payload / 4 (64KB with default 256KB)
- ssthresh = 0x7FFFFFFF (2GB - effectively unlimited)
For non-TSO connections:
- cwnd = RFC 3390: min(4*MSS, max(2*MSS, 4380 bytes))
- ssthresh = 10 * MSS
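The selection rule above can be sketched in C as follows. This is
only an illustration under stated assumptions, not the body of
tcp_set_initial_cwnd_ssthresh(): the struct, the function name
sketch_initial_window() and its parameters are hypothetical, and
only the formulas are taken from this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant mirroring ssthresh = 0x7FFFFFFF from the text. */
#define SKETCH_SSTHRESH_UNLIMITED 0x7FFFFFFFu

/* Hypothetical container; the real PCB layout is not shown in this message. */
struct sketch_iw_state {
    uint32_t cwnd;
    uint32_t ssthresh;
};

static uint32_t sketch_iw_min(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t sketch_iw_max(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Pick the initial cwnd/ssthresh pair for a new connection. */
static struct sketch_iw_state sketch_initial_window(bool tso_enabled,
                                                    uint32_t tso_max_payload,
                                                    uint32_t mss)
{
    struct sketch_iw_state s;

    if (tso_enabled) {
        /* TSO: a quarter of the TSO max payload, independent of MSS. */
        s.cwnd = tso_max_payload / 4;
        s.ssthresh = SKETCH_SSTHRESH_UNLIMITED;
    } else {
        /* Non-TSO: RFC 3390 initial window, min(4*MSS, max(2*MSS, 4380)). */
        s.cwnd = sketch_iw_min(4 * mss, sketch_iw_max(2 * mss, 4380u));
        s.ssthresh = 10 * mss;
    }
    return s;
}
```

With the default 256KB TSO payload the TSO branch yields
cwnd = 65536; with MSS = 1460 the non-TSO branch yields
cwnd = 4380 and ssthresh = 14600.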
2. Fixed Slow Start Algorithm:
Changed cwnd increment from mss to acked in both CC
modules:
Before (WRONG):
pcb->cwnd += pcb->mss; // Linear growth
After (CORRECT):
pcb->cwnd += pcb->acked; // Exponential per RFC 5681
This matches modern implementations including Linux CUBIC
and is critical for TSO where one ACK can acknowledge many
segments.
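To make the difference concrete, the sketch below counts how many
ACKs are needed to grow cwnd to a target under each increment
rule. It is a simplified model, not XLIO code: the function name
sketch_acks_to_reach() is hypothetical, every ACK is assumed to
acknowledge the same number of bytes, and ssthresh, losses and
pacing are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Count the ACKs needed for cwnd to reach `target` during slow start.
 * grow_by_acked == false models the old rule (cwnd += mss),
 * grow_by_acked == true models the fixed rule (cwnd += acked).
 * Assumption: every ACK covers exactly `acked_per_ack` bytes.
 */
static unsigned sketch_acks_to_reach(uint32_t cwnd, uint32_t target,
                                     uint32_t mss, uint32_t acked_per_ack,
                                     bool grow_by_acked)
{
    uint32_t increment = grow_by_acked ? acked_per_ack : mss;
    unsigned acks = 0;

    if (increment == 0) {
        return 0; /* avoid an endless loop on degenerate input */
    }
    while (cwnd < target) {
        cwnd += increment;
        acks++;
    }
    return acks;
}
```

Starting from cwnd = 64KB with a 1MB target, MSS = 1460 and 64KB
acknowledged per ACK, the old rule needs 674 ACKs and the fixed
rule needs 15.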
3. TSO-Aware RTO Recovery:
Created tcp_reset_cwnd_on_congestion() for TSO-aware RTO
handling:
For TSO connections (deviates from RFC 5681):
- cwnd = 26KB (10% of TSO max) instead of 1 MSS
- ssthresh = 64KB (25% of TSO max) for slow start
For non-TSO connections (RFC 5681 compliant):
- cwnd = 1 MSS
- ssthresh = max(FlightSize/2, 2*MSS)
Rationale: RFC 5681 predates aggressive TSO. Even Linux
CUBIC (which follows RFC) suffers from slow RTO recovery
with large TSO. Our TSO-aware approach balances fast
recovery with congestion safety.
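A minimal C sketch of the RTO rule described above. It is not the
body of tcp_reset_cwnd_on_congestion(): the struct, the function
name sketch_rto_reset() and its parameters are hypothetical, and
deriving the 10% and 25% values by integer division of the TSO
max payload is an assumption (the real code may use constants).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container; the real PCB layout is not shown in this message. */
struct sketch_rto_state {
    uint32_t cwnd;
    uint32_t ssthresh;
};

/* Recompute cwnd/ssthresh after a retransmission timeout. */
static struct sketch_rto_state sketch_rto_reset(bool tso_enabled,
                                                uint32_t tso_max_payload,
                                                uint32_t mss,
                                                uint32_t flight_size)
{
    struct sketch_rto_state s;

    if (tso_enabled) {
        /* TSO: restart from 10% of TSO max, slow start up to 25%. */
        s.cwnd = tso_max_payload / 10;
        s.ssthresh = tso_max_payload / 4;
    } else {
        /* Non-TSO (RFC 5681): cwnd = 1 MSS, ssthresh = max(FlightSize/2, 2*MSS). */
        uint32_t half_flight = flight_size / 2;
        uint32_t floor_ssthresh = 2 * mss;

        s.cwnd = mss;
        s.ssthresh = half_flight > floor_ssthresh ? half_flight : floor_ssthresh;
    }
    return s;
}
```

With a 256KB TSO payload this gives cwnd = 26214 bytes (about
26KB) and ssthresh = 65536. For a non-TSO connection with
MSS = 1460 and 100000 bytes in flight it gives cwnd = 1460 and
ssthresh = 50000.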
Implementation Details:
- Consolidated initialization logic in 6 locations:
* tcp_pcb_init() - initial PCB setup
* tcp_pcb_recycle() - PCB reuse after TIME_WAIT
* tcp_connect() - client-side connection initiation
* tcp_in.c SYN-ACK handler
* lwip_conn_init() - LWIP CC module initialization
* cubic_conn_init() - CUBIC CC module initialization
- Fixed slow start in 2 locations:
* cc_lwip.c:lwip_ack_received() - changed mss to acked
* cc_cubic.c:cubic_ack_received() - changed mss to acked
- Centralized RTO recovery in 2 locations:
* cc_lwip.c:lwip_cong_signal() - calls helper
* cc_cubic.c:cubic_cong_signal() - calls helper
Performance Impact:
Before: 20+ seconds to reach line-rate (200-300 Gbps)
After: Line-rate achieved in <1 second
Technical Rationale:
1. Very High ssthresh (2GB): RFC 5681 says the initial
ssthresh should be set arbitrarily high, which lets slow
start discover the usable window rather than artificially
limiting growth. This is standard practice for high-BDP data
center deployments.
2. MSS Independence: TSO max payload is determined by
hardware capabilities, not by the negotiated MSS, so the
initial window for TSO connections is likewise derived from
the TSO payload rather than from MSS.
3. Initial cwnd (64KB): Balances aggressive throughput with
conservative buffer management. Exceeds RFC 6928's 10
segments (~15KB) but appropriate for XLIO's controlled
environment where TSO hardware handles segmentation and
applications target high-throughput scenarios.
4. RTO Recovery (26KB): More conservative than the initial
window (10% vs 25% of TSO max) to balance fast recovery with
safety. While this deviates from RFC 5681 (1 MSS), it
recognizes that 1460 bytes is artificially small when
hardware sends 256KB super-packets.
RFC Compliance Notes:
- Initial window and slow start: Compliant with RFC 5681/3390
spirit, optimized for TSO hardware.
- RTO recovery: Intentionally deviates from RFC 5681 for TSO
to address modern hardware reality. Non-TSO connections
remain RFC-compliant.
Comparison to Modern Implementations:
- Linux CUBIC (2024): Uses standard slow start
(cwnd += acked)
- Linux CUBIC RTO: Follows RFC 5681 (cwnd = 1 MSS), which
causes same slow recovery issue with aggressive TSO that we
fix here
- FreeBSD CUBIC (2007-2010): Had slow start bug
(cwnd += mss) that XLIO inherited and we now fix
References:
- RFC 3390: Increasing TCP's Initial Window
- RFC 5681: TCP Congestion Control (slow start, RTO behavior)
- RFC 6928: Increasing TCP's Initial Window (10 segments)
- RFC 9438: CUBIC for Fast and Long-Distance Networks (2023,
obsoletes RFC 8312)
- Linux kernel: net/ipv4/tcp_cubic.c (verified modern CUBIC
behavior)
https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_cubic.c
- Excentis: "Optimizing TCP Congestion Avoidance Parameters
for Gigabit Networks" - recommends very high ssthresh for
fast networks
Verification:
- GDB debugging confirmed ssthresh overwrite during SYN-ACK
processing
- Post-fix: cwnd=64KB and ssthresh=2GB maintained through
connection setup
- Enhanced debug logs confirmed exponential cwnd growth during
slow start
- Throughput testing: 200-300 Gbps achieved in <1s
(previously 20+s)
Signed-off-by: Tomer Cabouly <[email protected]>