issue: 4724535 Fix TSO SS aggressive cwnd/ssthresh #511
base: vNext
Conversation
Greptile Overview
Greptile Summary
This PR fixes a critical TSO throughput issue where connections took 30+ seconds to reach line-rate, now achieving it in <1 second. The root cause was aggressive ssthresh values being overwritten with 10*MSS during SYN-ACK processing, forcing TCP into congestion avoidance mode prematurely.
Key Changes:
- Centralized cwnd/ssthresh initialization in the new `tcp_set_initial_cwnd_ssthresh()` helper
- TSO-enabled connections: `cwnd = max_payload_sz/4` (64KB), `ssthresh = 2GB` (effectively unlimited)
- Non-TSO connections: RFC 3390-compliant `cwnd`, `ssthresh = 10*MSS`
- Replaced 6 initialization sites: `tcp_pcb_init()`, `tcp_pcb_recycle()`, `tcp_connect()`, the SYN-ACK handler, `lwip_conn_init()`, `cubic_conn_init()`
Technical Approach:
The fix properly handles the timing where TSO configuration happens after initial PCB setup but before SYN-ACK. The helper function is TSO-aware and uses tso.max_payload_sz (independent of negotiated MSS) for TSO connections, following industry best practices for high-speed networks.
Minor Issue Found:
- Dead code at tcp_in.c:588 in the `#else` branch (unreachable since `TCP_CC_ALGO_MOD=1`)
Confidence Score: 4/5
- This PR is safe to merge with one minor style improvement recommended
- Score reflects well-tested performance fix with clear rationale, centralized implementation, and proper TSO/non-TSO handling. Minor deduction for dead code in tcp_in.c that could be cleaned up. The core logic is sound and addresses a real production issue with empirical validation (200 Gbps in <1s).
- src/core/lwip/tcp_in.c - contains unreachable dead code in #else branch at line 588 that should be removed for clarity
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/core/lwip/tcp.c | 4/5 | Adds centralized tcp_set_initial_cwnd_ssthresh() function and replaces hardcoded cwnd/ssthresh initialization in tcp_connect(), tcp_pcb_init(), and tcp_pcb_recycle(). TSO-aware logic properly handles both TSO-enabled and non-TSO cases. |
| src/core/lwip/tcp.h | 5/5 | Adds function declaration for tcp_set_initial_cwnd_ssthresh(). Clean header change with no issues. |
| src/core/lwip/tcp_in.c | 3/5 | Replaces hardcoded ssthresh = mss * 10 with tcp_set_initial_cwnd_ssthresh() in SYN-ACK handler. Critical fix for TSO throughput. Contains dead code in #else branch that could cause confusion. |
| src/core/lwip/cc_lwip.c | 5/5 | Updates lwip_conn_init() to call tcp_set_initial_cwnd_ssthresh() when cwnd is uninitialized. Preserves existing behavior with cleaner code. |
| src/core/lwip/cc_cubic.c | 5/5 | Updates cubic_conn_init() to call tcp_set_initial_cwnd_ssthresh() when cwnd is uninitialized. Removes old hardcoded logic, cleaner implementation. |
Sequence Diagram
sequenceDiagram
participant App as Application
participant Socket as sockinfo_tcp
participant PCB as tcp_pcb
participant TCPIn as tcp_in (SYN-ACK)
participant Helper as tcp_set_initial_cwnd_ssthresh()
participant CC as Congestion Control Module
Note over App,CC: Connection Establishment Flow
App->>Socket: socket() creation
Socket->>PCB: tcp_pcb_init()
PCB->>Helper: Set initial cwnd/ssthresh
Note over Helper: tso.max_payload_sz = 0<br/>Uses non-TSO values<br/>cwnd = RFC 3390<br/>ssthresh = 10*MSS
Helper-->>PCB: cwnd set
App->>Socket: connect()
Socket->>PCB: tcp_connect()
PCB->>Helper: Re-initialize cwnd/ssthresh
Note over Helper: TSO still not configured<br/>Uses non-TSO values
Helper-->>PCB: cwnd set
PCB->>TCPIn: Send SYN
Note over Socket,PCB: TSO Configuration
Socket->>PCB: prepare_to_send()
Note over PCB: tso.max_payload_sz = 256KB<br/>(from hardware caps)
TCPIn->>PCB: Receive SYN-ACK
Note over TCPIn: MSS negotiation complete
TCPIn->>Helper: CRITICAL: Re-initialize after MSS
Note over Helper: NOW TSO is configured!<br/>cwnd = max_payload_sz/4 = 64KB<br/>ssthresh = 0x7FFFFFFF (2GB)
Helper-->>TCPIn: Aggressive TSO values set
TCPIn->>CC: cc_conn_init()
Note over CC: Preserves cwnd if already set<br/>(checks cwnd != 1)
CC-->>TCPIn: Connection ready
Note over App,CC: Fast Slow Start with TSO
TCPIn->>PCB: Start data transmission
Note over PCB: Exponential growth with<br/>cwnd=64KB, ssthresh=2GB<br/>Reaches line-rate in <1s
Additional Comments (1)
- src/core/lwip/tcp_in.c, line 588 (style): Dead code: this `#else` branch overwrites the cwnd just set by `tcp_set_initial_cwnd_ssthresh()` on line 584. While `TCP_CC_ALGO_MOD` is always defined to 1 (opt.h:114), making this unreachable, the dead code could confuse future maintainers.
5 files reviewed, 1 comment
Force-pushed 6797c58 to d24163b
Greptile Overview
Greptile Summary
Fixes critical 30-second throughput ramp-up issue for TSO-enabled TCP connections by implementing proper initial congestion window and slow start threshold values.
Key Changes:
- Introduced centralized `tcp_set_initial_cwnd_ssthresh()` helper function (tcp.c:520-549) that sets TSO-aware parameters: cwnd=64KB and ssthresh=2GB for TSO, or RFC 3390-compliant values for non-TSO
- Fixed critical bug at tcp_in.c:584 where `ssthresh` was hardcoded to 10*MSS during SYN-ACK processing, forcing immediate congestion avoidance mode instead of slow start
- Replaced duplicate initialization logic across 6 locations: `tcp_pcb_init()`, `tcp_pcb_recycle()`, `tcp_connect()`, the SYN-ACK handler, `lwip_conn_init()`, and `cubic_conn_init()`
- Added guard logic in CC modules to preserve cwnd/ssthresh values set during connection establishment (only re-initialize if cwnd==1)
Technical Approach:
The fix uses aggressive initial parameters for TSO (cwnd=TSO_max/4, ssthresh=2GB) that are independent of negotiated MSS, allowing exponential slow start growth to quickly discover optimal sending rate. For non-TSO, maintains RFC 3390 compliance with MSS-based initialization.
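The non-TSO path keeps the RFC 3390 formula. As a quick illustration, here is a standalone sketch of that computation (not XLIO's actual code, just the formula from the text):

```c
#include <assert.h>
#include <stdint.h>

/* RFC 3390 initial window: min(4*MSS, max(2*MSS, 4380 bytes)).
 * Hypothetical standalone helper for illustration only. */
static uint32_t rfc3390_initial_cwnd(uint32_t mss)
{
    uint32_t upper = 4u * mss;
    uint32_t lower = (2u * mss > 4380u) ? 2u * mss : 4380u;
    return (upper < lower) ? upper : lower;
}
```

For the common MSS of 1460 this yields 4380 bytes (roughly 3 segments), which is why the TSO path's 64KB initial window is such a large step up.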
Performance Impact:
Reduces time-to-line-rate from 20+ seconds to <1 second for 200 Gbps TSO connections.
Confidence Score: 4/5
- Safe to merge with minor considerations for aggressive TSO parameters in diverse network conditions
- The implementation is sound with proper centralization and correct ordering (tcp_set_initial_cwnd_ssthresh before cc_conn_init). Score reduced from 5 to 4 due to: (1) Very aggressive TSO parameters (64KB initial cwnd, 2GB ssthresh) deviate significantly from RFC 6928 (10 segments ~15KB) and may cause issues in networks with shallow buffers or high latency, though appropriate for XLIO's controlled high-throughput environment; (2) The fix assumes TSO max_payload_sz is always properly initialized before tcp_set_initial_cwnd_ssthresh is called, which appears valid based on code flow but isn't explicitly validated
- tcp_in.c requires attention to verify the SYN-ACK handler change works correctly across all connection scenarios (client-side, server-side, connection reuse)
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/core/lwip/tcp.h | 5/5 | Added declaration for tcp_set_initial_cwnd_ssthresh() helper function at tcp.h:408 |
| src/core/lwip/tcp.c | 5/5 | Implemented centralized tcp_set_initial_cwnd_ssthresh() function with TSO-aware initialization (tcp.c:520-549), replaced duplicate initialization in tcp_connect(), tcp_pcb_init(), and tcp_pcb_recycle() |
| src/core/lwip/tcp_in.c | 4/5 | Critical fix at tcp_in.c:584 - replaced hardcoded ssthresh = mss * 10 with tcp_set_initial_cwnd_ssthresh() call in SYN-ACK handler, maintaining proper ordering before cc_conn_init() to prevent CC module override |
| src/core/lwip/cc_lwip.c | 5/5 | Updated lwip_conn_init() to call tcp_set_initial_cwnd_ssthresh() only when cwnd == 1, preserving values set during connection establishment |
| src/core/lwip/cc_cubic.c | 5/5 | Updated cubic_conn_init() to call tcp_set_initial_cwnd_ssthresh() only when cwnd == 1, preserving values set during connection establishment and maintaining max_cwnd tracking |
Sequence Diagram
sequenceDiagram
participant App as Application
participant TCP as tcp_connect()
participant SYN as SYN-ACK Handler<br/>(tcp_in.c:584)
participant Helper as tcp_set_initial_cwnd_ssthresh()
participant CC as CC Module<br/>(lwip/cubic)
Note over App,CC: Client-Side Connection Establishment
App->>TCP: tcp_connect()
TCP->>Helper: tcp_set_initial_cwnd_ssthresh()
Helper->>Helper: Check tcp_tso(pcb)
alt TSO Enabled
Helper->>TCP: cwnd = max_payload_sz/4 (64KB)<br/>ssthresh = 0x7FFFFFFF (2GB)
else Non-TSO
Helper->>TCP: cwnd = RFC 3390 (4*MSS)<br/>ssthresh = 10*MSS
end
TCP->>TCP: Send SYN packet
Note over SYN: Receive SYN-ACK from server
SYN->>SYN: tcp_parseopt() - negotiate MSS
SYN->>SYN: Update pcb->mss with negotiated value
Note over SYN,Helper: CRITICAL FIX: Re-initialize after MSS negotiation
SYN->>Helper: tcp_set_initial_cwnd_ssthresh()
Helper->>Helper: Check tcp_tso(pcb)
alt TSO Enabled
Note right of Helper: TSO parameters are independent<br/>of negotiated MSS
Helper->>SYN: cwnd = max_payload_sz/4 (64KB)<br/>ssthresh = 0x7FFFFFFF (2GB)
else Non-TSO
Note right of Helper: Scale with negotiated MSS
Helper->>SYN: cwnd = min(4*MSS, max(2*MSS, 4380))<br/>ssthresh = 10*MSS
end
SYN->>CC: cc_conn_init()
CC->>CC: Check if cwnd == 1
alt cwnd != 1 (already initialized)
Note right of CC: Preserve cwnd/ssthresh<br/>set by Helper function
CC->>CC: Skip re-initialization
else cwnd == 1 (uninitialized)
CC->>Helper: tcp_set_initial_cwnd_ssthresh()
end
CC->>CC: Initialize CC-specific state<br/>(e.g., cubic max_cwnd = cwnd)
SYN->>App: Connection ESTABLISHED
Note over App,CC: TCP now in slow start with proper cwnd/ssthresh<br/>TSO: Exponential growth from 64KB<br/>Non-TSO: RFC-compliant growth from 4*MSS
5 files reviewed, no comments
Force-pushed d24163b to 3d9aeb0
Greptile Summary
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant App as "Application (TCP Client)"
participant XLIO as "XLIO TCP Stack"
participant Network
User->>App: Initiate connection
App->>XLIO: tcp_connect()
XLIO->>XLIO: tcp_set_initial_cwnd_ssthresh()<br/>(TSO: cwnd=64KB, ssthresh=2GB)
XLIO->>Network: Send SYN
Network->>XLIO: Receive SYN-ACK (MSS negotiated)
XLIO->>XLIO: tcp_set_initial_cwnd_ssthresh()<br/>(Re-initialize after MSS)
XLIO->>XLIO: cc_conn_init()<br/>(CC algorithm initialization)
XLIO->>Network: Send ACK (Connection established)
loop Data Transfer (Slow Start)
App->>XLIO: Send data
XLIO->>Network: Send segments (up to cwnd)
Network->>XLIO: Receive ACK
XLIO->>XLIO: cc_ack_received()<br/>(cwnd += acked, exponential growth)
Note over XLIO: cwnd grows: 64KB→128KB→256KB...
end
alt Congestion Event (RTO)
XLIO->>XLIO: tcp_reset_cwnd_on_congestion(is_rto=true)<br/>(TSO: cwnd=26KB, ssthresh=64KB)
XLIO->>Network: Retransmit lost segment
loop Recovery
Network->>XLIO: Receive ACK
XLIO->>XLIO: cc_ack_received()<br/>(Slow start from 26KB)
end
end
5 files reviewed, no comments
Resolve 30-second throughput ramp-up issue for TSO-enabled
TCP connections by fixing multiple congestion control bugs
and implementing TSO-aware initial window and RTO recovery
parameters.
Problem:
Applications using TSO experienced ~30 seconds of near-zero
throughput before achieving line-rate. Investigation revealed
multiple interrelated bugs in XLIO's congestion control
implementation:
1. ssthresh Reset Bug: ssthresh was unconditionally reset to
10*MSS (14,600 bytes) during SYN-ACK processing in
tcp_in.c, forcing TCP into congestion avoidance mode
immediately instead of slow start.
2. Slow Start Algorithm Bug: Both LWIP and CUBIC modules
implemented incorrect linear cwnd growth (cwnd += mss)
instead of RFC 5681's exponential growth (cwnd += acked).
This is particularly devastating for TSO where one ACK can
acknowledge 64KB (44 segments), but cwnd only grew by
1 MSS (1460 bytes).
3. RTO Recovery Bug: On retransmission timeout, cwnd was
reset to 1 MSS per RFC 5681, which is too conservative for
modern TSO hardware that sends 256KB super-packets, causing
20+ second recovery times.
Root Cause Analysis:
XLIO's CUBIC implementation is based on FreeBSD CUBIC
(2007-2010), which contained the slow start bug. Modern CUBIC
(Linux kernel 2024, RFC 9438) uses standard TCP slow start
with exponential growth. Verification against Linux kernel
source (net/ipv4/tcp_cubic.c) confirmed that XLIO's behavior
was incorrect and outdated.
Solution:
1. Centralized Initial Window Management:
Created tcp_set_initial_cwnd_ssthresh() to consistently
set TSO-aware parameters across all connection
initialization paths:
For TSO-enabled connections:
- cwnd = TSO_max_payload / 4 (64KB with default 256KB)
- ssthresh = 0x7FFFFFFF (2GB - effectively unlimited)
For non-TSO connections:
- cwnd = RFC 3390: min(4*MSS, max(2*MSS, 4380 bytes))
- ssthresh = 10 * MSS
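A minimal sketch of this centralized helper, using assumed field names (the real `tcp_pcb` layout and the actual function differ; this only mirrors the values listed above):

```c
#include <assert.h>
#include <stdint.h>

#define SSTHRESH_UNLIMITED 0x7FFFFFFFU /* 2GB: effectively unlimited */

/* Illustrative pcb stand-in; field names are assumptions. */
struct pcb_sketch {
    uint32_t cwnd;
    uint32_t ssthresh;
    uint16_t mss;
    uint32_t tso_max_payload_sz; /* 0 when TSO is not configured */
};

static void set_initial_cwnd_ssthresh(struct pcb_sketch *pcb)
{
    if (pcb->tso_max_payload_sz != 0) {
        /* TSO: parameters independent of negotiated MSS */
        pcb->cwnd = pcb->tso_max_payload_sz / 4;  /* 64KB with 256KB max */
        pcb->ssthresh = SSTHRESH_UNLIMITED;
    } else {
        /* Non-TSO: RFC 3390 cwnd, ssthresh = 10 * MSS */
        uint32_t upper = 4u * pcb->mss;
        uint32_t lower = (2u * pcb->mss > 4380u) ? 2u * pcb->mss : 4380u;
        pcb->cwnd = (upper < lower) ? upper : lower;
        pcb->ssthresh = 10u * pcb->mss;
    }
}
```

Calling this from every initialization site means the SYN-ACK handler can safely re-run it after MSS negotiation: the TSO branch does not depend on MSS at all.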
2. Fixed Slow Start Algorithm:
Changed cwnd increment from mss to acked in both CC
modules:
Before (WRONG):
pcb->cwnd += pcb->mss; // Linear growth
After (CORRECT):
pcb->cwnd += pcb->acked; // Exponential per RFC 5681
This matches modern implementations including Linux CUBIC
and is critical for TSO where one ACK can acknowledge many
segments.
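The difference compounds per round trip. A toy model makes this concrete, assuming one ACK per RTT that covers the entire outstanding window (a common pattern with TSO/aggregated ACKs; illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/* Grow cwnd over `rtts` round trips, comparing the buggy per-ACK
 * increment (cwnd += mss) with RFC 5681 (cwnd += acked). */
static uint32_t grow(uint32_t cwnd, uint32_t mss, int rtts, int rfc5681)
{
    for (int i = 0; i < rtts; i++) {
        uint32_t acked = cwnd;              /* whole window acknowledged */
        cwnd += rfc5681 ? acked : mss;      /* exponential vs. linear */
    }
    return cwnd;
}
```

Starting from 64KB, the RFC 5681 form doubles each round trip (64KB, 128KB, 256KB, ...) while the buggy form adds only one MSS per ACK, which is why ramp-up took tens of seconds.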
3. TSO-Aware RTO Recovery:
Created tcp_reset_cwnd_on_congestion() for TSO-aware RTO
handling:
For TSO connections (deviates from RFC 5681):
- cwnd = 26KB (10% of TSO max) instead of 1 MSS
- ssthresh = 64KB (25% of TSO max) for slow start
For non-TSO connections (RFC 5681 compliant):
- cwnd = 1 MSS
- ssthresh = max(FlightSize/2, 2*MSS)
Rationale: RFC 5681 predates aggressive TSO. Even Linux
CUBIC (which follows RFC) suffers from slow RTO recovery
with large TSO. Our TSO-aware approach balances fast
recovery with congestion safety.
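A sketch of the recovery policy above, with assumed names and out-parameters for illustration (the real `tcp_reset_cwnd_on_congestion()` operates on the pcb directly):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative TSO-aware RTO recovery. TSO: cwnd = 10% of TSO max,
 * ssthresh = 25% of TSO max. Non-TSO: RFC 5681 values. */
static void reset_cwnd_on_congestion(uint32_t *cwnd, uint32_t *ssthresh,
                                     uint32_t mss, uint32_t tso_max,
                                     uint32_t flight_size)
{
    if (tso_max != 0) {
        *cwnd = tso_max / 10;     /* ~26KB with a 256KB TSO max */
        *ssthresh = tso_max / 4;  /* 64KB: slow start resumes up to here */
    } else {
        *cwnd = mss;              /* RFC 5681: back to 1 MSS */
        uint32_t half = flight_size / 2;
        *ssthresh = (half > 2u * mss) ? half : 2u * mss;
    }
}
```

Keeping ssthresh at 25% of the TSO max (above the 10% cwnd) ensures the connection re-enters slow start after an RTO rather than crawling through congestion avoidance.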
Implementation Details:
- Consolidated initialization logic in 6 locations:
* tcp_pcb_init() - initial PCB setup
* tcp_pcb_recycle() - PCB reuse after TIME_WAIT
* tcp_connect() - client-side connection initiation
* tcp_in.c SYN-ACK handler
* lwip_conn_init() - LWIP CC module initialization
* cubic_conn_init() - CUBIC CC module initialization
- Fixed slow start in 2 locations:
* cc_lwip.c:lwip_ack_received() - changed mss to acked
* cc_cubic.c:cubic_ack_received() - changed mss to acked
- Centralized RTO recovery in 2 locations:
* cc_lwip.c:lwip_cong_signal() - calls helper
* cc_cubic.c:cubic_cong_signal() - calls helper
Performance Impact:
Before: 20+ seconds to reach line-rate (200-300 Gbps)
After: Line-rate achieved in <1 second
Technical Rationale:
1. Very High ssthresh (2GB): Follows industry best practices
for high-BDP networks, allowing slow start to discover
optimal window rather than artificially limiting growth.
Standard TCP behavior for modern data center deployments.
2. TSO-Independence: TSO max payload is determined by
hardware capabilities, not negotiated MSS. Initial window
should similarly be independent of MSS for TSO
connections.
3. Initial cwnd (64KB): Balances aggressive throughput with
conservative buffer management. Exceeds RFC 6928's 10
segments (~15KB) but appropriate for XLIO's controlled
environment where TSO hardware handles segmentation and
applications target high-throughput scenarios.
4. RTO Recovery (26KB): More conservative than initial window
(25% vs 10% of TSO max) to balance fast recovery with
safety. While this deviates from RFC 5681 (1 MSS), it
recognizes the reality that 1460 bytes is artificially
small when hardware sends 256KB super-packets.
RFC Compliance Notes:
- Initial window and slow start: Compliant with RFC 5681/3390
spirit, optimized for TSO hardware.
- RTO recovery: Intentionally deviates from RFC 5681 for TSO
to address modern hardware reality. Non-TSO connections
remain RFC-compliant.
Comparison to Modern Implementations:
- Linux CUBIC (2024): Uses standard slow start
(cwnd += acked)
- Linux CUBIC RTO: Follows RFC 5681 (cwnd = 1 MSS), which
causes same slow recovery issue with aggressive TSO that we
fix here
- FreeBSD CUBIC (2007-2010): Had slow start bug
(cwnd += mss) that XLIO inherited and we now fix
References:
- RFC 3390: Increasing TCP's Initial Window
- RFC 5681: TCP Congestion Control (slow start, RTO behavior)
- RFC 6928: Increasing TCP's Initial Window (10 segments)
- RFC 9438: CUBIC for Fast and Long-Distance Networks (2023,
obsoletes RFC 8312)
- Linux kernel: net/ipv4/tcp_cubic.c (verified modern CUBIC
behavior)
https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_cubic.c
- Excentis: "Optimizing TCP Congestion Avoidance Parameters
for Gigabit Networks" - recommends very high ssthresh for
fast networks
Verification:
- GDB debugging confirmed ssthresh overwrite during SYN-ACK
processing
- Post-fix: cwnd=64KB and ssthresh=2GB maintained through
connection setup
- Enhanced debug logs confirmed exponential cwnd growth during
slow start
- Throughput testing: 200-300 Gbps achieved in <1s
(previously 20+s)
Signed-off-by: Tomer Cabouly <[email protected]>
Force-pushed 3d9aeb0 to fae0026
5 files reviewed, 1 comment
```c
 * Critical for TSO where one ACK can acknowledge many segments (e.g., 64KB = 44
 * segments).
 */
pcb->cwnd += pcb->acked;
```
logic: missing overflow check; the lwip implementation has this protection. Suggested change:

```c
if ((u32_t)(pcb->cwnd + pcb->acked) > pcb->cwnd) {
    pcb->cwnd += pcb->acked;
}
```
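The suggested guard can be factored into a tiny helper to show the wraparound behavior (illustrative only, using `uint32_t` in place of lwIP's `u32_t`):

```c
#include <assert.h>
#include <stdint.h>

/* Overflow-guarded cwnd increment: if the unsigned addition would wrap,
 * leave cwnd unchanged instead of shrinking it. */
static uint32_t cwnd_add_guarded(uint32_t cwnd, uint32_t acked)
{
    if ((uint32_t)(cwnd + acked) > cwnd) {
        cwnd += acked;
    }
    return cwnd;
}
```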
```c
if (tcp_tso(pcb)) {
    /* TSO-aware recovery (deviates from RFC 5681 and modern CUBIC):
     *
     * RFC 5681 & Modern CUBIC (Linux 2024): cwnd = 1 MSS (1460 bytes)
```
The MSS is not always 1460, e.g. with jumbo frames.
Description
Resolve 30-second throughput ramp-up issue for TSO-enabled TCP connections by implementing proper initial congestion window (cwnd) and slow start threshold (ssthresh) values.
Problem:
Applications using TSO experienced ~30 seconds of near-zero throughput before achieving line-rate. Debug analysis revealed that ssthresh was being unconditionally reset to 10*MSS (14,600 bytes) during SYN-ACK processing in tcp_in.c line 582, forcing TCP into congestion avoidance mode immediately. This caused linear cwnd growth instead of exponential slow start, resulting in extremely slow ramp-up.
Solution:
Created centralized helper function tcp_set_initial_cwnd_ssthresh() that sets TSO-aware parameters:
For TSO-enabled connections: cwnd = TSO_max_payload/4 (64KB with the default 256KB) and ssthresh = 0x7FFFFFFF (2GB, effectively unlimited).
For non-TSO connections: cwnd per RFC 3390 (min(4*MSS, max(2*MSS, 4380 bytes))) and ssthresh = 10*MSS.
Technical Rationale:
Very high ssthresh (2GB) follows industry best practices, allowing slow start to run until network conditions dictate otherwise rather than artificially limiting growth (Excentis research on optimizing TCP for gigabit networks).
TSO max payload is independent of negotiated MSS (determined by hardware capabilities), so initial window should also be independent of MSS for TSO connections.
Initial cwnd of 64KB (TSO_max/4) balances aggressive throughput with conservative buffer management. This exceeds RFC 6928's recommendation of 10 segments (~15KB) but is appropriate for XLIO's controlled environment where TSO hardware handles segmentation and applications target high-throughput scenarios. Empirically verified to achieve 200 Gbps in <1 second.
Implementation Details:
Performance Impact:
Before: 20+ seconds to reach line-rate (200 Gbps)
After: Line-rate achieved in <1 second
Verification: GDB debugging confirmed ssthresh was being overwritten during SYN-ACK processing. After fix, cwnd=64KB and ssthresh=2GB are maintained throughout connection establishment, enabling exponential growth as designed.
References:
What
Fix TSO slow start with aggressive cwnd/ssthresh
Why ?
xlio_benchmark to alpha.