-
Notifications
You must be signed in to change notification settings - Fork 29
issue: 4392235 Fixing full SQ completion #494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1. Common wqe fill operation for four fill kinds in fill_wqe are unified at the end of fill_wqe 2. Making fill_wqe to return number of consumed wqebbs in uint8_t Signed-off-by: Alexander Grissik <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR fixes a critical Send Queue (SQ) corruption bug that occurs when the SQ becomes full or nearly full. The corruption manifested when new Work Queue Elements (WQEs) overwrote the last_signalled metadata slot and completions arrived for WQEs adjacent to last_signalled, causing incorrect completion processing. The fix introduces two main changes: (1) a new completion tracking mechanism via m_last_sq_wqe_prop_to_complete and per-WQE wqebbs fields in hw_queue_tx.h to track the oldest in-flight WQE and prevent out-of-order completions, and (2) unification of the WQE posting process by refactoring store_current_wqe_prop() into submit_wqe() and consolidating doorbell logic. The fix removes obsolete methods (reset_inflight_zc_buffers_ctx(), is_sq_wqe_prop_valid()) from ring, ring_bond, ring_simple, and dst_entry classes, centralizing zero-copy buffer management at the hardware queue layer. Additionally, the RX path in hw_queue_rx.cpp is refactored to delegate WQE posting to xlio_raw_post_recv(), mirroring the unified TX approach and ensuring consistent queue-full handling. The changes touch both TX and RX hardware queue implementations, completion queue managers, and ring abstraction layers, reflecting a structural improvement to XLIO's RDMA queue management.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| src/core/dev/hw_queue_tx.h | 4/5 | Introduces m_last_sq_wqe_prop_to_complete and wqebbs field for correct SQ completion tracking, refactors methods to unify WQE posting |
| src/core/dev/hw_queue_tx.cpp | 3/5 | Implements core logic for unified WQE posting and submission (details not fully visible in summary) |
| src/core/dev/cq_mgr_tx.cpp | 4/5 | Changes loop termination to check m_last_sq_wqe_prop_to_complete instead of is_sq_wqe_prop_valid(), updates completion pointer calculation using wqebbs |
| src/core/dev/hw_queue_rx.cpp | 4/5 | Refactors RX posting to delegate to xlio_raw_post_recv(), unifying doorbell and queue-full checks; adds unused ALIGN_WR_DOWN macro |
| src/core/proto/dst_entry.h | 4/5 | Removes reset_inflight_zc_buffers_ctx() public API method, consolidating zero-copy context management at ring layer |
| src/core/dev/ring.h | 4/5 | Removes virtual stub method reset_inflight_zc_buffers_ctx() from base ring class |
| src/core/dev/ring_simple.h | 3/5 | Removes reset_inflight_zc_buffers_ctx() override; safe if cleanup moved elsewhere but potential resource leak risk if not |
| src/core/dev/ring_bond.h | 3/5 | Removes delegation method reset_inflight_zc_buffers_ctx(); safe if callers updated or functionality absorbed |
| src/core/dev/hw_queue_rx.h | 4/5 | Contains pre-existing documentation error stating class manages SQ operations when it should say RQ operations |
Confidence score: 4/5
- This PR is safe to merge with moderate risk; the changes are well-structured and address a real SQ corruption bug.
- Score reflects that the core TX completion logic is sound and well-commented, but full verification of
hw_queue_tx.cppwas not possible due to incomplete visibility, and removal ofreset_inflight_zc_buffers_ctx()across multiple classes could cause resource leaks if zero-copy cleanup was not properly migrated to the new unified mechanism. - Pay close attention to
src/core/dev/hw_queue_tx.cpp(core WQE posting implementation),src/core/dev/ring_simple.h(verify zero-copy buffer cleanup is handled), andsrc/core/dev/cq_mgr_tx.cpp(completion loop termination logic with new pointer arithmetic).
Sequence Diagram
sequenceDiagram
participant App as Application
participant Ring as ring_simple
participant HQTX as hw_queue_tx
participant CQTX as cq_mgr_tx
participant HW as Hardware SQ/CQ
Note over App,HW: Send Operation Flow
App->>Ring: send_lwip_buffer(send_wqe)
Ring->>HQTX: send_wqe(send_wqe, credits)
HQTX->>HQTX: fill_wqe(send_wqe)
Note over HQTX: Calculate wqebbs for WQE
HQTX->>HQTX: submit_wqe(buf, credits, wqebbs, ti)
Note over HQTX: Store sq_wqe_prop with:<br/>buf, credits, wqebbs, ti, next ptr
HQTX->>HQTX: m_sq_wqe_prop_last = &m_sq_wqe_idx_to_prop[index]
Note over HQTX: Build linked list of WQE properties
HQTX->>HQTX: ring_doorbell(wqebbs)
Note over HQTX: Check if completion needed<br/>Set/decrement unsignaled count
HQTX->>HW: Post WQE to Send Queue
alt Completion Signal Requested
HQTX->>CQTX: poll_and_process_element_tx()
end
Note over App,HW: Completion Handling Flow (Fix)
HW->>CQTX: Completion Event (CQE)
CQTX->>CQTX: poll_and_process_element_tx()
CQTX->>CQTX: get_cqe_tx(num_polled_cqes)
Note over CQTX: Get CQE with wqe_counter
alt Error CQE Detected
CQTX->>CQTX: log_cqe_error(cqe, index, credits)
Note over CQTX: Log error with WQE hexdump
end
CQTX->>CQTX: handle_sq_wqe_prop(index)
Note over CQTX: Traverse linked list of WQE properties
loop Until m_last_sq_wqe_prop_to_complete
CQTX->>Ring: mem_buf_desc_return_single_locked(buf)
Note over CQTX: Return buffer if present
alt TI Present
CQTX->>CQTX: ti->m_callback()
Note over CQTX: Execute TI callback
CQTX->>CQTX: ti->put()
alt TI Released
CQTX->>CQTX: ti->ti_released()
end
end
CQTX->>CQTX: credits += p->credits
Note over CQTX: Accumulate credits for all WQEs
CQTX->>CQTX: p = p->next
end
CQTX->>Ring: return_tx_pool_to_global_pool()
CQTX->>HQTX: credits_return(credits)
Note over HQTX: m_sq_free_credits += credits
CQTX->>CQTX: m_last_sq_wqe_prop_to_complete = next_prop
Note over CQTX: Update completion pointer using wqebbs
CQTX->>CQTX: update_global_sn_tx()
Additional Comments (1)
-
src/core/dev/hw_queue_rx.h, line 21 (link)syntax: comment says "manages the SQ operations" but this class is hw_queue_rx which manages RQ operations
9 files reviewed, 1 comment
tomerdbz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work! really enjoyed reading the solution @AlexanderGrissik
Some comments though:
- look at the diff - it contains weirdly hq_queue_rx - I think you accidentally added chmod u+x to them. please fix this as this is not related and probably not intended.
- Please add at least 1 test to gtest that triggers the issue (fails w/o your fix, passes with it) - this is such a gentle logic - we must make sure it's future-proof.
Unifying duplicade code of calls to store_sq_prop, ring_doorbell and update_next_wqe_hot under submit_wqe for all different WQE posting methods. Signed-off-by: Alexander Grissik <[email protected]>
5673ff9 to
3533704
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR fixes a critical SQ (Send Queue) corruption bug that occurs when the queue becomes full or nearly full. The issue arose when new WQEs (Work Queue Elements) were written into memory space occupied by the last signalled WQE, and when completions arrived for WQEs adjacent to the last signalled position. The fix introduces a unified WQE posting process with explicit tracking of WQE sizes in WQEBBs (Work Queue Element Basic Blocks) separate from credit accounting, replaces index-based completion tracking with a pointer-based linked-list approach, and removes fragile validation logic. The changes span the hardware queue layer (hw_queue_tx), the completion manager (cq_mgr_tx), the ring abstraction hierarchy (ring, ring_simple, ring_bond), and destination entry logic (dst_entry). A new constant XLIO_NOP_WQEBBS is added to support NOP WQE sizing. A comprehensive regression test is included that reproduces the wraparound scenario by filling the entire SQ with 16384 2-WQEBB WQEs after an initial burst, validating that completions are correctly processed without corruption.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| src/core/dev/hw_queue_tx.h | 4/5 | Adds wqebbs field to sq_wqe_prop, introduces m_last_sq_wqe_prop_to_complete pointer with detailed documentation, removes m_sq_wqe_prop_last_signalled and is_sq_wqe_prop_valid(), and updates method signatures for unified WQE posting. |
| src/core/dev/hw_queue_tx.cpp | 0/5 | Implementation file for the hw_queue_tx changes; details not provided in the diff summary. |
| src/core/dev/cq_mgr_tx.cpp | 4/5 | Changes completion loop to use pointer comparison against m_last_sq_wqe_prop_to_complete and updates the pointer by advancing by wqebbs with modular arithmetic to handle wraparound correctly. |
| src/core/ib/mlx5/ib_mlx5.h | 5/5 | Adds XLIO_NOP_WQEBBS constant (1U) for NOP WQE size in WQEBBs, consistent with existing WQE size definitions. |
| tests/gtest/xlio_ultra_api/xlio_socket_send_receive_full_sq.cc | 3/5 | New regression test that reproduces the SQ corruption scenario by filling the queue with 16384 2-WQEBB WQEs; hardcodes SQ size assumptions and uses fork-based client/server pattern. |
| src/core/dev/ring_simple.h | 3/5 | Removes reset_inflight_zc_buffers_ctx() override; suggests zero-copy buffer context management has been refactored or moved. |
| src/core/dev/ring_bond.h | 3/5 | Removes reset_inflight_zc_buffers_ctx() method; part of the unified WQE posting refactor, though zerocopy support for bonded rings remains incomplete per existing TODO. |
| src/core/proto/dst_entry.h | 4/5 | Removes reset_inflight_zc_buffers_ctx() delegation method; zero-copy buffer lifecycle now managed elsewhere in the unified flow. |
| src/core/dev/ring.h | 4/5 | Removes the reset_inflight_zc_buffers_ctx() virtual method from the ring base class interface. |
| tests/gtest/Makefile.am | 5/5 | Adds new test file xlio_socket_send_receive_full_sq.cc to the gtest build. |
Confidence score: 4/5
- This PR addresses a critical bug with a well-structured fix that unifies WQE posting and completion tracking, and includes a targeted regression test.
- Score lowered by one point due to: (1) the new test hardcodes SQ size assumptions that may not match runtime configuration, (2) removal of
reset_inflight_zc_buffers_ctx()across multiple classes without visible replacement logic in the provided diffs (the implementation details in hw_queue_tx.cpp were not shown), and (3) the test uses static variables and fork() which can introduce subtle state-sharing issues. - Pay close attention to src/core/dev/hw_queue_tx.cpp (implementation not shown in diff), the new regression test (xlio_socket_send_receive_full_sq.cc) for hardcoded assumptions, and verify that zero-copy buffer context management is correctly handled after the removal of reset methods across ring/dst_entry classes.
Sequence Diagram
sequenceDiagram
participant User
participant hw_queue_tx
participant SQ as Send Queue
participant cq_mgr_tx
participant Ring
Note over hw_queue_tx,SQ: WQE Submission Flow
User->>hw_queue_tx: send_wqe(p_send_wqe, attr, tis, credits)
hw_queue_tx->>hw_queue_tx: fill_wqe(p_send_wqe)
Note right of hw_queue_tx: Calculate WQEBBs needed
hw_queue_tx->>hw_queue_tx: submit_wqe(buf, credits, wqebbs, ti)
Note over hw_queue_tx,SQ: Store WQE properties
hw_queue_tx->>hw_queue_tx: m_sq_wqe_idx_to_prop[index] = {buf, credits, wqebbs, ti, next}
Note right of hw_queue_tx: Build linked list for completion
hw_queue_tx->>hw_queue_tx: m_sq_wqe_prop_last = &m_sq_wqe_idx_to_prop[index]
hw_queue_tx->>SQ: ring_doorbell(wqebbs)
Note right of SQ: Post WQE to hardware
Note over cq_mgr_tx,Ring: Completion Flow
User->>cq_mgr_tx: poll_and_process_element_tx()
cq_mgr_tx->>SQ: get_cqe_tx()
SQ-->>cq_mgr_tx: cqe (Completion Queue Entry)
alt CQE Error
cq_mgr_tx->>cq_mgr_tx: log_cqe_error(cqe, index, credits)
Note right of cq_mgr_tx: Log WQE details with wqe_to_hexstring
end
cq_mgr_tx->>cq_mgr_tx: handle_sq_wqe_prop(index)
Note over cq_mgr_tx: Process linked list of WQEs
loop Until last completed WQE
cq_mgr_tx->>Ring: mem_buf_desc_return_single_locked(buf)
Note right of cq_mgr_tx: Return buffer to pool
alt Has TI (Transport Interface)
cq_mgr_tx->>hw_queue_tx: ti->put()
Note right of hw_queue_tx: Release TIS/TIR reference
end
cq_mgr_tx->>hw_queue_tx: credits_return(credits)
Note right of hw_queue_tx: Free SQ credits (WQEBBs)
cq_mgr_tx->>cq_mgr_tx: p = p->next
Note right of cq_mgr_tx: Move to next WQE in chain
end
Note over cq_mgr_tx,hw_queue_tx: Update completion tracking
cq_mgr_tx->>hw_queue_tx: m_last_sq_wqe_prop_to_complete = next_index
Note right of hw_queue_tx: Track last completed WQE
cq_mgr_tx->>Ring: return_tx_pool_to_global_pool()
10 files reviewed, 4 comments
|
Added gtest based on Ultra API for that scenario. |
3533704 to
4f7ab36
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR fixes a critical SQ (Send Queue) corruption bug that occurs when XLIO fills or nearly fills the hardware send queue. The root cause was that WQEs could be written to the last_signalled space and completions would arrive for adjacent WQEs, causing index tracking to become inconsistent. The fix introduces pointer-based completion tracking (m_last_sq_wqe_prop_to_complete) instead of the previous index-based approach (m_sq_wqe_prop_last_signalled), and unifies all WQE posting through a single submit_wqe() bottleneck. The centralized submission ensures that buffer descriptors, credit accounting, and WQE size in basic blocks (wqebbs) are stored atomically before ringing the doorbell. A new gtest validates the fix by deliberately filling the SQ with messages and verifying completions don't corrupt memory, and CI now includes a "special test" run with CUBIC congestion control and TCP_NODELAY to stress the full-queue scenario.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| src/core/dev/hw_queue_tx.cpp | 4/5 | Implements centralized WQE posting via new submit_wqe() method; all WQE types now store properties atomically before doorbell |
| src/core/dev/hw_queue_tx.h | 4/5 | Adds wqebbs field to sq_wqe_prop and replaces index-based m_sq_wqe_prop_last_signalled with pointer-based m_last_sq_wqe_prop_to_complete |
| src/core/dev/cq_mgr_tx.cpp | 3.5/5 | Changes completion loop termination to use explicit pointer comparison; updates completion marker using wqebbs for wraparound calculation |
| tests/gtest/xlio_ultra_api/xlio_socket_send_receive_full_sq.cc | 3/5 | New test that fills SQ to capacity; hardcodes default SQ size and calls SetUp after fork which may cause race conditions |
| contrib/jenkins_tests/gtest.sh | 4/5 | Adds "special tests" configuration using CUBIC and TCP_NODELAY to stress full-SQ scenario in CI |
| tests/gtest/Makefile.am | 5/5 | Adds new test file to gtest build for full-SQ regression coverage |
| src/core/dev/ring_simple.h | 5/5 | Removes reset_inflight_zc_buffers_ctx() method wrapper as part of code consolidation |
| src/core/dev/ring_bond.h | 4/5 | Removes reset_inflight_zc_buffers_ctx() delegation to slave rings; consistent with incomplete zerocopy support |
| src/core/dev/ring.h | 4/5 | Removes unused virtual method reset_inflight_zc_buffers_ctx() from base class interface |
| src/core/proto/dst_entry.h | 4/5 | Removes thin wrapper method that delegated buffer context reset to ring layer |
| src/core/ib/mlx5/ib_mlx5.h | 0/5 | Summary of changes not provided in the PR context; requires investigation |
Confidence score: 3/5
- This PR contains a complex fix to a critical race condition in the SQ completion path; the architectural change from index-based to pointer-based tracking is sound but requires careful validation.
- Score reflects: (1) potential for uninitialized
m_last_sq_wqe_prop_to_completecausing infinite loops if initialization at line 443 of hw_queue_tx.cpp is missed in any execution path, (2) the new test has race conditions (SetUp called after fork) and hardcoded assumptions about SQ size that could break silently, (3) missing information about changes in ib_mlx5.h (marked confidence 0/5), (4) no documentation updates despite architectural changes to completion tracking, and (5) the pointer arithmetic in cq_mgr_tx.cpp line 293-295 depends on correctwqebbsvalues in all code paths. - Pay close attention to src/core/dev/cq_mgr_tx.cpp (completion loop pointer arithmetic), tests/gtest/xlio_ultra_api/xlio_socket_send_receive_full_sq.cc (fork/SetUp ordering and hardcoded values), and verify that src/core/ib/mlx5/ib_mlx5.h changes are correctly integrated.
Sequence Diagram
sequenceDiagram
participant App as Application
participant Ring as ring_simple
participant HQTX as hw_queue_tx
participant CQ as cq_mgr_tx
participant HW as Hardware SQ/CQ
Note over App,HW: Initial Setup - Connection Established
App->>Ring: xlio_socket_send(data, send_attr)
Ring->>HQTX: send_wqe(p_send_wqe, attr, tis, credits)
HQTX->>HQTX: send_to_wire(p_send_wqe, attr, request_comp, tis, credits)
HQTX->>HQTX: fill_wqe(p_send_wqe)
Note over HQTX: Calculate WQEBBs needed
HQTX->>HQTX: submit_wqe(buf, credits, wqebbs, ti, skip_comp)
Note over HQTX: Store WQE properties in linked list<br/>m_sq_wqe_idx_to_prop[index] = {buf, credits, wqebbs, ti, next}
HQTX->>HQTX: m_sq_wqe_prop_last = &m_sq_wqe_idx_to_prop[index]
HQTX->>HQTX: ring_doorbell(wqebbs, skip_comp)
Note over HQTX: Update WQE counter<br/>m_sq_wqe_counter += wqebbs
HQTX->>HW: Ring doorbell (BF write)
Note over App,HW: Continue sending to fill SQ
loop Fill SQ with multiple WQEs
App->>Ring: xlio_socket_send(data, send_attr)
Ring->>HQTX: send_wqe(...)
HQTX->>HQTX: submit_wqe(...)
Note over HQTX: Link current WQE to previous<br/>prop.next = m_sq_wqe_prop_last
HQTX->>HW: Ring doorbell
end
Note over App,HW: Poll for Completions
App->>Ring: xlio_poll_group_poll(group)
Ring->>CQ: poll_and_process_element_tx(&poll_sn)
CQ->>CQ: get_cqe_tx(num_polled_cqes)
CQ->>HW: Read CQE from completion queue
HW-->>CQ: CQE with wqe_counter
alt CQE has error
CQ->>CQ: log_cqe_error(cqe, index, credits)
Note over CQ: Log WQE hex dump and error info
CQ->>CQ: Mark buffer with HAD_CQE_ERROR flag
end
CQ->>CQ: handle_sq_wqe_prop(index)
Note over CQ: Process linked list of completed WQEs
loop Until reaching m_last_sq_wqe_prop_to_complete
CQ->>CQ: p = &m_sq_wqe_idx_to_prop[index]
alt Buffer exists
CQ->>Ring: mem_buf_desc_return_single_locked(p->buf)
end
alt TI exists
CQ->>CQ: ti->m_callback(ti->m_callback_arg)
CQ->>CQ: ti->put()
alt TI released and ref==0
CQ->>CQ: ti->ti_released()
end
end
CQ->>CQ: credits_return(p->credits)
CQ->>CQ: p = p->next
end
CQ->>CQ: m_last_sq_wqe_prop_to_complete = next WQE prop
CQ->>Ring: return_tx_pool_to_global_pool()
CQ->>HQTX: credits_return(credits)
Note over HQTX: m_sq_free_credits += credits
CQ->>CQ: update_global_sn_tx(poll_sn, num_polled_cqes)
Note over App,HW: Credits returned, can send more
11 files reviewed, 4 comments
|
@AlexanderGrissik , please resolve conversations. |
We tracked the index of last cmpleted WQE, then on each CQE we iterate from the CQE index back until we reach last completed WQE. However, since this WQE is considered free more than one WQE can be written in its space especially when we fill SQ to maximum. Consider the following state: | WQE3 | last_signalled (4 wqebb) | WQE1 | WQE2 | Now since last_signalled space is considered free we may add more WQE, for instance, | WQE3 | WQE4 (2wqebb, last_signalled_ptr) | WQE5 (1 wqebb) | WQE6 (1 wqebb) | WQE1 | WQE2 | Now if we get completion for WQE6, we would complete untill last_signalled (WQE6 and WQE5), leaking the completions for WQE1,2,3,4. This results in credits and resource leakage. To resolve this issue, we replace the last_signalled with last_to_complete and keep the size of wqe in wqebbs per each sq-prop. last_to_complete always points to the left most uncompleted wqe. Given the example above, last_to_complete will point to WQE1. After completion loop last_to_complete will point to the next WQE index to be completed being it already psted or not. Given | WQE1 (last_to_complete) | WQE2 | WQE3 |, on CQE for WQE3, we complete WQE3,WQE2,WQE1 and move last_to_complete to WQE4. We can do this since we know the size of each WQE in wqebbs. Removing unused reset_inflight_zc_buffers_ctx. Signed-off-by: Alexander Grissik <[email protected]>
c6ddd8e
4f7ab36 to
c6ddd8e
Compare
Description
Fixing SQ corruption on full SQ completion
See commit messages for full details
What
Why ?
SQ corruption
How ?
See commit message.
Change type
What kind of change does this PR introduce?
Check list