Skip to content

Conversation

@AlexanderGrissik
Copy link
Collaborator

Description

Fixing SQ corruption on full SQ completion
See commit messages for full details

What
  1. Resolving SQ corruption when we fill the SQ or almost fill the SQ writing new WQEs on last_signalled space and receiving completion on WQE right to last_ignalled.
  2. Unifying WQE posting process.
Why ?

SQ corruption

How ?

See commit message.

Change type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Check list

  • Code follows the style de facto guidelines of this project
  • Comments have been inserted in hard to understand places
  • Documentation has been updated (if necessary)
  • Test has been added (if possible)

1. Common wqe fill operation for four fill kinds in fill_wqe are unified at the end of fill_wqe
2. Making fill_wqe to return number of consumed wqebbs in uint8_t

Signed-off-by: Alexander Grissik <[email protected]>
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR fixes a critical Send Queue (SQ) corruption bug that occurs when the SQ becomes full or nearly full. The corruption manifested when new Work Queue Elements (WQEs) overwrote the last_signalled metadata slot and completions arrived for WQEs adjacent to last_signalled, causing incorrect completion processing. The fix introduces two main changes: (1) a new completion tracking mechanism via m_last_sq_wqe_prop_to_complete and per-WQE wqebbs fields in hw_queue_tx.h to track the oldest in-flight WQE and prevent out-of-order completions, and (2) unification of the WQE posting process by refactoring store_current_wqe_prop() into submit_wqe() and consolidating doorbell logic. The fix removes obsolete methods (reset_inflight_zc_buffers_ctx(), is_sq_wqe_prop_valid()) from ring, ring_bond, ring_simple, and dst_entry classes, centralizing zero-copy buffer management at the hardware queue layer. Additionally, the RX path in hw_queue_rx.cpp is refactored to delegate WQE posting to xlio_raw_post_recv(), mirroring the unified TX approach and ensuring consistent queue-full handling. The changes touch both TX and RX hardware queue implementations, completion queue managers, and ring abstraction layers, reflecting a structural improvement to XLIO's RDMA queue management.

Important Files Changed

Filename Score Overview
src/core/dev/hw_queue_tx.h 4/5 Introduces m_last_sq_wqe_prop_to_complete and wqebbs field for correct SQ completion tracking, refactors methods to unify WQE posting
src/core/dev/hw_queue_tx.cpp 3/5 Implements core logic for unified WQE posting and submission (details not fully visible in summary)
src/core/dev/cq_mgr_tx.cpp 4/5 Changes loop termination to check m_last_sq_wqe_prop_to_complete instead of is_sq_wqe_prop_valid(), updates completion pointer calculation using wqebbs
src/core/dev/hw_queue_rx.cpp 4/5 Refactors RX posting to delegate to xlio_raw_post_recv(), unifying doorbell and queue-full checks; adds unused ALIGN_WR_DOWN macro
src/core/proto/dst_entry.h 4/5 Removes reset_inflight_zc_buffers_ctx() public API method, consolidating zero-copy context management at ring layer
src/core/dev/ring.h 4/5 Removes virtual stub method reset_inflight_zc_buffers_ctx() from base ring class
src/core/dev/ring_simple.h 3/5 Removes reset_inflight_zc_buffers_ctx() override; safe if cleanup moved elsewhere but potential resource leak risk if not
src/core/dev/ring_bond.h 3/5 Removes delegation method reset_inflight_zc_buffers_ctx(); safe if callers updated or functionality absorbed
src/core/dev/hw_queue_rx.h 4/5 Contains pre-existing documentation error stating class manages SQ operations when it should say RQ operations

Confidence score: 4/5

  • This PR is safe to merge with moderate risk; the changes are well-structured and address a real SQ corruption bug.
  • Score reflects that the core TX completion logic is sound and well-commented, but full verification of hw_queue_tx.cpp was not possible due to incomplete visibility, and removal of reset_inflight_zc_buffers_ctx() across multiple classes could cause resource leaks if zero-copy cleanup was not properly migrated to the new unified mechanism.
  • Pay close attention to src/core/dev/hw_queue_tx.cpp (core WQE posting implementation), src/core/dev/ring_simple.h (verify zero-copy buffer cleanup is handled), and src/core/dev/cq_mgr_tx.cpp (completion loop termination logic with new pointer arithmetic).

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant Ring as ring_simple
    participant HQTX as hw_queue_tx
    participant CQTX as cq_mgr_tx
    participant HW as Hardware SQ/CQ

    Note over App,HW: Send Operation Flow
    App->>Ring: send_lwip_buffer(send_wqe)
    Ring->>HQTX: send_wqe(send_wqe, credits)
    HQTX->>HQTX: fill_wqe(send_wqe)
    Note over HQTX: Calculate wqebbs for WQE
    
    HQTX->>HQTX: submit_wqe(buf, credits, wqebbs, ti)
    Note over HQTX: Store sq_wqe_prop with:<br/>buf, credits, wqebbs, ti, next ptr
    HQTX->>HQTX: m_sq_wqe_prop_last = &m_sq_wqe_idx_to_prop[index]
    Note over HQTX: Build linked list of WQE properties
    
    HQTX->>HQTX: ring_doorbell(wqebbs)
    Note over HQTX: Check if completion needed<br/>Set/decrement unsignaled count
    HQTX->>HW: Post WQE to Send Queue
    
    alt Completion Signal Requested
        HQTX->>CQTX: poll_and_process_element_tx()
    end

    Note over App,HW: Completion Handling Flow (Fix)
    HW->>CQTX: Completion Event (CQE)
    CQTX->>CQTX: poll_and_process_element_tx()
    CQTX->>CQTX: get_cqe_tx(num_polled_cqes)
    Note over CQTX: Get CQE with wqe_counter
    
    alt Error CQE Detected
        CQTX->>CQTX: log_cqe_error(cqe, index, credits)
        Note over CQTX: Log error with WQE hexdump
    end
    
    CQTX->>CQTX: handle_sq_wqe_prop(index)
    Note over CQTX: Traverse linked list of WQE properties
    
    loop Until m_last_sq_wqe_prop_to_complete
        CQTX->>Ring: mem_buf_desc_return_single_locked(buf)
        Note over CQTX: Return buffer if present
        
        alt TI Present
            CQTX->>CQTX: ti->m_callback()
            Note over CQTX: Execute TI callback
            CQTX->>CQTX: ti->put()
            alt TI Released
                CQTX->>CQTX: ti->ti_released()
            end
        end
        
        CQTX->>CQTX: credits += p->credits
        Note over CQTX: Accumulate credits for all WQEs
        CQTX->>CQTX: p = p->next
    end
    
    CQTX->>Ring: return_tx_pool_to_global_pool()
    CQTX->>HQTX: credits_return(credits)
    Note over HQTX: m_sq_free_credits += credits
    
    CQTX->>CQTX: m_last_sq_wqe_prop_to_complete = next_prop
    Note over CQTX: Update completion pointer using wqebbs
    
    CQTX->>CQTX: update_global_sn_tx()
Loading

Additional Comments (1)

  1. src/core/dev/hw_queue_rx.h, line 21 (link)

    syntax: comment says "manages the SQ operations" but this class is hw_queue_rx which manages RQ operations

9 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Collaborator

@tomerdbz tomerdbz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work! really enjoyed reading the solution @AlexanderGrissik

Some comments though:

  1. look at the diff - it contains weirdly hq_queue_rx - I think you accidentally added chmod u+x to them. please fix this as this is not related and probably not intended.
  2. Please add at least 1 test to gtest that triggers the issue (fails w/o your fix, passes with it) - this is such a gentle logic - we must make sure it's future-proof.

Unifying duplicade code of calls to store_sq_prop, ring_doorbell and update_next_wqe_hot under submit_wqe for all different WQE posting methods.

Signed-off-by: Alexander Grissik <[email protected]>
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR fixes a critical SQ (Send Queue) corruption bug that occurs when the queue becomes full or nearly full. The issue arose when new WQEs (Work Queue Elements) were written into memory space occupied by the last signalled WQE, and when completions arrived for WQEs adjacent to the last signalled position. The fix introduces a unified WQE posting process with explicit tracking of WQE sizes in WQEBBs (Work Queue Element Basic Blocks) separate from credit accounting, replaces index-based completion tracking with a pointer-based linked-list approach, and removes fragile validation logic. The changes span the hardware queue layer (hw_queue_tx), the completion manager (cq_mgr_tx), the ring abstraction hierarchy (ring, ring_simple, ring_bond), and destination entry logic (dst_entry). A new constant XLIO_NOP_WQEBBS is added to support NOP WQE sizing. A comprehensive regression test is included that reproduces the wraparound scenario by filling the entire SQ with 16384 2-WQEBB WQEs after an initial burst, validating that completions are correctly processed without corruption.

Important Files Changed

Filename Score Overview
src/core/dev/hw_queue_tx.h 4/5 Adds wqebbs field to sq_wqe_prop, introduces m_last_sq_wqe_prop_to_complete pointer with detailed documentation, removes m_sq_wqe_prop_last_signalled and is_sq_wqe_prop_valid(), and updates method signatures for unified WQE posting.
src/core/dev/hw_queue_tx.cpp 0/5 Implementation file for the hw_queue_tx changes; details not provided in the diff summary.
src/core/dev/cq_mgr_tx.cpp 4/5 Changes completion loop to use pointer comparison against m_last_sq_wqe_prop_to_complete and updates the pointer by advancing by wqebbs with modular arithmetic to handle wraparound correctly.
src/core/ib/mlx5/ib_mlx5.h 5/5 Adds XLIO_NOP_WQEBBS constant (1U) for NOP WQE size in WQEBBs, consistent with existing WQE size definitions.
tests/gtest/xlio_ultra_api/xlio_socket_send_receive_full_sq.cc 3/5 New regression test that reproduces the SQ corruption scenario by filling the queue with 16384 2-WQEBB WQEs; hardcodes SQ size assumptions and uses fork-based client/server pattern.
src/core/dev/ring_simple.h 3/5 Removes reset_inflight_zc_buffers_ctx() override; suggests zero-copy buffer context management has been refactored or moved.
src/core/dev/ring_bond.h 3/5 Removes reset_inflight_zc_buffers_ctx() method; part of the unified WQE posting refactor, though zerocopy support for bonded rings remains incomplete per existing TODO.
src/core/proto/dst_entry.h 4/5 Removes reset_inflight_zc_buffers_ctx() delegation method; zero-copy buffer lifecycle now managed elsewhere in the unified flow.
src/core/dev/ring.h 4/5 Removes the reset_inflight_zc_buffers_ctx() virtual method from the ring base class interface.
tests/gtest/Makefile.am 5/5 Adds new test file xlio_socket_send_receive_full_sq.cc to the gtest build.

Confidence score: 4/5

  • This PR addresses a critical bug with a well-structured fix that unifies WQE posting and completion tracking, and includes a targeted regression test.
  • Score lowered by one point due to: (1) the new test hardcodes SQ size assumptions that may not match runtime configuration, (2) removal of reset_inflight_zc_buffers_ctx() across multiple classes without visible replacement logic in the provided diffs (the implementation details in hw_queue_tx.cpp were not shown), and (3) the test uses static variables and fork() which can introduce subtle state-sharing issues.
  • Pay close attention to src/core/dev/hw_queue_tx.cpp (implementation not shown in diff), the new regression test (xlio_socket_send_receive_full_sq.cc) for hardcoded assumptions, and verify that zero-copy buffer context management is correctly handled after the removal of reset methods across ring/dst_entry classes.

Sequence Diagram

sequenceDiagram
    participant User
    participant hw_queue_tx
    participant SQ as Send Queue
    participant cq_mgr_tx
    participant Ring

    Note over hw_queue_tx,SQ: WQE Submission Flow
    User->>hw_queue_tx: send_wqe(p_send_wqe, attr, tis, credits)
    hw_queue_tx->>hw_queue_tx: fill_wqe(p_send_wqe)
    Note right of hw_queue_tx: Calculate WQEBBs needed
    hw_queue_tx->>hw_queue_tx: submit_wqe(buf, credits, wqebbs, ti)
    
    Note over hw_queue_tx,SQ: Store WQE properties
    hw_queue_tx->>hw_queue_tx: m_sq_wqe_idx_to_prop[index] = {buf, credits, wqebbs, ti, next}
    Note right of hw_queue_tx: Build linked list for completion
    hw_queue_tx->>hw_queue_tx: m_sq_wqe_prop_last = &m_sq_wqe_idx_to_prop[index]
    
    hw_queue_tx->>SQ: ring_doorbell(wqebbs)
    Note right of SQ: Post WQE to hardware
    
    Note over cq_mgr_tx,Ring: Completion Flow
    User->>cq_mgr_tx: poll_and_process_element_tx()
    cq_mgr_tx->>SQ: get_cqe_tx()
    SQ-->>cq_mgr_tx: cqe (Completion Queue Entry)
    
    alt CQE Error
        cq_mgr_tx->>cq_mgr_tx: log_cqe_error(cqe, index, credits)
        Note right of cq_mgr_tx: Log WQE details with wqe_to_hexstring
    end
    
    cq_mgr_tx->>cq_mgr_tx: handle_sq_wqe_prop(index)
    
    Note over cq_mgr_tx: Process linked list of WQEs
    loop Until last completed WQE
        cq_mgr_tx->>Ring: mem_buf_desc_return_single_locked(buf)
        Note right of cq_mgr_tx: Return buffer to pool
        
        alt Has TI (Transport Interface)
            cq_mgr_tx->>hw_queue_tx: ti->put()
            Note right of hw_queue_tx: Release TIS/TIR reference
        end
        
        cq_mgr_tx->>hw_queue_tx: credits_return(credits)
        Note right of hw_queue_tx: Free SQ credits (WQEBBs)
        
        cq_mgr_tx->>cq_mgr_tx: p = p->next
        Note right of cq_mgr_tx: Move to next WQE in chain
    end
    
    Note over cq_mgr_tx,hw_queue_tx: Update completion tracking
    cq_mgr_tx->>hw_queue_tx: m_last_sq_wqe_prop_to_complete = next_index
    Note right of hw_queue_tx: Track last completed WQE
    
    cq_mgr_tx->>Ring: return_tx_pool_to_global_pool()
Loading

10 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

@AlexanderGrissik
Copy link
Collaborator Author

Added gtest based on Ultra API for that scenario.
Removed mode change files.

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR fixes a critical SQ (Send Queue) corruption bug that occurs when XLIO fills or nearly fills the hardware send queue. The root cause was that WQEs could be written to the last_signalled space and completions would arrive for adjacent WQEs, causing index tracking to become inconsistent. The fix introduces pointer-based completion tracking (m_last_sq_wqe_prop_to_complete) instead of the previous index-based approach (m_sq_wqe_prop_last_signalled), and unifies all WQE posting through a single submit_wqe() bottleneck. The centralized submission ensures that buffer descriptors, credit accounting, and WQE size in basic blocks (wqebbs) are stored atomically before ringing the doorbell. A new gtest validates the fix by deliberately filling the SQ with messages and verifying completions don't corrupt memory, and CI now includes a "special test" run with CUBIC congestion control and TCP_NODELAY to stress the full-queue scenario.

Important Files Changed

Filename Score Overview
src/core/dev/hw_queue_tx.cpp 4/5 Implements centralized WQE posting via new submit_wqe() method; all WQE types now store properties atomically before doorbell
src/core/dev/hw_queue_tx.h 4/5 Adds wqebbs field to sq_wqe_prop and replaces index-based m_sq_wqe_prop_last_signalled with pointer-based m_last_sq_wqe_prop_to_complete
src/core/dev/cq_mgr_tx.cpp 3.5/5 Changes completion loop termination to use explicit pointer comparison; updates completion marker using wqebbs for wraparound calculation
tests/gtest/xlio_ultra_api/xlio_socket_send_receive_full_sq.cc 3/5 New test that fills SQ to capacity; hardcodes default SQ size and calls SetUp after fork which may cause race conditions
contrib/jenkins_tests/gtest.sh 4/5 Adds "special tests" configuration using CUBIC and TCP_NODELAY to stress full-SQ scenario in CI
tests/gtest/Makefile.am 5/5 Adds new test file to gtest build for full-SQ regression coverage
src/core/dev/ring_simple.h 5/5 Removes reset_inflight_zc_buffers_ctx() method wrapper as part of code consolidation
src/core/dev/ring_bond.h 4/5 Removes reset_inflight_zc_buffers_ctx() delegation to slave rings; consistent with incomplete zerocopy support
src/core/dev/ring.h 4/5 Removes unused virtual method reset_inflight_zc_buffers_ctx() from base class interface
src/core/proto/dst_entry.h 4/5 Removes thin wrapper method that delegated buffer context reset to ring layer
src/core/ib/mlx5/ib_mlx5.h 0/5 Summary of changes not provided in the PR context; requires investigation

Confidence score: 3/5

  • This PR contains a complex fix to a critical race condition in the SQ completion path; the architectural change from index-based to pointer-based tracking is sound but requires careful validation.
  • Score reflects: (1) potential for uninitialized m_last_sq_wqe_prop_to_complete causing infinite loops if initialization at line 443 of hw_queue_tx.cpp is missed in any execution path, (2) the new test has race conditions (SetUp called after fork) and hardcoded assumptions about SQ size that could break silently, (3) missing information about changes in ib_mlx5.h (marked confidence 0/5), (4) no documentation updates despite architectural changes to completion tracking, and (5) the pointer arithmetic in cq_mgr_tx.cpp line 293-295 depends on correct wqebbs values in all code paths.
  • Pay close attention to src/core/dev/cq_mgr_tx.cpp (completion loop pointer arithmetic), tests/gtest/xlio_ultra_api/xlio_socket_send_receive_full_sq.cc (fork/SetUp ordering and hardcoded values), and verify that src/core/ib/mlx5/ib_mlx5.h changes are correctly integrated.

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant Ring as ring_simple
    participant HQTX as hw_queue_tx
    participant CQ as cq_mgr_tx
    participant HW as Hardware SQ/CQ

    Note over App,HW: Initial Setup - Connection Established
    App->>Ring: xlio_socket_send(data, send_attr)
    Ring->>HQTX: send_wqe(p_send_wqe, attr, tis, credits)
    HQTX->>HQTX: send_to_wire(p_send_wqe, attr, request_comp, tis, credits)
    HQTX->>HQTX: fill_wqe(p_send_wqe)
    Note over HQTX: Calculate WQEBBs needed
    HQTX->>HQTX: submit_wqe(buf, credits, wqebbs, ti, skip_comp)
    Note over HQTX: Store WQE properties in linked list<br/>m_sq_wqe_idx_to_prop[index] = {buf, credits, wqebbs, ti, next}
    HQTX->>HQTX: m_sq_wqe_prop_last = &m_sq_wqe_idx_to_prop[index]
    HQTX->>HQTX: ring_doorbell(wqebbs, skip_comp)
    Note over HQTX: Update WQE counter<br/>m_sq_wqe_counter += wqebbs
    HQTX->>HW: Ring doorbell (BF write)

    Note over App,HW: Continue sending to fill SQ
    loop Fill SQ with multiple WQEs
        App->>Ring: xlio_socket_send(data, send_attr)
        Ring->>HQTX: send_wqe(...)
        HQTX->>HQTX: submit_wqe(...)
        Note over HQTX: Link current WQE to previous<br/>prop.next = m_sq_wqe_prop_last
        HQTX->>HW: Ring doorbell
    end

    Note over App,HW: Poll for Completions
    App->>Ring: xlio_poll_group_poll(group)
    Ring->>CQ: poll_and_process_element_tx(&poll_sn)
    CQ->>CQ: get_cqe_tx(num_polled_cqes)
    CQ->>HW: Read CQE from completion queue
    HW-->>CQ: CQE with wqe_counter
    
    alt CQE has error
        CQ->>CQ: log_cqe_error(cqe, index, credits)
        Note over CQ: Log WQE hex dump and error info
        CQ->>CQ: Mark buffer with HAD_CQE_ERROR flag
    end

    CQ->>CQ: handle_sq_wqe_prop(index)
    Note over CQ: Process linked list of completed WQEs
    loop Until reaching m_last_sq_wqe_prop_to_complete
        CQ->>CQ: p = &m_sq_wqe_idx_to_prop[index]
        alt Buffer exists
            CQ->>Ring: mem_buf_desc_return_single_locked(p->buf)
        end
        alt TI exists
            CQ->>CQ: ti->m_callback(ti->m_callback_arg)
            CQ->>CQ: ti->put()
            alt TI released and ref==0
                CQ->>CQ: ti->ti_released()
            end
        end
        CQ->>CQ: credits_return(p->credits)
        CQ->>CQ: p = p->next
    end
    CQ->>CQ: m_last_sq_wqe_prop_to_complete = next WQE prop

    CQ->>Ring: return_tx_pool_to_global_pool()
    CQ->>HQTX: credits_return(credits)
    Note over HQTX: m_sq_free_credits += credits
    CQ->>CQ: update_global_sn_tx(poll_sn, num_polled_cqes)

    Note over App,HW: Credits returned, can send more
Loading

11 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

tomerdbz
tomerdbz previously approved these changes Oct 28, 2025
BasharRadya
BasharRadya previously approved these changes Nov 2, 2025
@galnoam
Copy link
Collaborator

galnoam commented Nov 3, 2025

@AlexanderGrissik , please resolve conversations.

We tracked the index of last cmpleted WQE, then on each CQE we iterate from the CQE index back until we reach last completed WQE.
However, since this WQE is considered free more than one WQE can be written in its space especially when we fill SQ to maximum.
Consider the following state:
| WQE3 | last_signalled (4 wqebb) | WQE1 | WQE2 |
Now since last_signalled space is considered free we may add more WQE, for instance,
| WQE3 | WQE4 (2wqebb, last_signalled_ptr) | WQE5 (1 wqebb) | WQE6 (1 wqebb) | WQE1 | WQE2 |
Now if we get completion for WQE6, we would complete untill last_signalled (WQE6 and WQE5), leaking the completions for WQE1,2,3,4.
This results in credits and resource leakage.

To resolve this issue, we replace the last_signalled with last_to_complete and keep the size of wqe in wqebbs per each sq-prop.
last_to_complete always points to the left most uncompleted wqe. Given the example above, last_to_complete will point to WQE1.
After completion loop last_to_complete will point to the next WQE index to be completed being it already psted or not.
Given | WQE1 (last_to_complete) | WQE2 | WQE3 |, on CQE for WQE3, we complete WQE3,WQE2,WQE1 and move last_to_complete to WQE4.
We can do this since we know the size of each WQE in wqebbs.

Removing unused reset_inflight_zc_buffers_ctx.

Signed-off-by: Alexander Grissik <[email protected]>
@galnoam galnoam merged commit 5fd84fb into Mellanox:vNext Nov 5, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants