Skip to content

Conversation

@BasharRadya
Copy link
Collaborator

@BasharRadya BasharRadya commented Apr 8, 2025

User description

Description

Fix read flow tag from cqe

The flow_tag field in the CQE format is 24 bits, but it was handled
as if it were 32 bits. The additional 8 bits are reserved for
the rx_drop_counter.
Consequently, when there were RX drops, the flow_tag was incorrect
because the most significant 8 bits were not zero.

CQE format for flow_tag is in "Table 186 - 64B Completion Queue Entry Format Layout
"

Add CQ rx_drop_counter

The number of dropped packets because of no
RCV WQE since the last CQE
What

Subject: what this PR is doing in one line.

Why ?

Justification for the PR. If there is existing issue/bug please reference.

How ?

It is optional but for complex PRs please provide information about the design,
architecture, approach, etc.

Change type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Check list

  • Code follows the style de facto guidelines of this project
  • Comments have been inserted in hard to understand places
  • Documentation has been updated (if necessary)
  • Test has been added (if possible)

PR Type

Bug fix, Enhancement


Description

  • Corrected handling of flow_tag field in CQE to use 24 bits.

  • Added support for tracking RX drop counter in CQ statistics.

  • Updated logging and statistics printing to include RX drop counter.

  • Adjusted data structures and parsing logic to align with updated CQE format.


Changes walkthrough 📝

Relevant files
Enhancement
cq_mgr_rx.cpp
Add RX drop counter logging in statistics                               

src/core/dev/cq_mgr_rx.cpp

  • Added logging for RX drop counter in statistics print function.
  • Updated condition to include RX drop counter in statistics checks.
  • +5/-2     
    stats_reader.cpp
    Add RX drop counter to statistics calculations                     

    src/stats/stats_reader.cpp

  • Added RX drop counter to delta statistics calculation.
  • Updated statistics printing to include RX drop counter.
  • +5/-0     
    xlio_stats.h
    Add RX drop counter to statistics structure                           

    src/core/util/xlio_stats.h

    • Added n_rx_drop_counter field to CQ statistics structure.
    +1/-0     
    Bug fix
    cq_mgr_rx_regrq.cpp
    Fix flow_tag parsing and add RX drop counter                         

    src/core/dev/cq_mgr_rx_regrq.cpp

  • Corrected flow_tag parsing to use 24 bits.
  • Added RX drop counter extraction from CQE.
  • +3/-1     
    cq_mgr_rx_strq.cpp
    Fix flow_tag parsing and add RX drop counter                         

    src/core/dev/cq_mgr_rx_strq.cpp

  • Corrected flow_tag parsing to use 24 bits.
  • Added RX drop counter extraction from CQE.
  • +3/-1     
    Miscellaneous
    ib_mlx5.h
    Rename CQE field for clarity                                                         

    src/core/ib/mlx5/ib_mlx5.h

    • Renamed sop_drop_qpn to sop_rxdrop_qpn_flowtag for clarity.
    +1/-1     

    Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • @AlexanderGrissik
    Copy link
    Collaborator

    bot:retest

    @galnoam
    Copy link
    Collaborator

    galnoam commented Apr 9, 2025

    /review

    @pr-review-bot-app
    Copy link

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Possible Misinterpretation of Bit Manipulation

    The bit manipulation logic for extracting flow_tag_id and n_rx_drop_counter from sop_rxdrop_qpn_flowtag_h_byte should be carefully reviewed to ensure correctness and alignment with the CQE format specification.

    uint32_t sop_rxdrop_qpn_flowtag_h_byte = ntohl((uint32_t)(cqe->sop_rxdrop_qpn_flowtag));
    p_rx_wc_buf_desc->rx.flow_tag_id = sop_rxdrop_qpn_flowtag_h_byte & 0x00FFFFFF;
    m_p_cq_stat->n_rx_drop_counter += sop_rxdrop_qpn_flowtag_h_byte >> 24;
    Consistency in Bit Extraction Logic

    The bit extraction logic for flow_tag_id and n_rx_drop_counter in this file should be validated to ensure it matches the intended CQE format and does not introduce errors.

    uint32_t sop_rxdrop_qpn_flowtag_h_byte = ntohl((uint32_t)(cqe->sop_rxdrop_qpn_flowtag));
    _hot_buffer_stride->rx.flow_tag_id = sop_rxdrop_qpn_flowtag_h_byte & 0x00FFFFFF;
    m_p_cq_stat->n_rx_drop_counter += sop_rxdrop_qpn_flowtag_h_byte >> 24;

    @tomerdbz
    Copy link
    Collaborator

    tomerdbz commented Apr 9, 2025

    /suggest

    @tomerdbz
    Copy link
    Collaborator

    tomerdbz commented Apr 9, 2025

    /describe

    @pr-review-bot-app
    Copy link

    PR Description updated to latest commit (c562f6a)

    @tomerdbz
    Copy link
    Collaborator

    tomerdbz commented Apr 9, 2025

    /improve

    @pr-review-bot-app
    Copy link

    pr-review-bot-app bot commented Apr 9, 2025

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Prevent division by zero errors

    Add a check to ensure delay is not zero to prevent division by zero errors.

    src/stats/stats_reader.cpp [401-402]

    -p_prev_cq_stats->n_rx_drop_counter =
    -    (p_curr_cq_stats->n_rx_drop_counter - p_prev_cq_stats->n_rx_drop_counter) / delay;
    +if (delay != 0) {
    +    p_prev_cq_stats->n_rx_drop_counter =
    +        (p_curr_cq_stats->n_rx_drop_counter - p_prev_cq_stats->n_rx_drop_counter) / delay;
    +} else {
    +    // Handle zero delay case appropriately
    +}
    Suggestion importance[1-10]: 9

    __

    Why: This suggestion prevents a critical runtime error (division by zero) by adding a validation check for the delay variable. It is highly impactful as it ensures the code does not crash under certain conditions and handles edge cases appropriately.

    High
    Add overflow check for counter updates

    Ensure that the addition to m_p_cq_stat->n_rx_drop_counter does not cause an integer
    overflow by validating the value before performing the addition.

    src/core/dev/cq_mgr_rx_regrq.cpp [120-122]

     uint32_t sop_rxdrop_qpn_flowtag_h_byte = ntohl((uint32_t)(cqe->sop_rxdrop_qpn_flowtag));
     p_rx_wc_buf_desc->rx.flow_tag_id = sop_rxdrop_qpn_flowtag_h_byte & 0x00FFFFFF;
    -m_p_cq_stat->n_rx_drop_counter += sop_rxdrop_qpn_flowtag_h_byte >> 24;
    +uint32_t drop_counter_increment = sop_rxdrop_qpn_flowtag_h_byte >> 24;
    +if (UINT64_MAX - m_p_cq_stat->n_rx_drop_counter >= drop_counter_increment) {
    +    m_p_cq_stat->n_rx_drop_counter += drop_counter_increment;
    +} else {
    +    // Handle overflow case appropriately
    +}
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion addresses a potential integer overflow issue when updating m_p_cq_stat->n_rx_drop_counter, which is critical for ensuring the correctness and stability of the code. The improved code provides a robust solution by adding a validation check.

    Medium
    Prevent overflow in counter updates

    Validate the value being added to m_p_cq_stat->n_rx_drop_counter to prevent
    potential integer overflow issues.

    src/core/dev/cq_mgr_rx_strq.cpp [227-229]

     uint32_t sop_rxdrop_qpn_flowtag_h_byte = ntohl((uint32_t)(cqe->sop_rxdrop_qpn_flowtag));
     _hot_buffer_stride->rx.flow_tag_id = sop_rxdrop_qpn_flowtag_h_byte & 0x00FFFFFF;
    -m_p_cq_stat->n_rx_drop_counter += sop_rxdrop_qpn_flowtag_h_byte >> 24;
    +uint32_t drop_counter_increment = sop_rxdrop_qpn_flowtag_h_byte >> 24;
    +if (UINT64_MAX - m_p_cq_stat->n_rx_drop_counter >= drop_counter_increment) {
    +    m_p_cq_stat->n_rx_drop_counter += drop_counter_increment;
    +} else {
    +    // Handle overflow case appropriately
    +}
    Suggestion importance[1-10]: 8

    __

    Why: Similar to the previous suggestion, this one mitigates the risk of integer overflow in m_p_cq_stat->n_rx_drop_counter. It is highly relevant and improves the reliability of the code by introducing a validation mechanism.

    Medium

    @tomerdbz
    Copy link
    Collaborator

    tomerdbz commented Apr 9, 2025

    /analyze

    @galnoam galnoam added the draft Not to review yet label Apr 15, 2025
    The flow_tag field in the CQE format is 24 bits, but it was handled
    as if it were 32 bits. The additional 8 bits are reserved for
    the rx_drop_counter.
    Consequently, when there were RX drops, the flow_tag was incorrect
    because the most significant 8 bits were not zero.
    
    Signed-off-by: Bashar Abdelgafer <[email protected]>
    @BasharRadya BasharRadya force-pushed the fix_read_flow_tag_from_cqe branch from a6f3d19 to f17c65b Compare April 15, 2025 09:07
    @BasharRadya BasharRadya force-pushed the fix_read_flow_tag_from_cqe branch from f17c65b to 08a78af Compare April 15, 2025 12:50
    @BasharRadya BasharRadya requested a review from pasis April 15, 2025 13:12
    @BasharRadya BasharRadya force-pushed the fix_read_flow_tag_from_cqe branch from 08a78af to 1431c3b Compare April 15, 2025 13:12
    @BasharRadya
    Copy link
    Collaborator Author

    bot:retest

    @BasharRadya BasharRadya force-pushed the fix_read_flow_tag_from_cqe branch from 1431c3b to 003a776 Compare April 18, 2025 09:54
    @BasharRadya
    Copy link
    Collaborator Author

    bot:retest

    1 similar comment
    @AlexanderGrissik
    Copy link
    Collaborator

    bot:retest

    @BasharRadya BasharRadya force-pushed the fix_read_flow_tag_from_cqe branch 4 times, most recently from 7d11f34 to 462e5ff Compare April 22, 2025 11:24
    @BasharRadya
    Copy link
    Collaborator Author

    bot:retest

    @BasharRadya BasharRadya force-pushed the fix_read_flow_tag_from_cqe branch from 462e5ff to b39491a Compare April 28, 2025 08:54
    @BasharRadya
    Copy link
    Collaborator Author

    bot:retest

    1 similar comment
    @BasharRadya
    Copy link
    Collaborator Author

    bot:retest

    }
    }

    uint64_t cq_mgr_rx::get_n_rx_drop_counter()
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    move this definition to the header

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Fixed.
    but why? because its 1 line?

    The number of dropped packets because of no
    RCV WQE since the last CQE
    
    Signed-off-by: Bashar Abdelgafer <[email protected]>
    Added an accumlated RX out of buffer drop counter to the print report
    The accumlated result includes:
    A)RX drops counts from open rings at termination phase.
    B)RX drops from rings that were closed during runtime before
    the termination phase (e.g. when all connections belonging to
    the ring were closed).
    
    Signed-off-by: Bashar Abdelgafer <[email protected]>
    @BasharRadya BasharRadya force-pushed the fix_read_flow_tag_from_cqe branch from b39491a to 11698d6 Compare April 28, 2025 14:13
    @BasharRadya BasharRadya requested a review from pasis April 28, 2025 14:19
    In 'AUTO' mode, the report is printed only if an anomaly
    is detected (e.g., RX drops or buffer pool exhaustion).
    Set XLIO_PRINT_REPORT default to 'AUTO', and update README
    accordingly.
    
    Signed-off-by: Bashar Abdelgafer <[email protected]>
    Renamed the print label from 'Packets dropped' to 'SW RX Packets dropped'
    to make it clear that the drop counter refers to software-level RX drops.
    
    Signed-off-by: Bashar Abdelgafer <[email protected]>
    @BasharRadya BasharRadya force-pushed the fix_read_flow_tag_from_cqe branch from 11698d6 to ce9a57f Compare April 29, 2025 09:07
    Rename n_rx_drop_counter and n_rx_pkt_drop.
    
    Signed-off-by: Bashar Abdelgafer <[email protected]>
    @BasharRadya BasharRadya force-pushed the fix_read_flow_tag_from_cqe branch from ce9a57f to 2bdbc8a Compare April 29, 2025 10:22
    @BasharRadya BasharRadya requested a review from pasis April 29, 2025 10:23
    @BasharRadya
    Copy link
    Collaborator Author

    bot:retest

    3 similar comments
    @BasharRadya
    Copy link
    Collaborator Author

    bot:retest

    @pasis
    Copy link
    Member

    pasis commented Apr 30, 2025

    bot:retest

    @BasharRadya
    Copy link
    Collaborator Author

    bot:retest

    @galnoam galnoam merged commit dcb8a70 into Mellanox:vNext May 5, 2025
    1 check passed
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    5 participants