[BUG] DPDK raw RX can lose startup traffic under packet load without recovery or clear error reporting #141

@MikesRND

Summary

When a DPDK RX application is started while traffic is already arriving on the NIC, DAQIRI can report sustained rx_missed_errors, and the application observes missing packets after startup as well as malformed header/data split (HDS) packets.

This was reproduced with an Ethernet RX queue using header/data split. The issue appears to be related to startup backlog or backpressure rather than an application parser failure.

The same receiver software passed the same test when using the Holoscan advanced networking library.

Observed behavior

When starting the receiver application under TX traffic load, we observed:

  • DAQIRI initializes successfully.
  • RX workers start successfully.
  • The stats thread reports rx_missed_errors: Rx: Dropped <N> packets since last poll 500ms ago
  • The application receives traffic, but with packet gaps/partial batches after startup.
  • A small number of malformed split packets are also logged:
    Dropped malformed split RX packet ... expected 2 segment(s), found 1
    but the volume of these is tiny compared with rx_missed_errors.

In the observed run, malformed split drops were single digits, while rx_missed_errors reached tens of thousands.
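
For reference, the kind of check that would produce the "expected 2 segment(s), found 1" message can be sketched as follows. This is a minimal illustration, not DAQIRI's actual code: struct toy_seg is a hypothetical stand-in for a chained packet buffer (DPDK's rte_mbuf links segments through a next pointer in a similar way), and check_split_packet and HDS_EXPECTED_SEGS are names invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one segment of a chained packet buffer. */
struct toy_seg {
    struct toy_seg *next;   /* next segment in the chain, or NULL */
    uint16_t data_len;      /* bytes held by this segment */
};

/* With header/data split, a well-formed packet is assumed to be
 * exactly one header segment followed by one payload segment. */
#define HDS_EXPECTED_SEGS 2u

/* Walk the chain and count its segments. */
static unsigned count_segs(const struct toy_seg *head)
{
    unsigned n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/* Returns 1 if the packet has the expected segment count.
 * Otherwise bumps the dedicated malformed-split counter and returns 0,
 * so these drops are never folded into NIC-level missed packets. */
static int check_split_packet(const struct toy_seg *head, uint64_t *malformed)
{
    assert(malformed != NULL);
    if (count_segs(head) != HDS_EXPECTED_SEGS) {
        (*malformed)++;
        return 0;
    }
    return 1;
}
```

Keeping this counter separate is what allows the comparison above (single-digit malformed drops versus tens of thousands of rx_missed_errors) to be made at all.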

Also, in the DPDK RX worker paths, several rte_ring_enqueue(...) calls appear to ignore the return value. If the application-facing ring is full during startup backlog, DAQIRI may lose a burst or leak ownership without surfacing a clear error/counter.
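
A rough sketch of the handling being asked for is below. It is self-contained and does not link against DPDK: toy_ring_enqueue is a stand-in that only mirrors the rte_ring_enqueue() return contract (0 on success, -ENOBUFS when the ring is full), and rx_burst, rx_counters, release_burst and hand_off_burst are hypothetical names, since the real DAQIRI types are not shown in this issue.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy single-threaded ring standing in for the app-facing rte_ring. */
#define TOY_RING_CAP 4u

struct toy_ring {
    void *slots[TOY_RING_CAP];
    unsigned count;
};

/* Same return contract as rte_ring_enqueue(): 0 on success,
 * -ENOBUFS when there is no room. */
static int toy_ring_enqueue(struct toy_ring *r, void *obj)
{
    assert(r != NULL);
    if (r->count == TOY_RING_CAP)
        return -ENOBUFS;
    r->slots[r->count++] = obj;
    return 0;
}

/* Hypothetical burst descriptor. */
struct rx_burst {
    unsigned num_pkts;
    int released;   /* set once the worker has given the burst back */
};

/* Hypothetical counters dedicated to app-ring enqueue failures. */
struct rx_counters {
    uint64_t app_ring_full_bursts;
    uint64_t app_ring_full_pkts;
};

/* Hypothetical release hook. A real worker would free the mbufs and
 * return the burst to its pool here; the assert catches double release. */
static void release_burst(struct rx_burst *b)
{
    assert(!b->released);
    b->released = 1;
}

/* Returns 0 if the application now owns the burst. On failure the
 * worker counts the loss, reclaims the burst exactly once, and
 * returns the negative errno so the caller can log it. */
static int hand_off_burst(struct toy_ring *r, struct rx_burst *b,
                          struct rx_counters *c)
{
    int rc = toy_ring_enqueue(r, b);
    if (rc != 0) {
        c->app_ring_full_bursts++;
        c->app_ring_full_pkts += b->num_pkts;
        release_burst(b);
    }
    return rc;
}
```

The point of routing every hand-off through one helper is that ownership is unambiguous: on success the application owns the burst, on failure the worker has already reclaimed it and the loss is visible in a counter.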

Expected behavior

DAQIRI should be able to start up under traffic and make startup-backlog loss diagnosable:

  • Check and handle all RX-path rte_ring_enqueue(...) failures.
  • Expose/log a specific counter for app-ring enqueue failures.
  • Distinguish clearly between:
    • NIC missed packets / ring overflow
    • mbuf allocation failures
    • application ring full
    • malformed/incomplete HDS split packets
  • Avoid leaking or double-freeing burst ownership on enqueue failure.
  • Ideally provide guidance or knobs for live-attach startup backlog.
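
One possible shape for the per-cause reporting is sketched below. All names are hypothetical. In a DPDK build the first two totals would plausibly be read from the imissed and rx_nombuf fields of struct rte_eth_stats, while the last two would be software counters maintained by the RX workers; that mapping is an assumption, not something confirmed against the DAQIRI source.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* The four loss causes the stats output should keep apart. */
enum drop_cause {
    DROP_NIC_MISSED,       /* NIC missed packets / RX ring overflow */
    DROP_MBUF_ALLOC,       /* mbuf allocation failures */
    DROP_APP_RING_FULL,    /* application-facing ring was full */
    DROP_MALFORMED_SPLIT,  /* malformed/incomplete HDS split packets */
    DROP_CAUSE_COUNT
};

static const char *const drop_cause_name[DROP_CAUSE_COUNT] = {
    "NIC missed",
    "mbuf allocation failure",
    "application ring full",
    "malformed HDS split",
};

/* Cumulative totals per cause. */
struct drop_totals {
    uint64_t v[DROP_CAUSE_COUNT];
};

/* Prints one line for each cause whose total grew since the previous
 * poll, then stores the new totals in *prev. Returns how many causes
 * reported new drops. */
static int report_drops(const struct drop_totals *now,
                        struct drop_totals *prev,
                        unsigned interval_ms, FILE *out)
{
    assert(now != NULL && prev != NULL && out != NULL);
    int changed = 0;
    for (int i = 0; i < DROP_CAUSE_COUNT; i++) {
        uint64_t delta = now->v[i] - prev->v[i];
        if (delta != 0) {
            fprintf(out,
                    "Rx: dropped %" PRIu64 " packets (%s) since last poll %ums ago\n",
                    delta, drop_cause_name[i], interval_ms);
            changed++;
        }
        prev->v[i] = now->v[i];
    }
    return changed;
}
```

With output split this way, a run like the one described above would show the NIC-missed line dominating and the malformed-split line staying in single digits, instead of a single undifferentiated drop count.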

Environment

  • DAQIRI branch: fix-pr-137-accessors
  • DAQIRI container: framework-dev 0.1.3
  • DPDK raw Ethernet RX
  • Header/data split enabled
  • ConnectX-class NIC / mlx5
