Skip to content

Conversation

@cliffburdick
Copy link
Contributor

@cliffburdick cliffburdick commented May 1, 2025

Adds RDMA support to ANO

Initial support is RC mode only with support for both client and server modes, multiple queues, and multiple threads. API is similar to existing ANO backends. See design document for more details.

Signed-off-by: Cliff Burdick <[email protected]>
Signed-off-by: Cliff Burdick <[email protected]>
Signed-off-by: Cliff Burdick <[email protected]>
Signed-off-by: Cliff Burdick <[email protected]>
@cliffburdick cliffburdick marked this pull request as ready for review May 9, 2025 02:01
- name: data1
rdma_mode: client
rdma_transport_mode: RC
address: 192.168.11.2 # The address to use, or leave blank for auto-detect
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave blank for auto-detect

  1. RDMA only?
  2. For client only, or server as well?
  3. How does that work?
    2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the client it acts like any other socket connection where if the client doesn't specify a source IP then the routing tables dictate which interface is used. By putting the IP here we are telling it specifically which we want.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Update docs to indicate that address supports IP but only for RDMA, and autodetect only works or client.
  2. Can we check the failure when passing an IP with non-rdma backends? make sure error message is clear
  3. Check failure when passing NIC to RDMA backend, make sure error message is clear

Comment on lines 48 to 53
if (send_.get()) {
send_mr_name_ = server_.get() ? "DATA_TX_CPU_SERVER" : "DATA_TX_CPU_CLIENT";
}
if (receive_.get()) {
receive_mr_name_ = server_.get() ? "DATA_RX_CPU_SERVER" : "DATA_RX_CPU_CLIENT";
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaking, we've never had to interface directly with the memory regions in the operators when using the other backends. I see it's passed to rdma_set_header. Can ANO not infer the adequate memory region to use looking at the ANO config and the other inputs (port, queue)?

Any change of the memory regions in the yaml config file would require updates in the operator implementation as well, which defeats the purpose of a config file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other backends referenced the memory region with the port and queue number, and that was tied 1:1 with the memory region. With RDMA you have a little more flexibility in that you can use the same memory region for many different ports and queues if you want. We could allow them to put a port/queue pair, but they would still have to edit the code. Another option is I could just add it to the client and server application config...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per discussion, switch to port and queue for now for consistency with other backends + ensuring there is no conflicting "binding" between what is in the config and what is written in the app code. Will revisit design for Tx with dynamic memory regions with C++ interface in the future.

Copy link
Contributor

@agirault agirault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cliffburdick.

Please update docs, tests, and CHANGELOG.md 🙏

* @param conn_id Connection ID
* @param server True if server, false if client
*/
Status get_rx_burst(BurstParams** burst, uintptr_t conn_id, bool server);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_rdma_burst ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it use an overload instead of a different name to keep the API the same. I'm not too convinced either way is better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd say for anyone not super familiar with rdma vs other, seeing the signature only won't make it clear it's for rdma only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection ID only applies to RDMA. I could change it to be a more specific type. Specifically with the RX and TX functions I really wanted to avoid changing the signature otherwise it's very different from the other backends.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have docs clarify how connection id and server are used?

cliffburdick and others added 6 commits June 9, 2025 10:46
…_rx.yaml

Co-authored-by: Alexis Girault <[email protected]>
Signed-off-by: Cliff Burdick <[email protected]>
Co-authored-by: Alexis Girault <[email protected]>
Signed-off-by: Cliff Burdick <[email protected]>
Co-authored-by: Alexis Girault <[email protected]>
Signed-off-by: Cliff Burdick <[email protected]>
Co-authored-by: Alexis Girault <[email protected]>
Signed-off-by: Cliff Burdick <[email protected]>
Co-authored-by: Alexis Girault <[email protected]>
Signed-off-by: Cliff Burdick <[email protected]>
@bhashemian bhashemian added this to the Networking milestone Sep 9, 2025
@bhashemian bhashemian moved this to In Progress in Holohub Sep 9, 2025
@agirault agirault moved this from In Progress to On Hold in Holohub Sep 10, 2025
@bhashemian
Copy link
Member

Hi @cliffburdick, could you please resolve the conflicts for this PR and update it with the latest changes on main branch? Thanks

@cliffburdick
Copy link
Contributor Author

@bhashemian I can do that, but since I'm really the only person working on the ANO I think we need to merge these much faster after I've tested them. It's a large amount of effort to rebase these months after they haven't been merged.

@bhashemian
Copy link
Member

@bhashemian I can do that, but since I'm really the only person working on the ANO I think we need to merge these much faster after I've tested them. It's a large amount of effort to rebase these months after they haven't been merged.

That’s fair, @cliffburdick! Thanks for your feedback. We’re working on streamlining the reviewing process to expedite the merging of PRs.

@bhashemian
Copy link
Member

@cliffburdick could you please let me know when are you planning to update this PR? I just want to make sure that we can merge it as soon as possible. Thanks

@cliffburdick
Copy link
Contributor Author

@cliffburdick could you please let me know when are you planning to update this PR? I just want to make sure that we can merge it as soon as possible. Thanks

Hi Bruce, the PR needs a number of items addressed outside of rebasing. These are captured in some comments above and on slack. I plan to get to it next week since I don't have a system configured to test this on at the moment.

@bhashemian
Copy link
Member

@cliffburdick could you please let me know when are you planning to update this PR? I just want to make sure that we can merge it as soon as possible. Thanks

Hi Bruce, the PR needs a number of items addressed outside of rebasing. These are captured in some comments above and on slack. I plan to get to it next week since I don't have a system configured to test this on at the moment.

@cliffburdick that sounds great! Just ping me when this is ready. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: On Hold

Development

Successfully merging this pull request may close these issues.

3 participants