RDMA support in ANO #819

cliffburdick · 2025-05-01T19:44:37Z

Adds RDMA support to ANO

Initial support is RC mode only with support for both client and server modes, multiple queues, and multiple threads. API is similar to existing ANO backends. See design document for more details.

Signed-off-by: Cliff Burdick <[email protected]>



agirault · 2025-05-23T14:44:01Z

applications/adv_networking_bench/adv_networking_bench_rdma_tx_rx.yaml

+    - name: data1
+      rdma_mode: client
+      rdma_transport_mode: RC
+      address: 192.168.11.2       # The address to use, or leave blank for auto-detect


leave blank for auto-detect

RDMA only?

For client only, or server as well?

How does that work?

For the client it acts like any other socket connection where if the client doesn't specify a source IP then the routing tables dictate which interface is used. By putting the IP here we are telling it specifically which we want.

Update the docs to indicate that `address` accepts an IP, but only for RDMA, and that auto-detect only works for the client.

Can we check the failure when passing an IP to non-RDMA backends? Make sure the error message is clear.

Also check the failure when passing a NIC to the RDMA backend, and make sure that error message is clear too.
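One way the two checks above could look is sketched below. Everything here is hypothetical (`Backend`, `check_address`, and the message wording are not from the PR), and it assumes that non-RDMA backends identify the NIC by something other than an IPv4 address, such as a PCIe address.

```cpp
#include <arpa/inet.h>

#include <optional>
#include <string>

enum class Backend { kRdma, kOther };  // hypothetical, for illustration only

bool is_ipv4(const std::string& s) {
  in_addr tmp{};
  return inet_pton(AF_INET, s.c_str(), &tmp) == 1;
}

// Returns a human-readable error if `address` does not fit the backend,
// or std::nullopt when the combination is acceptable.
std::optional<std::string> check_address(Backend backend, const std::string& address,
                                         bool is_client) {
  if (backend == Backend::kRdma) {
    if (address.empty()) {
      if (is_client) { return std::nullopt; }  // auto-detect is client-only
      return "RDMA server requires an explicit IP in 'address'";
    }
    if (!is_ipv4(address)) {
      return "RDMA backend expects an IP in 'address', got '" + address + "'";
    }
    return std::nullopt;
  }
  if (is_ipv4(address)) {
    return "'address' only accepts an IP with the RDMA backend, got '" + address + "'";
  }
  return std::nullopt;
}
```

Returning the message rather than logging it keeps the check easy to unit-test against both failure cases.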


agirault · 2025-05-23T15:02:43Z

applications/adv_networking_bench/cpp/rdma_bench.h

+    if (send_.get()) {
+      send_mr_name_ = server_.get() ? "DATA_TX_CPU_SERVER" : "DATA_TX_CPU_CLIENT";
+    }
+    if (receive_.get()) {
+      receive_mr_name_ = server_.get() ? "DATA_RX_CPU_SERVER" : "DATA_RX_CPU_CLIENT";
+    }


If I'm not mistaken, we've never had to interface directly with the memory regions in the operators when using the other backends. I see it's passed to `rdma_set_header`. Can ANO not infer the right memory region from the ANO config and the other inputs (port, queue)?

Any change of the memory regions in the yaml config file would require updates in the operator implementation as well, which defeats the purpose of a config file.

Other backends referenced the memory region with the port and queue number, and that was tied 1:1 with the memory region. With RDMA you have a little more flexibility in that you can use the same memory region for many different ports and queues if you want. We could allow them to put a port/queue pair, but they would still have to edit the code. Another option is I could just add it to the client and server application config...

Per discussion: switch to port and queue for now, both for consistency with the other backends and to ensure there is no conflicting "binding" between what is in the config and what is written in the app code. We will revisit the design for Tx with dynamic memory regions through the C++ interface in the future.
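As a rough illustration of the port/queue approach, the sketch below resolves a memory region name from a (port, queue) pair so that operator code never hard-codes the name. `MrLookup` is hypothetical and not part of ANO; it only shows that several port/queue pairs can still share one region while the binding lives in one place (populated from the config).

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>

using PortQueue = std::pair<int, int>;

// Hypothetical lookup table: filled once from the YAML config, queried by
// operators using only the port and queue they already know about.
class MrLookup {
 public:
  void bind(int port, int queue, std::string mr_name) {
    table_[PortQueue{port, queue}] = std::move(mr_name);
  }

  std::optional<std::string> find(int port, int queue) const {
    auto it = table_.find(PortQueue{port, queue});
    if (it == table_.end()) { return std::nullopt; }
    return it->second;
  }

 private:
  std::map<PortQueue, std::string> table_;
};
```

With this shape, renaming a region in the config does not require touching the operator, which was the concern raised above.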


agirault

Thanks @cliffburdick.

Please update docs, tests, and CHANGELOG.md 🙏


agirault · 2025-05-23T16:02:34Z

operators/advanced_network/advanced_network/common.h

+ * @param conn_id Connection ID
+ * @param server True if server, false if client
+ */
+Status get_rx_burst(BurstParams** burst, uintptr_t conn_id, bool server);


get_rdma_burst ?

I had it use an overload instead of a different name to keep the API the same. I'm not too convinced either way is better.

I'd say that for anyone not very familiar with RDMA versus the other backends, the signature alone won't make it clear that this overload is RDMA-only.

The connection ID only applies to RDMA. I could change it to be a more specific type. Specifically with the RX and TX functions I really wanted to avoid changing the signature otherwise it's very different from the other backends.

Can we have the docs clarify how the connection ID and `server` are used?
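The trade-off between the two naming options can be sketched with stand-in types. The `(int port, int q)` overload below is an assumption about the existing non-RDMA signature, `get_rdma_rx_burst` is the hypothetical name suggested in review, and `BurstParams`, `Status`, and `last_path` are stubs used only to show which function gets selected.

```cpp
#include <cstdint>

struct BurstParams {};                       // stand-in for the real type
enum class Status { SUCCESS, NOT_READY };    // stand-in for the real enum

enum class Path { kNone, kPortQueue, kRdma };
static Path last_path = Path::kNone;         // records which function ran

// Assumed shape of the existing port/queue API.
Status get_rx_burst(BurstParams** burst, int port, int q) {
  (void)port; (void)q;
  *burst = nullptr;
  last_path = Path::kPortQueue;
  return Status::NOT_READY;
}

// RDMA overload from the diff above: keyed by connection ID and role.
Status get_rx_burst(BurstParams** burst, std::uintptr_t conn_id, bool server) {
  (void)conn_id; (void)server;
  *burst = nullptr;
  last_path = Path::kRdma;
  return Status::NOT_READY;
}

// Hypothetical explicitly named wrapper: same behavior, clearer intent.
Status get_rdma_rx_burst(BurstParams** burst, std::uintptr_t conn_id, bool server) {
  return get_rx_burst(burst, conn_id, server);
}
```

Overload resolution picks correctly when argument types match exactly, but a mixed call such as a `std::uintptr_t` connection ID followed by an `int` literal for `server` is ambiguous between the two overloads and fails to compile, which is one argument for a distinct name or a dedicated connection ID type.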


…_rx.yaml Co-authored-by: Alexis Girault <[email protected]> Signed-off-by: Cliff Burdick <[email protected]>

Co-authored-by: Alexis Girault <[email protected]> Signed-off-by: Cliff Burdick <[email protected]>


bhashemian · 2025-10-20T15:06:10Z

Hi @cliffburdick, could you please resolve the conflicts for this PR and update it with the latest changes on the main branch? Thanks

cliffburdick · 2025-10-20T15:15:57Z

@bhashemian I can do that, but since I'm really the only person working on the ANO, I think we need to merge these much faster once I've tested them. It's a large amount of effort to rebase these after they have sat unmerged for months.

bhashemian · 2025-10-20T15:27:04Z


That’s fair, @cliffburdick! Thanks for your feedback. We’re working on streamlining the reviewing process to expedite the merging of PRs.

bhashemian · 2025-10-28T15:02:27Z

@cliffburdick could you please let me know when you are planning to update this PR? I just want to make sure that we can merge it as soon as possible. Thanks

cliffburdick · 2025-10-28T15:23:49Z


Hi Bruce, the PR needs a number of items addressed beyond rebasing. These are captured in some comments above and on Slack. I plan to get to it next week, since I don't have a system configured to test this on at the moment.

bhashemian · 2025-10-28T15:58:48Z


@cliffburdick that sounds great! Just ping me when this is ready. Thanks

cliffburdick requested review from agirault, ronyrad and tbirdso May 8, 2025 23:00

cliffburdick added 6 commits May 8, 2025 23:00

Rebased RDMA support

1354fd6

Signed-off-by: Cliff Burdick <[email protected]>

Make common file for RDMA testing

544915c

Signed-off-by: Cliff Burdick <[email protected]>

Use single map for CM IDs

ca8e5bf

Signed-off-by: Cliff Burdick <[email protected]>

API cleanup

5eb7813

Signed-off-by: Cliff Burdick <[email protected]>

Cleaned up RDMA benchmark app. Fixed bug with lengths. Fixed shutdown…

1e4ab28

… issue Signed-off-by: Cliff Burdick <[email protected]>

Fix lint

10cc6e4

Signed-off-by: Cliff Burdick <[email protected]>

cliffburdick force-pushed the ano_rdma3 branch from 9aabbf8 to 10cc6e4 Compare May 8, 2025 23:00

cliffburdick marked this pull request as ready for review May 9, 2025 02:01

Merge branch 'main' into ano_rdma3

714d245