Skip to content

verbs: Allow query only device support for QP data in order #1606

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dkkranz
Copy link
Contributor

@dkkranz dkkranz commented May 4, 2025

EFA can support 128 bytes blocks write in-order for some Grace platforms. Move the check on x86 architecture to the mlx5 provider implementation.

Reviewed-by: Michael Margolin [email protected]
Reviewed-by: Yonatan Nachum [email protected]

@dkkranz dkkranz force-pushed the query_data_in_order_fix branch from 70679e1 to b4b0368 Compare May 5, 2025 07:46
@jgunthorpe
Copy link
Member

I don't think this is right, it would break things like PPC. The platform check is supposed to be in the common code, and the driver check is only about how the device works on PCI.

You should probably just add ARM to the inclusion list, though I imagine there are some wonky ARMs that don't work, they may not matter in practice.

@dkkranz dkkranz force-pushed the query_data_in_order_fix branch 2 times, most recently from 3e995d5 to fc236b7 Compare May 7, 2025 11:56
@dkkranz
Copy link
Contributor Author

dkkranz commented May 7, 2025

Applied your suggestion, thanks

@dkkranz
Copy link
Contributor Author

dkkranz commented May 20, 2025

@jgunthorpe Kind reminder

@jgunthorpe
Copy link
Member

I don't know.. I just feel uneasy about this. ARM is such a wide and varied architecture I don't know if we can really say it is actually reliably in order. Grace might be just because it has 128 byte lines, but that doesn't mean a different ARM chip will be. This could break some of the MPIs running on ARM supercomputers.. Do you think you could limit it just to Grace? Did someone from NVIDIA confirm that this is actually true on Grace?

@jgunthorpe
Copy link
Member

I consulted with people here and unfortunately I don't think using ARM platform as a condition is sufficient. Even looking at NVIDIA CPUs this only works on certain CPUs, with certain configurations.

I guess EFA could be special since it really only exists in AWS instances and you guys will do the validation. I'd still strongly recommend that you query your device FW to see if it is supported as I'm not sure you will find all future systems will work this way.

So maybe the right thing is to add a new op query_qp_data_in_order_force() or something like that that skips the architecture check in the core code. I don't really want to have drivers doing architecture checks..

@dkkranz dkkranz force-pushed the query_data_in_order_fix branch 2 times, most recently from 4ced708 to a7f0291 Compare May 28, 2025 07:46
@dkkranz
Copy link
Contributor Author

dkkranz commented May 28, 2025

@jgunthorpe Please review the last revision that adds a flag to skip the architecture check.

@jgunthorpe
Copy link
Member

I didn't mean a flag to userspace, there is nothing userspace can do with this. I ment some kind of flag from EFA to the core code to disable the arch checks in the core code just for EFA.

@mrgolin
Copy link
Contributor

mrgolin commented Jun 3, 2025

I don't really want to have drivers doing architecture checks..

But this is exactly what you are suggesting now, and I'm not sure why we need to have architecture checks in libibverbs either.

I think there is use for a common way to query whether data is written in order from device perspective only, for instance data polling from GPU isn't necessarily affected by the CPU architecture. Don't you agree?

@dkkranz
Copy link
Contributor Author

dkkranz commented Jun 24, 2025

@jgunthorpe Was your concern addressed by @mrgolin on his last comment? Do we have any blocker for merging this?

@jgunthorpe
Copy link
Member

No, I don't want architecture checks in drivers. EFA is being very weird and special here by saying that since EFA is only part of a fully engineered cloud system it can know that it only exists in conjunction with, otherwise architecture undefined, behavior.

I think what you want is to just make EFA do this unconditionally so EFA and only EFA would skip the architecture checks in the core code. If you ever deploy EFA on some implementation that doesn't work that way then EFA should report this to the driver through the FW.

@jgunthorpe
Copy link
Member

Maybe you could reorganize things so the driver calls out some kind of core helper to test if the CPU/platform is OK for some properties and then just not call that on EFA.

@dkkranz dkkranz force-pushed the query_data_in_order_fix branch from a7f0291 to 2499c96 Compare June 25, 2025 14:11
@dkkranz dkkranz changed the title verbs: Query QP data in order on non x86 platforms efa: Expose 128 bytes data polling device capability Jun 25, 2025
@dkkranz
Copy link
Contributor Author

dkkranz commented Jun 25, 2025

The EFA capability will be exposed via direct EFA device attribute, without changing the general API.
Please review the changes.

@jgunthorpe
Copy link
Member

Honestly I like this less, we try to reserve DV to be only for cases that are not general. Re-exposing the capability through DV to avoid refactoring the core code does not seem like good justification. Did you have a another reason to use DV?

@mrgolin
Copy link
Contributor

mrgolin commented Jul 4, 2025

We might not explained the need well, I think what we tried to do is to expose a bit different capability than what ibv_query_qp_data_in_order, that is limited to CPU perspective, currently gives. What we want is something to tell NCCL whether it can use LL128 on GPU regardless of what CPU can assume on data ordering. This isn't specific to EFA and I guess other providers will need it sooner or later, and that's why @dkkranz previously suggested adding a flag to the common verb. I wouldn't want to change ibv_query_qp_data_in_order meaning only for EFA.

@jgunthorpe
Copy link
Member

The ability to do last data polling is largely a property of the receiving unit. Yes the verb is intended to cover CPU memory only.

If you want to expose a property it should be if the RDMA device issues PCIe TLPs strictly in order and strictly with non-relaxed ordering. ie it is a PCIe behavior property only. Then you could cross that with some GPU property if it can do last data polling with such TLPs. That could be a general purpose verb.

@mrgolin
Copy link
Contributor

mrgolin commented Jul 5, 2025

If you want to expose a property it should be if the RDMA device issues PCIe TLPs strictly in order and strictly with non-relaxed ordering.

Yes, this is something RDMA device can tell and it's what I would like to expose as a general purpose verb (or an extension of existing one).

Then you could cross that with some GPU property if it can do last data polling with such TLPs.

I think we better leave this part to upper layers that can be aware of GPU behavior or system topology.

@jgunthorpe
Copy link
Member

If this is what you want to do I think a new general verb flag would be fine - report the PCI device behavior only, and explain in the manual this is only half the problem and any software using this must also consider the behavior of the receiving device.

@dkkranz dkkranz force-pushed the query_data_in_order_fix branch from 2499c96 to e00c6c3 Compare July 8, 2025 14:47
@dkkranz
Copy link
Contributor Author

dkkranz commented Jul 8, 2025

Done.

@@ -695,23 +695,26 @@ LATEST_SYMVER_FUNC(ibv_query_qp, 1_1, "IBVERBS_1.1",
int ibv_query_qp_data_in_order(struct ibv_qp *qp, enum ibv_wr_opcode op,
uint32_t flags)
{
int result;

if (!check_comp_mask(flags, IBV_QUERY_QP_DATA_IN_ORDER_RETURN_CAPS))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function sure has become weird, flags is less of a flags and more of a selector now..

Still, flags has to be validated.. It can only be 0, IBV_QUERY_QP_DATA_IN_ORDER_RETURN_CAPS or IBV_QUERY_QP_DATA_IN_ORDER_DEVICE_ONLY, everything else is a failure and returns 0.

@jgunthorpe
Copy link
Member

The commit to have EFA implement this is missing??

Add a flag to query directly the RDMA device support for QP
data-in-order semantics without enforcing host CPU architecture
restrictions. It is particularly useful in scenarios where the GPU
performs data polling directly, with the application responsible for
ensuring the GPU side support for data-in-order semantics.

Reviewed-by: Michael Margolin <[email protected]>
Reviewed-by: Yonatan Nachum <[email protected]>
Signed-off-by: Daniel Kranzdorf <[email protected]>
@dkkranz dkkranz force-pushed the query_data_in_order_fix branch from e00c6c3 to dc3b25f Compare July 13, 2025 08:47
@dkkranz
Copy link
Contributor Author

dkkranz commented Jul 13, 2025

I fixed the flags check, thanks.
There is no change required in the EFA implementation. The API was only blocked in the common verb check on the host architecture.

@dkkranz dkkranz changed the title efa: Expose 128 bytes data polling device capability verbs: Allow query only device support for QP data in order Jul 13, 2025
@jgunthorpe
Copy link
Member

Something is wrong here, you can't just add new API with new driver semantics without changing the existing drivers, somehow. What is the story?

@dkkranz
Copy link
Contributor Author

dkkranz commented Jul 22, 2025

This change is backwards compatible. API calls to ibv_query_qp_data_in_order without the new IBV_QUERY_QP_DATA_IN_ORDER_DEVICE_ONLY flag will behave the same way as before. For NCCL to use this API to determine support for the LL128 protocol on arm platforms, its EFA plugin will have to pass the new flag.
I don't understand your concern.

@dkkranz
Copy link
Contributor Author

dkkranz commented Jul 22, 2025

API calls with the new flag on arm platforms will bypass the arch restriction and will call the internal EFA verb, which already returns directly the device support for data-in-order write, without any host arch restrictions. Thus, there is no necessary change in the EFA verb.

@mrgolin
Copy link
Contributor

mrgolin commented Jul 23, 2025

Providers' implementations of query_qp_data_in_order op are expected to return device level only support (network, device internals and PCIe TLPs ordering), what is exactly what ibv_query_qp_data_in_order with the new IBV_QUERY_QP_DATA_IN_ORDER_DEVICE_ONLY flag promises. Am I missing something?

@mrgolin
Copy link
Contributor

mrgolin commented Aug 10, 2025

Providers' implementations of query_qp_data_in_order op are expected to return device level only support (network, device internals and PCIe TLPs ordering), what is exactly what ibv_query_qp_data_in_order with the new IBV_QUERY_QP_DATA_IN_ORDER_DEVICE_ONLY flag promises. Am I missing something?

@jgunthorpe?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants