Support cuMem API in cross process shared memory management #217
Conversation
csrc/deep_ep.cpp (Outdated)
```cpp
for (int device = 0; device < device_count; ++device) {
    int support = 0;
    CU_CHECK(cuDeviceGetAttribute(&support, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, device));
```
I think this check is not enough; see https://forums.developer.nvidia.com/t/cudevicegetattribute-shows-i-can-use-fabric-handle-but-actually-i-cannot/336426 . Even if the attribute says fabric handles are supported, we cannot actually use the allocation.

Let me know if your environment says something different.
Ah, that's weird.

> let me know if your environment says something different.

In my environment the code does work. If there is no good way to correctly detect fabric support, a workaround may be to let users pass in a bool flag saying whether they want to enable this.
> Applications that intend to use CU_MEM_HANDLE_TYPE_FABRIC based memory sharing must ensure: (1) the nvidia-caps-imex-channels character device is created by the driver and is listed under /proc/devices, and (2) they have at least one IMEX channel file accessible by the user launching the application.

I'm wondering whether that is related in your env.
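For context, those two prerequisites can be probed roughly as in the sketch below. This is only an illustration; the channel path `/dev/nvidia-caps-imex-channels/channel0` is an assumed default, not something specified in this thread, and the helper name is hypothetical.

```cpp
#include <fstream>
#include <string>
#include <unistd.h>

// Rough check of the two IMEX prerequisites quoted above.
bool imex_prerequisites_met() {
    // (1) the nvidia-caps-imex-channels character device is listed under /proc/devices
    std::ifstream devices("/proc/devices");
    std::string line;
    bool listed = false;
    while (std::getline(devices, line)) {
        if (line.find("nvidia-caps-imex-channels") != std::string::npos) {
            listed = true;
            break;
        }
    }
    if (!listed)
        return false;

    // (2) at least one IMEX channel file is accessible by the current user
    // (channel0 is an assumed default name; adjust for your setup)
    return access("/dev/nvidia-caps-imex-channels/channel0", R_OK | W_OK) == 0;
}
```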
> In my environment the code does work.

What does this mean? What do you get when running the code?

I think you can run it on a single-node H100 with CUDA 12.5+, without nvidia-caps-imex-channels set up, and see whether cuDeviceGetAttribute tells you the fabric handle is supported.

I added fabric handle support in PyTorch just now in pytorch/pytorch#156074; I use an actual cuMem call to see if the allocation is successful.
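For reference, that kind of probe looks roughly like the sketch below: instead of trusting the device attribute alone, attempt a real fabric allocation and export. The helper name and error handling are mine, not code from this PR or from the PyTorch patch, and it assumes cuInit has been called and a context is current for the device.

```cpp
#include <cuda.h>

// Probe fabric-handle usability by attempting an actual allocation and export,
// rather than trusting CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED alone.
bool fabric_allocation_works(int device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;

    size_t granularity = 0;
    if (cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM) != CUDA_SUCCESS)
        return false;

    // The allocation (or the export) is what actually fails when IMEX channels
    // are not set up, even though the device attribute reports support.
    CUmemGenericAllocationHandle handle;
    if (cuMemCreate(&handle, granularity, &prop, 0) != CUDA_SUCCESS)
        return false;

    CUmemFabricHandle fabric_handle;
    bool ok = cuMemExportToShareableHandle(&fabric_handle, handle, CU_MEM_HANDLE_TYPE_FABRIC, 0) == CUDA_SUCCESS;
    cuMemRelease(handle);
    return ok;
}
```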
> what does this mean? what do you get running the code?

I do not run on H100 but on some other devices (that's why I made this PR; otherwise DeepEP fails to start up), and the tests pass. (I originally thought the question "let me know if your environment says something different" meant "does the code run on my env, i.e. my device, software, etc.".) I will check on a single-node H100 and update the code later when I have time.

> i use an actual cumem call to see if the allocation is successful.

That looks reasonable.

By the way, the check was taken from https://github.com/kvcache-ai/Mooncake/blob/main/mooncake-transfer-engine/src/transport/nvlink_transport/nvlink_transport.cpp and I also checked the NCCL code a bit. So if that check has issues, maybe Mooncake needs to be updated as well.
# Conflicts:
#   csrc/deep_ep.cpp
> will merge to main branch?

Yes, I hope so. I will do it once I have time (I have other higher-priority tasks now).
```cpp
void cu_mem_set_access_all(void* ptr, size_t size) {
    int device_count;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));

    CUmemAccessDesc access_desc[device_count];
    for (int idx = 0; idx < device_count; ++idx) {
        access_desc[idx].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        access_desc[idx].location.id = idx;
        access_desc[idx].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    }

    CU_CHECK(cuMemSetAccess((CUdeviceptr)ptr, size, access_desc, device_count));
}
```
This has an implicit assumption that all ranks see the same number of GPUs.
A better practice would be for the importer to call cuMemSetAccess for itself after importing.
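A sketch of that importer-side approach is below, under the assumption that the virtual address range was already reserved with cuMemAddressReserve. The helper name is hypothetical; CU_CHECK is the PR's driver-API error-checking macro.

```cpp
#include <cuda.h>

// After importing the fabric handle, grant access only for the importer's own
// device instead of enumerating all local GPUs.
void import_and_enable_local_access(const CUmemFabricHandle& fabric_handle,
                                    CUdeviceptr ptr, size_t size, int local_device) {
    CUmemGenericAllocationHandle handle;
    CU_CHECK(cuMemImportFromShareableHandle(&handle, (void*)&fabric_handle, CU_MEM_HANDLE_TYPE_FABRIC));
    CU_CHECK(cuMemMap(ptr, size, 0, handle, 0));

    // Read/write access for this rank's device only.
    CUmemAccessDesc desc = {};
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id = local_device;
    desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CU_CHECK(cuMemSetAccess(ptr, size, &desc, 1));
}
```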
I did it this way just for simplicity; yes, it can be changed.
```cpp
    CU_CHECK(cuMemRetainAllocationHandle(&handle, ptr));

    CU_CHECK(cuMemExportToShareableHandle(&mem_handle->inner.cu_mem_fabric_handle, handle, CU_MEM_HANDLE_TYPE_FABRIC, 0));
} else {
    CUDA_CHECK(cudaIpcGetMemHandle(&mem_handle->inner.cuda_ipc_mem_handle, ptr));
```
Mixing cuMem and cudaMalloc can be problematic 🤔
It is selected by a constant bool flag, if I understand correctly, so the two allocation paths are not mixed within a run.
youkaichao left a comment
cuMem APIs are fragile and error-prone. If possible, I'd suggest using an existing library (e.g. PyTorch) to allocate such shared memory, and DeepEP would just use that buffer without all these pains.
That is also reasonable, though for simplicity I chose to replace cudaMalloc etc. with the almost-equivalent cuMem APIs.
The code is roughly like this; I will work more on related things, which will also exercise this PR further.

EDIT: it works well on the target hardware; I will try to find some time to beautify and generalize the code (probably some time later).
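For readers following the thread, replacing cudaMalloc with the "almost equivalent" cuMem calls generally looks like the sketch below. This is an illustration of the usual cuMemCreate / cuMemAddressReserve / cuMemMap / cuMemSetAccess sequence, not the exact code in this PR; cleanup and error paths are omitted, the function name is hypothetical, and CU_CHECK is the PR's error-checking macro.

```cpp
#include <cuda.h>

// Rough cudaMalloc replacement using the cuMem (VMM) driver API, padded to the
// allocation granularity and shareable via CU_MEM_HANDLE_TYPE_FABRIC.
CUdeviceptr cu_mem_alloc_shareable(size_t size, int device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;

    size_t granularity = 0;
    CU_CHECK(cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    size_t padded = (size + granularity - 1) / granularity * granularity;

    // Physical allocation, virtual address reservation, and mapping replace a single cudaMalloc.
    CUmemGenericAllocationHandle handle;
    CU_CHECK(cuMemCreate(&handle, padded, &prop, 0));

    CUdeviceptr ptr;
    CU_CHECK(cuMemAddressReserve(&ptr, padded, granularity, 0, 0));
    CU_CHECK(cuMemMap(ptr, padded, 0, handle, 0));

    // Grant read/write access for the owning device; importers grant access for theirs.
    CUmemAccessDesc desc = {};
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id = device;
    desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CU_CHECK(cuMemSetAccess(ptr, padded, &desc, 1));
    return ptr;
}
```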