
Commit 2abd7e3

Allow NVSHMEM PE-to-NIC mapping to be initialized by rank
The `nvshmemi_get_devices_by_distance` default initialization method in NVSHMEM does not work optimally for GPU configurations where 2 GPUs and 2 RDMA NICs share a PCIe bus, such as the x86-based GCP A3 Ultra (H200) and A4 (B200) instance types: https://cloud.google.com/compute/docs/gpus/gpu-network-bandwidth#h200-gpus. GPU0 and GPU1 (on two independent processes) can each observe that NIC0 and NIC1 on the same PCIe switch are equidistant, so both GPUs end up using NIC0, halving the observed RDMA bandwidth in test_internode.py and in vLLM wide-EP.

The alternative is a static mapping between GPU host index (PE) and NIC index (HCA), but the NVSHMEMX_INIT_WITH_UNIQUEID initialization method bypasses setting `mype_node` and `npes_node`. With this initialization method, `nvshmemi_boot_handle.pg_rank` is always 0 and `nvshmemi_boot_handle.pg_size` is always 2, preventing NVSHMEM_ENABLE_NIC_PE_MAPPING from leveraging a static list of devices in transport.cpp#nvshmemi_setup_connections: `selected_devices[0] = nvshmemi_state->mype_node % (tcurr->n_devices > 0 ? tcurr->n_devices : 1);` evaluates with `mype_node = 0` on every rank, so all PEs select the same device.

To allow static assignment, introduce a DEEP_EP_DEVICE_TO_HCA_MAPPING environment variable, read during Buffer Python initialization, that accepts `<cuda_device_id>:<HCA_name>:<HCA_port>` entries, resolves `torch.cuda.current_device()` against them, and sets NVSHMEM_HCA_LIST to the matching value or raises an error.

Co-Authored-By: Keon Jang <[email protected]>
Signed-off-by: Clayton Coleman <[email protected]>
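For illustration, a minimal per-rank sketch of how the variable might be populated before constructing the DeepEP `Buffer` on an 8-GPU, 8-NIC host. The HCA names (mlx5_0..mlx5_7), the ":1" port suffix, and the `Buffer(...)` constructor arguments are assumptions for this sketch, not part of this commit.

import os
import torch
import torch.distributed as dist
from deep_ep import Buffer

# Assumed topology: 8 local GPUs, one RDMA NIC per GPU, NICs named mlx5_0..mlx5_7
# with port 1 (verify real names with `ibv_devinfo` on the target host).
os.environ['DEEP_EP_DEVICE_TO_HCA_MAPPING'] = ','.join(f'{i}:mlx5_{i}:1' for i in range(8))

dist.init_process_group(backend='nccl')
torch.cuda.set_device(dist.get_rank() % 8)

# Buffer reads DEEP_EP_DEVICE_TO_HCA_MAPPING during initialization and sets
# NVSHMEM_HCA_LIST for this process based on torch.cuda.current_device().
# Constructor arguments below are placeholders, not prescribed by this commit.
buffer = Buffer(dist.group.WORLD, num_rdma_bytes=1 << 30, low_latency_mode=True, num_qps_per_rank=8)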
1 parent bfded34 commit 2abd7e3

File tree

1 file changed: +25 -0 lines changed


deep_ep/buffer.py

Lines changed: 25 additions & 0 deletions
@@ -101,6 +101,8 @@ def all_gather_object(obj):
         # Synchronize NVSHMEM unique IDs
         root_unique_id = None
         if self.runtime.get_num_rdma_ranks() > 1 or low_latency_mode:
+            self._setup_device_hca_mapping()
+
             # Enable IBGDA
             assert num_qps_per_rank > 0
             os.environ['NVSHMEM_DISABLE_P2P'] = '0' if allow_nvlink_for_low_latency_mode else '1'
@@ -133,6 +135,29 @@ def all_gather_object(obj):
         self.runtime.sync(device_ids, ipc_handles, root_unique_id)
         assert self.runtime.is_available()
 
+    def _setup_device_hca_mapping(self):
+        """
+        Set up the device-to-NIC mapping using the DEEP_EP_DEVICE_TO_HCA_MAPPING environment variable.
+        The mapping format is "0:mlx5_0:1,1:mlx5_1:1,...", where each entry maps a CUDA device ID
+        to an HCA name, separated by a colon. The HCA name may include additional suffixes such as ":1".
+        """
+        if 'DEEP_EP_DEVICE_TO_HCA_MAPPING' in os.environ:
+            device_mapping = {}
+            mapping_str = os.environ['DEEP_EP_DEVICE_TO_HCA_MAPPING']
+            # Parse a mapping string like "0:mlx5_0:1,1:mlx5_1:1,..."
+            for mapping in mapping_str.split(','):
+                assert ':' in mapping, f"Invalid mapping format '{mapping}' in DEEP_EP_DEVICE_TO_HCA_MAPPING. Expected format: '<device_id>:<hca_name>'"
+                parts = mapping.split(':', 1)  # Split only on the first colon
+                device_id = int(parts[0])
+                hca_name = parts[1]  # Keep the rest as the HCA name (including ":1")
+                device_mapping[device_id] = hca_name
+
+            # Get the current device and set the appropriate HCA
+            current_device = torch.cuda.current_device()
+            assert current_device in device_mapping, f"Current CUDA device {current_device} not found in DEEP_EP_DEVICE_TO_HCA_MAPPING"
+            os.environ['NVSHMEM_ENABLE_PE_MAPPING'] = '1'
+            os.environ['NVSHMEM_HCA_LIST'] = device_mapping[current_device]
+
     def destroy(self):
         """
         Destroy the cpp runtime and release resources.
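As a quick sanity check (under the assumed HCA naming from the example above), the process whose current CUDA device is 3 should end up with the following per-process NVSHMEM settings after `Buffer` initialization; this is only an illustration of what the new helper writes, not captured output.

import os
import torch

# With DEEP_EP_DEVICE_TO_HCA_MAPPING = "0:mlx5_0:1,1:mlx5_1:1,...,7:mlx5_7:1" (assumed names)
# and torch.cuda.current_device() == 3, _setup_device_hca_mapping() leaves:
assert os.environ.get('NVSHMEM_ENABLE_PE_MAPPING') == '1'
assert os.environ.get('NVSHMEM_HCA_LIST') == 'mlx5_3:1'
print('PE on device', torch.cuda.current_device(), '->', os.environ['NVSHMEM_HCA_LIST'])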
