Description
Describe the bug
cudaMemcpy operations between devices are not synchronized when using the lnx provider. This results in ongoing transfers reported to the host as completed before they actually complete. In particular, osu_bibw with validation -c often fails, or reports unreasonably high bandwidth. This is related to results discussed with @amirshehataornl in open-mpi/ompi#13156
To Reproduce
Compile libfabric with lnx and CUDA support, then run osu_bibw -c D D:
mpirun -np 2 -x FI_SHM_USE_XPMEM=1 -x FI_HMEM_CUDA_USE_GDRCOPY=1 -x FI_LNX_PROV_LINKS="shm" -mca pml cm -mca mtl ofi --mca opal_common_ofi_provider_include "lnx" -map-by numa -prtemca ras_base_launch_orted_on_hn 1 -mca mtl_ofi_av table ~/gpubind_pmix.sh ./osu_bibw -c D D
[x1000c1s0b0n0:668142] SET FI_SHM_USE_XPMEM=1
[x1000c1s0b0n0:668142] SET FI_HMEM_CUDA_USE_GDRCOPY=1
[x1000c1s0b0n0:668142] SET FI_LNX_PROV_LINKS=shm
rank 0 local 0 gpu 0
rank 1 local 1 gpu 1
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s) Validation
1 0.58 Pass
2 1.18 Fail
The above failures may be hard to reproduce, depending on the system. When there is no failure, the reported bandwidths for large messages are unreasonably high (the maximum for the GH200 system tested here is about 200 GB/s):
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s) Validation
1 0.58 Pass
2 1.18 Pass
4 2.35 Pass
8 4.70 Pass
16 9.41 Pass
32 18.73 Pass
64 37.69 Pass
128 75.20 Pass
256 150.05 Pass
512 297.85 Pass
1024 600.33 Pass
2048 1193.13 Pass
4096 2384.77 Pass
8192 4768.02 Pass
16384 9550.22 Pass
32768 19117.57 Pass
65536 38159.45 Pass
131072 77169.28 Pass
262144 157666.12 Pass
524288 318422.65 Pass
1048576 638673.26 Pass
2097152 1281063.50 Pass
4194304 2554071.38 Pass
Expected behavior
After adding calls to cuda_set_sync_memops in lnx_trecv and lnx_tsenddata (see patch below), the results validate and the bandwidth for large messages is more reasonable:
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s) Validation
1 0.22 Pass
2 0.43 Pass
4 0.87 Pass
8 1.73 Pass
16 3.45 Pass
32 6.90 Pass
64 13.84 Pass
128 27.57 Pass
256 54.94 Pass
512 110.11 Pass
1024 219.93 Pass
2048 437.83 Pass
4096 871.59 Pass
8192 1717.63 Pass
16384 3415.92 Pass
32768 6793.29 Pass
65536 13068.11 Pass
131072 24836.90 Pass
262144 45147.18 Pass
524288 75769.13 Pass
1048576 114856.73 Pass
2097152 155360.10 Pass
4194304 191511.74 Pass
The attached patch is most likely wrong/incomplete in general, but it does demonstrate the point for device-to-device transfers with CUDA.
diff --git a/prov/lnx/src/lnx_ops.c b/prov/lnx/src/lnx_ops.c
index 41879b086..b6120bf38 100644
--- a/prov/lnx/src/lnx_ops.c
+++ b/prov/lnx/src/lnx_ops.c
@@ -453,6 +453,7 @@ ssize_t lnx_trecv(struct fid_ep *ep, void *buf, size_t len, void *desc,
 	struct lnx_ep *lep;
 	const struct iovec iov = {.iov_base = buf, .iov_len = len};
 
+	cuda_set_sync_memops(buf);
 	lep = container_of(ep, struct lnx_ep, le_ep.ep_fid.fid);
 	if (!lep)
 		return -FI_ENOSYS;
@@ -664,6 +665,7 @@ ssize_t lnx_tsenddata(struct fid_ep *ep, const void *buf, size_t len, void *desc
 	fi_addr_t core_addr;
 	void *core_desc = desc;
 
+	cuda_set_sync_memops(buf);
 	lep = container_of(ep, struct lnx_ep, le_ep.ep_fid.fid);
 	if (!lep)
 		return -FI_ENOSYS;