
prov/lnx: Device to Device cudaMemcpy not synchronized #11231

@angainor

Description

Describe the bug
cudaMemcpy operations between devices are not synchronized when using the lnx provider. As a result, in-flight transfers are reported to the host as completed before they actually finish. In particular, osu_bibw with validation (-c) often fails, or reports unreasonably high bandwidth. This is related to the results discussed with @amirshehataornl in open-mpi/ompi#13156.
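
For context, here is a minimal standalone sketch (not from the issue or from libfabric; the device indices, peer-access setup, and the cudaMemcpyPeer call are assumptions for illustration) of the underlying CUDA behaviour: a device-to-device copy is asynchronous with respect to the host, so the host can see it as complete before the data has landed, unless the host synchronizes explicitly or SYNC_MEMOPS is set on the buffer up front.

/* Hypothetical illustration only; build against the CUDA runtime and
 * driver APIs (link with -lcudart -lcuda). Assumes two GPUs with peer
 * access. */
#include <cuda_runtime.h>
#include <cuda.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        size_t len = 1 << 22;
        void *src = NULL, *dst = NULL;
        int flag = 1;

        cudaSetDevice(0);
        cudaMalloc(&src, len);
        cudaSetDevice(1);
        cudaMalloc(&dst, len);
        cudaDeviceEnablePeerAccess(0, 0);

        /* Fix (b), which the patch below applies to the lnx buffers:
         * mark the buffer so memory operations touching it always
         * synchronize. */
        cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
                              (CUdeviceptr)(uintptr_t)dst);

        /* A device-to-device copy is asynchronous with respect to the
         * host; without fix (a) or (b) it can return before dst actually
         * holds the data. */
        cudaMemcpyPeer(dst, 1, src, 0, len);

        /* Fix (a): explicit host synchronization before any host-side
         * timing or validation of dst. */
        cudaDeviceSynchronize();

        cudaFree(dst);
        cudaSetDevice(0);
        cudaFree(src);
        printf("done\n");
        return 0;
}

Either fix makes the host-visible completion match the actual copy; the patch below takes the second route on the lnx send/receive buffers.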

To Reproduce
Compiled libfabric with lnx and CUDA support, then ran osu_bibw -c D D:

mpirun -np 2 -x FI_SHM_USE_XPMEM=1 -x FI_HMEM_CUDA_USE_GDRCOPY=1 -x FI_LNX_PROV_LINKS="shm" -mca pml cm -mca mtl ofi --mca opal_common_ofi_provider_include "lnx" -map-by numa -prtemca ras_base_launch_orted_on_hn 1 -mca mtl_ofi_av table ~/gpubind_pmix.sh ./osu_bibw -c D D
[x1000c1s0b0n0:668142] SET FI_SHM_USE_XPMEM=1
[x1000c1s0b0n0:668142] SET FI_HMEM_CUDA_USE_GDRCOPY=1
[x1000c1s0b0n0:668142] SET FI_LNX_PROV_LINKS=shm
rank 0 local 0 gpu 0
rank 1 local 1 gpu 1

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)        Validation
1                       0.58              Pass
2                       1.18              Fail

The above failures may be hard to reproduce, depending on the system. When there is no validation failure, the reported bandwidths for large messages are unreasonably high (the maximum for the GH200 system tested here is about 200 GB/s):

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)        Validation
1                       0.58              Pass
2                       1.18              Pass
4                       2.35              Pass
8                       4.70              Pass
16                      9.41              Pass
32                     18.73              Pass
64                     37.69              Pass
128                    75.20              Pass
256                   150.05              Pass
512                   297.85              Pass
1024                  600.33              Pass
2048                 1193.13              Pass
4096                 2384.77              Pass
8192                 4768.02              Pass
16384                9550.22              Pass
32768               19117.57              Pass
65536               38159.45              Pass
131072              77169.28              Pass
262144             157666.12              Pass
524288             318422.65              Pass
1048576            638673.26              Pass
2097152           1281063.50              Pass
4194304           2554071.38              Pass

Expected behavior
After adding calls to cuda_set_sync_memops in lnx_trecv and lnx_tsenddata (see the patch below), the results validate and the bandwidth for large messages is more reasonable:

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)        Validation
1                       0.22              Pass
2                       0.43              Pass
4                       0.87              Pass
8                       1.73              Pass
16                      3.45              Pass
32                      6.90              Pass
64                     13.84              Pass
128                    27.57              Pass
256                    54.94              Pass
512                   110.11              Pass
1024                  219.93              Pass
2048                  437.83              Pass
4096                  871.59              Pass
8192                 1717.63              Pass
16384                3415.92              Pass
32768                6793.29              Pass
65536               13068.11              Pass
131072              24836.90              Pass
262144              45147.18              Pass
524288              75769.13              Pass
1048576            114856.73              Pass
2097152            155360.10              Pass
4194304            191511.74              Pass

The attached patch is most likely wrong/incomplete in general, but it does demonstrate the point for device-to-device transfers with CUDA.

diff --git a/prov/lnx/src/lnx_ops.c b/prov/lnx/src/lnx_ops.c
index 41879b086..b6120bf38 100644
--- a/prov/lnx/src/lnx_ops.c
+++ b/prov/lnx/src/lnx_ops.c
@@ -453,6 +453,7 @@ ssize_t lnx_trecv(struct fid_ep *ep, void *buf, size_t len, void *desc,
        struct lnx_ep *lep;
        const struct iovec iov = {.iov_base = buf, .iov_len = len};
 
+       cuda_set_sync_memops(buf);
        lep = container_of(ep, struct lnx_ep, le_ep.ep_fid.fid);
        if (!lep)
                return -FI_ENOSYS;
@@ -664,6 +665,7 @@ ssize_t lnx_tsenddata(struct fid_ep *ep, const void *buf, size_t len, void *desc
        fi_addr_t core_addr;
        void *core_desc = desc;
 
+       cuda_set_sync_memops(buf);
        lep = container_of(ep, struct lnx_ep, le_ep.ep_fid.fid);
        if (!lep)
                return -FI_ENOSYS;
