
prov/lnx: Device to Device cudaMemcpy not synchronized #11231

@angainor

Description

Describe the bug
cudaMemcpy operations between devices are not synchronized when using the lnx provider. As a result, in-flight transfers are reported to the host as completed before they actually finish. In particular, osu_bibw with validation (-c) often fails, or reports unreasonably high bandwidth. This is related to the results discussed with @amirshehataornl in open-mpi/ompi#13156.
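
For context, here is a minimal standalone sketch (not from the issue or from libfabric; the device indices, peer-access setup, and the cudaMemcpyPeer call are assumptions for illustration) of the underlying CUDA behaviour: a device-to-device copy is asynchronous with respect to the host, so the host can see it as complete before the data has landed, unless the host synchronizes explicitly or SYNC_MEMOPS is set on the buffer up front.

/* Hypothetical illustration only; build against the CUDA runtime and
 * driver APIs (link with -lcudart -lcuda). Assumes two GPUs with peer
 * access. */
#include <cuda_runtime.h>
#include <cuda.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        size_t len = 1 << 22;
        void *src = NULL, *dst = NULL;
        int flag = 1;

        cudaSetDevice(0);
        cudaMalloc(&src, len);
        cudaSetDevice(1);
        cudaMalloc(&dst, len);
        cudaDeviceEnablePeerAccess(0, 0);

        /* Fix (b), which the patch below applies to the lnx buffers:
         * mark the buffer so memory operations touching it always
         * synchronize. */
        cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
                              (CUdeviceptr)(uintptr_t)dst);

        /* A device-to-device copy is asynchronous with respect to the
         * host; without fix (a) or (b) it can return before dst actually
         * holds the data. */
        cudaMemcpyPeer(dst, 1, src, 0, len);

        /* Fix (a): explicit host synchronization before any host-side
         * timing or validation of dst. */
        cudaDeviceSynchronize();

        cudaFree(dst);
        cudaSetDevice(0);
        cudaFree(src);
        printf("done\n");
        return 0;
}

Either fix makes the host-visible completion match the actual copy; the patch below takes the second route on the lnx send/receive buffers.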

To Reproduce
Compiled libfabric with lnx and CUDA support, then ran osu_bibw -c D D:

mpirun -np 2 -x FI_SHM_USE_XPMEM=1 -x FI_HMEM_CUDA_USE_GDRCOPY=1 -x FI_LNX_PROV_LINKS="shm" -mca pml cm -mca mtl ofi --mca opal_common_ofi_provider_include "lnx" -map-by numa -prtemca ras_base_launch_orted_on_hn 1 -mca mtl_ofi_av table ~/gpubind_pmix.sh ./osu_bibw -c D D
[x1000c1s0b0n0:668142] SET FI_SHM_USE_XPMEM=1
[x1000c1s0b0n0:668142] SET FI_HMEM_CUDA_USE_GDRCOPY=1
[x1000c1s0b0n0:668142] SET FI_LNX_PROV_LINKS=shm
rank 0 local 0 gpu 0
rank 1 local 1 gpu 1

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)        Validation
1                       0.58              Pass
2                       1.18              Fail

The above failures may be hard to reproduce, depending on the system. When there is no validation failure, the reported bandwidths for large messages are unreasonably high (the maximum for the GH200 system tested here is about 200 GB/s):

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)        Validation
1                       0.58              Pass
2                       1.18              Pass
4                       2.35              Pass
8                       4.70              Pass
16                      9.41              Pass
32                     18.73              Pass
64                     37.69              Pass
128                    75.20              Pass
256                   150.05              Pass
512                   297.85              Pass
1024                  600.33              Pass
2048                 1193.13              Pass
4096                 2384.77              Pass
8192                 4768.02              Pass
16384                9550.22              Pass
32768               19117.57              Pass
65536               38159.45              Pass
131072              77169.28              Pass
262144             157666.12              Pass
524288             318422.65              Pass
1048576            638673.26              Pass
2097152           1281063.50              Pass
4194304           2554071.38              Pass

Expected behavior
After adding calls to cuda_set_sync_memops in lnx_trecv and lnx_tsenddata (see the patch below), the results validate and the bandwidth for large messages is more reasonable:

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)        Validation
1                       0.22              Pass
2                       0.43              Pass
4                       0.87              Pass
8                       1.73              Pass
16                      3.45              Pass
32                      6.90              Pass
64                     13.84              Pass
128                    27.57              Pass
256                    54.94              Pass
512                   110.11              Pass
1024                  219.93              Pass
2048                  437.83              Pass
4096                  871.59              Pass
8192                 1717.63              Pass
16384                3415.92              Pass
32768                6793.29              Pass
65536               13068.11              Pass
131072              24836.90              Pass
262144              45147.18              Pass
524288              75769.13              Pass
1048576            114856.73              Pass
2097152            155360.10              Pass
4194304            191511.74              Pass

The attached patch is most likely wrong/incomplete in general, but it does demonstrate the point for device-to-device transfers with CUDA.

diff --git a/prov/lnx/src/lnx_ops.c b/prov/lnx/src/lnx_ops.c
index 41879b086..b6120bf38 100644
--- a/prov/lnx/src/lnx_ops.c
+++ b/prov/lnx/src/lnx_ops.c
@@ -453,6 +453,7 @@ ssize_t lnx_trecv(struct fid_ep *ep, void *buf, size_t len, void *desc,
        struct lnx_ep *lep;
        const struct iovec iov = {.iov_base = buf, .iov_len = len};
 
+       cuda_set_sync_memops(buf);
        lep = container_of(ep, struct lnx_ep, le_ep.ep_fid.fid);
        if (!lep)
                return -FI_ENOSYS;
@@ -664,6 +665,7 @@ ssize_t lnx_tsenddata(struct fid_ep *ep, const void *buf, size_t len, void *desc
        fi_addr_t core_addr;
        void *core_desc = desc;
 
+       cuda_set_sync_memops(buf);
        lep = container_of(ep, struct lnx_ep, le_ep.ep_fid.fid);
        if (!lep)
                return -FI_ENOSYS;
