RCCL Allreduce Backend Implementation #7493
Conversation
```
@@ -3931,6 +3931,14 @@ if test $GPU_SUPPORT = "CUDA" ; then
    fi
fi

if test $GPU_SUPPORT = "HIP" ; then
```
Add build-time detection for the RCCL library. Enables RCCL backend if RCCL is installed.
```
typedef struct MPIR_NCCLcomm {
    ncclUniqueId id;
    ncclComm_t ncclcomm;
    cudaStream_t stream;
} MPIR_NCCLcomm;
#endif /*ENABLE_NCCL */

#ifdef ENABLE_RCCL
```
Extend the internal CCL abstraction to support RCCL, mirroring the structure used for NCCL.
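For reference, a minimal sketch of what the mirrored RCCL structure could look like, assuming the same fields as `MPIR_NCCLcomm` with HIP types in place of CUDA ones; the exact names in the PR may differ:

```c
#include <rccl/rccl.h>          /* ncclUniqueId, ncclComm_t (RCCL keeps the nccl-prefixed names) */
#include <hip/hip_runtime.h>    /* hipStream_t */

#ifdef ENABLE_RCCL
/* Illustrative sketch only: RCCL communicator wrapper mirroring MPIR_NCCLcomm,
 * swapping the CUDA stream for a HIP stream. */
typedef struct MPIR_RCCLcomm {
    ncclUniqueId id;        /* unique id broadcast from rank 0 */
    ncclComm_t ncclcomm;    /* RCCL communicator handle */
    hipStream_t stream;     /* HIP stream the collectives are enqueued on */
} MPIR_RCCLcomm;
#endif /* ENABLE_RCCL */
```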
```
    break;
#endif

#ifdef ENABLE_RCCL
```
Add RCCL to the backend selection logic for Allreduce. If the CVAR specifies rccl, or if the tuning logic selects it, the RCCL Allreduce function is called.
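As a rough sketch of the dispatch shape this implies (the NCCL function name and the argument list are assumed by analogy with the rest of the thread, not copied from the PR):

```c
/* Sketch of the per-backend dispatch implied by the diff. */
switch (ccl) {
#ifdef ENABLE_NCCL
    case MPIR_CVAR_ALLREDUCE_CCL_nccl:
        mpi_errno = MPIR_NCCL_Allreduce(sendbuf, recvbuf, count, datatype, op,
                                        comm_ptr, errflag);
        break;
#endif
#ifdef ENABLE_RCCL
    case MPIR_CVAR_ALLREDUCE_CCL_rccl:
        mpi_errno = MPIR_RCCL_Allreduce(sendbuf, recvbuf, count, datatype, op,
                                        comm_ptr, errflag);
        break;
#endif
    default:
        goto fallback;      /* plain MPI algorithm */
}
```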
Can you add a comment at the case MPIR_CVAR_ALLREDUCE_CCL_auto: to indicate that we need to decide how to do auto for CCL? We cannot have two CCL_auto cases if anyone tries to build MPICH with both NCCL and RCCL support.
```
@@ -1550,6 +1550,7 @@ cvars:
        CCL to use for CCL allreduce
        auto - Internal algorithm selection (use the value from the .json tuning file)
        nccl - Force NCCL
        rccl - Force RCCL
```
Added rccl as a valid value for the Allreduce CCL CVAR.
```
@@ -150,6 +150,8 @@ int get_ccl_from_string(const char *ccl_str)
        ccl = MPIR_CVAR_ALLREDUCE_CCL_auto;
    else if (0 == strcmp(ccl_str, "nccl"))
        ccl = MPIR_CVAR_ALLREDUCE_CCL_nccl;
    else if (0 == strcmp(ccl_str, "rccl"))
```
Parse user input for backend selection. Now recognizes rccl as a valid keyword in the environment variable string.
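A sketch of the completed branch; the enum constant MPIR_CVAR_ALLREDUCE_CCL_rccl is assumed by analogy with the nccl case above, not taken from the PR:

```c
    /* Sketch: the rccl branch completed by analogy with the nccl one. */
    else if (0 == strcmp(ccl_str, "rccl"))
        ccl = MPIR_CVAR_ALLREDUCE_CCL_rccl;
```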
src/mpi/coll/src/csel.c
```
@@ -1361,4 +1361,4 @@ void *MPIR_Csel_search(void *csel_, MPIR_Csel_coll_sig_s coll_info)

  fn_exit:
    return node ? node->u.cnt.container : NULL;
-}
+}
```
Didn't mean to change this oops
Please update the PR to remove this change
```
@@ -427,6 +427,8 @@ void *MPII_Create_container(struct json_object *obj)
    else if (!strcmp(ckey, "algorithm=MPIR_Allreduce_intra_k_reduce_scatter_allgather"))
        cnt->id =
            MPII_CSEL_CONTAINER_TYPE__ALGORITHM__MPIR_Allreduce_intra_k_reduce_scatter_allgather;
    else if (!strcmp(ckey, "algorithm=MPIR_Allreduce_intra_ccl"))
```
Support RCCL-based algorithms in JSON tuning files. Recognizes MPIR_Allreduce_intra_ccl as a selectable algorithm ID, enabling runtime tuning to choose RCCL dynamically.
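A sketch of what the completed branch looks like; the container-type constant matches the case label quoted later in this thread:

```c
    /* Sketch: map the JSON key to the csel container type for the CCL allreduce. */
    else if (!strcmp(ckey, "algorithm=MPIR_Allreduce_intra_ccl"))
        cnt->id = MPII_CSEL_CONTAINER_TYPE__ALGORITHM__MPIR_Allreduce_intra_ccl;
```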
```
@@ -6,4 +6,5 @@
 mpi_core_sources += \
     src/util/ccl/cclcomm.c \
     src/util/ccl/ccl_impl.c \
-    src/util/ccl/nccl.c
+    src/util/ccl/nccl.c \
+    src/util/ccl/rccl.c
```
Include new RCCL backend source file in the build.
```
@@ -42,6 +46,15 @@ int MPIR_CCLcomm_free(MPIR_Comm * comm_ptr)
}
#endif

#ifdef ENABLE_RCCL
```
Integrate the RCCL communicator with the existing communicator logic.
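A hedged sketch of the kind of teardown this adds, mirroring the NCCL block above it; the field names (cclcomm->rcclcomm, cclcomm->stream) and error-handling labels are assumptions, not the PR's exact code:

```c
#ifdef ENABLE_RCCL
    /* Hypothetical RCCL cleanup mirroring the NCCL path: destroy the RCCL
     * communicator and its HIP stream when the MPI communicator is freed. */
    if (ncclCommDestroy(cclcomm->rcclcomm) != ncclSuccess)
        goto fn_fail;
    if (hipStreamDestroy(cclcomm->stream) != hipSuccess)
        goto fn_fail;
#endif
```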
```
@@ -151,6 +151,7 @@ static int MPIR_NCCL_get_datatype(MPI_Datatype dtype, ncclDataType_t * nccl_dtyp
        break;
    case MPIR_FLOAT16:
        *nccl_dtype = ncclFloat16;
        break;
```
Added missing break statement in NCCL backend.
Thanks for catching this!
```
 * Copyright (C) by Argonne National Laboratory
 * See COPYRIGHT in top-level directory
 */
```
Implement the actual RCCL-based collective logic. This file is a functional mirror of nccl.c, customized for ROCm, containing:
- Initialization of HIP streams and communicators.
- Compatibility checks for operations and datatypes.
- Core Allreduce functionality using ncclAllReduce over HIP.
- Cleanup and error handling logic.
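Below is a minimal sketch of the core path, assuming the structure described above. The helper name and error handling are illustrative only; RCCL exposes the nccl-prefixed API over HIP streams.

```c
#include <rccl/rccl.h>
#include <hip/hip_runtime.h>

/* Illustrative core of an RCCL allreduce: enqueue ncclAllReduce on the HIP
 * stream tied to the communicator, then synchronize before returning to the
 * MPI layer.  The real code first maps MPI datatypes/ops to ncclDataType_t /
 * ncclRedOp_t and falls back to a plain MPI algorithm when unsupported. */
static int rccl_allreduce_sketch(const void *sendbuf, void *recvbuf, size_t count,
                                 ncclDataType_t dtype, ncclRedOp_t op,
                                 ncclComm_t comm, hipStream_t stream)
{
    if (ncclAllReduce(sendbuf, recvbuf, count, dtype, op, comm, stream) != ncclSuccess)
        return -1;      /* real code: goto fn_fail and convert to an MPI error code */

    if (hipStreamSynchronize(stream) != hipSuccess)
        return -1;

    return 0;           /* MPI_SUCCESS in the real implementation */
}
```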
```
@@ -4,3 +4,8 @@ allred 4 arg=-counts=10,100 arg=-oddmemtype=device arg=-evenmemtype=device env=M
allred 7 arg=-counts=10 arg=-oddmemtype=device arg=-evenmemtype=device env=MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=ccl env=MPIR_CVAR_ALLREDUCE_CCL=nccl env=MPIR_CVAR_ENABLE_GPU=1 env=MPIR_CVAR_ODD_EVEN_CLIQUES=0 env=MPIR_CVAR_NUM_CLIQUES=1
allred2 4 arg=-oddmemtype=device arg=-evenmemtype=device env=MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=ccl env=MPIR_CVAR_ALLREDUCE_CCL=nccl env=MPIR_CVAR_ENABLE_GPU=1 env=MPIR_CVAR_ODD_EVEN_CLIQUES=0 env=MPIR_CVAR_NUM_CLIQUES=1
allred2 12 arg=-oddmemtype=device arg=-evenmemtype=device env=MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=ccl env=MPIR_CVAR_ALLREDUCE_CCL=nccl env=MPIR_CVAR_ENABLE_GPU=1 env=MPIR_CVAR_ODD_EVEN_CLIQUES=0 env=MPIR_CVAR_NUM_CLIQUES=1
```
Extend collective tests to include RCCL.
We may need some changes in Jenkins to run these new collective tests properly @yfguo
```
@@ -890,6 +890,14 @@ if test "$have_gpu" = "no" ; then
    AC_SUBST([hip_CPPFLAGS])
    AC_SUBST([hip_LDFLAGS])
    AC_SUBST([hip_LIBS])
```
Set up test environment for HIP/RCCL support.
test/mpi/runtests
```
@@ -449,6 +449,7 @@ sub LoadTests {
    if ($test_opt->{gpu}) {
        push @{$test_opt->{envs}}, "MPIR_CVAR_ENABLE_GPU=1";
    } elsif ($_f eq "testlist.gpu" or $curdir =~ /\bcuda$/) {
    # } elsif ($_f eq "testlist.gpu" or $curdir =~ /\b(?:cuda|hip)$/) {
```
Configure runtime test environment.
Great job @ggracelii! Just a few minor comments for you.
Also please squash all of your commits to one. You can do this using an interactive rebase: https://www.freecodecamp.org/news/git-squash-commits/
```
  fallback:
-    return MPIR_Allreduce_allcomm_auto(sendbuf, recvbuf, count, datatype, op, comm_ptr, errflag);
+    return MPIR_Allreduce_intra_recursive_doubling(sendbuf, recvbuf, count, datatype, op, comm_ptr, errflag);
```
Why change the fallback from auto to recursive doubling?
MPIR_Allreduce_allcomm_auto loops back to the RCCL logic:
```
case MPII_CSEL_CONTAINER_TYPE__ALGORITHM__MPIR_Allreduce_intra_ccl:
    mpi_errno = MPIR_Allreduce_intra_ccl(sendbuf, recvbuf, count, datatype, op, comm_ptr,
                                         cnt->u.allreduce.intra_ccl.ccl, errflag);
    break;
```
If the RCCL requirements are not met, it falls back to MPIR_Allreduce_allcomm_auto, which creates an infinite recursive loop; this resulted in a segfault when I ran the benchmarks.
src/util/ccl/rccl.c
```
        if (unlikely((ret) != hipSuccess)) \
            goto fn_fail

#define NCCL_ERR_CHECK(ret) \
```
I recommend calling this RCCL_ERR_CHECK so you do not get a re-definition error if you try to compile with both NCCL and RCCL, even though I recognize it is identical to NCCL_ERR_CHECK because RCCL uses NCCL values like ncclSuccess.
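Something along these lines, reusing the pattern already shown in the diff (a sketch of the suggested rename, not the final code):

```c
/* Suggested rename sketch: same body as NCCL_ERR_CHECK, but under an
 * RCCL-specific name so both backends can be compiled into the same build. */
#define RCCL_ERR_CHECK(ret) \
    if (unlikely((ret) != ncclSuccess)) \
        goto fn_fail
```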
src/util/ccl/rccl.c
```
    ncclGetUniqueId(&(rcclcomm->id));
}

mpi_errno = MPIR_Bcast_impl(&(rcclcomm->id), sizeof(rcclcomm->id), MPIR_UINT8, 0, comm_ptr,
```
I think this should just be MPIR_Bcast(. This is also wrong in the NCCL implementation.
configure.ac
```
@@ -3928,6 +3928,14 @@ if test $GPU_SUPPORT = "CUDA" ; then
    fi
fi

if test $GPU_SUPPORT = "HIP" ; then
    PAC_CHECK_HEADER_LIB_OPTIONAL([rccl], [rccl.h], [rccl], [ncclCommInitAll])
```
This should be rcclCommInitAll instead of ncclCommInitAll, I think.
nvm. I didn't realize that RCCL generates nccl-prefixed names.
This line should be PAC_CHECK_HEADER_LIB_OPTIONAL([rccl], [rccl/rccl.h], [rccl], [ncclCommInitAll],, [-D__HIP_PLATFORM_AMD__]).
test/mpi/configure.ac
```
@@ -890,6 +890,14 @@ if test "$have_gpu" = "no" ; then
    AC_SUBST([hip_CPPFLAGS])
    AC_SUBST([hip_LDFLAGS])
    AC_SUBST([hip_LIBS])

if test "$pac_have_hip" = "yes" ; then
    PAC_CHECK_HEADER_LIB_OPTIONAL([rccl], [rccl.h], [rccl], [ncclCommInitAll])
```
Similar to the configure.ac, this line should be PAC_CHECK_HEADER_LIB_OPTIONAL([rccl], [rccl/rccl.h], [rccl], [ncclCommInitAll],, [-D__HIP_PLATFORM_AMD__]).
How to Build and Run RCCL Allreduce Tests in MPICH
I tested these on 2 JLSE nodes with AMD GPUs.

Configure and Build MPICH Test Suite
```
cd ~/grace_mpich/test/mpi   # use your MPICH directory
./autogen.sh
export HIP_PATH=/soft/compilers/rocm/rocm-6.3.2   # use your HIP path
export HIPCC=$HIP_PATH/bin/hipcc
export PATH=$HIP_PATH/bin:$PATH
./configure --with-mpi=/home/grace.li/grace_mpich/build/install   # use your install path
```

Build Test Programs
```
make -j $(nproc)
```
To double check all binaries exist, run:
```
find coll/ -maxdepth 1 -type f -executable -name 'allred*'
```
You should see these:
```
coll/allred
coll/allred2
coll/allred3
coll/allred4
coll/allred5
coll/allred6
coll/allredmany
coll/allred_derived
coll/allred_float
```

Prepare RCCL Allreduce Test Script
```
cd coll
```
Create a file named hosts.txt listing the nodes to use:
```
amdgpu02
amdgpu03
```
Create a file named runtests_rccl_allred.sh containing:
```
#!/bin/bash
set -e

# path to your mpiexec
MPIEXEC=/home/grace.li/grace_mpich/build/install/bin/mpiexec

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 4 -ppn 2 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred -counts=10,100 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 7 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred -counts=10 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 4 -ppn 2 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred2 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 8 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred2 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 4 -ppn 2 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred3 -counts=1000 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 6 -ppn 3 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred4 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 8 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred5 -counts=10,1000 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 8 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred6 -counts=100000 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 6 -ppn 3 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allredmany -counts=10 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 8 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred_derived -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 7 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred_float -counts=10,10000 -memtype=all
```

Run Tests
Make the script executable and run it:
```
chmod +x runtests_rccl_allred.sh
./runtests_rccl_allred.sh
```
You should see "No Errors" printed in your terminal for each of the tests if everything passes correctly.
Pull Request Description
This PR adds initial support for the AMD ROCm RCCL backend to MPICH’s collective communication layer, starting with the Allreduce operation. The implementation is modeled closely after the existing NCCL backend, preserving consistency in design and modularity across GPU backends.
Motivation
The RCCL backend delivers substantial performance improvements over MPI-native collectives on AMD GPUs, especially for large message sizes. As shown in the graphs, RCCL achieves up to 7× lower latency on a single node and 3× on two nodes compared to MPICH for 1 MiB messages.
Key Features
- MPIR_RCCL_Allreduce backend with hipStream support
- Selection via MPIR_CVAR_ALLREDUCE_CCL=rccl
This PR does not yet generalize other collectives to RCCL. The focus is only the Allreduce collective operation.
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.