Conversation


@ggracelii ggracelii commented Jul 2, 2025

Pull Request Description

This PR adds initial support for the AMD ROCm RCCL backend to MPICH’s collective communication layer, starting with the Allreduce operation. The implementation is modeled closely after the existing NCCL backend, preserving consistency in design and modularity across GPU backends.

Motivation

The RCCL backend delivers substantial performance improvements over MPI-native collectives on AMD GPUs, especially for large message sizes. As shown in the latency plots below, RCCL achieves up to 7× lower latency on a single node and up to 3× lower on two nodes than MPICH's native Allreduce for 1 MiB messages.

[Figures: Allreduce latency comparison, 1 node and 2 nodes]

Key Features

  • Configure-time detection of RCCL
  • New MPIR_RCCL_Allreduce backend with hipStream support
  • Environment variable control via MPIR_CVAR_ALLREDUCE_CCL=rccl
  • Test infrastructure updates to support HIP-based testing
  • Designed to seamlessly coexist with other backends (e.g., NCCL)

This PR does not yet extend the other collectives to RCCL; the focus is solely the Allreduce operation. A minimal usage sketch is shown below.
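For context, here is a hypothetical standalone example (not part of this PR's diff) of an Allreduce on HIP device memory that the new backend would handle when run with MPIR_CVAR_ENABLE_GPU=1, MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=ccl, and MPIR_CVAR_ALLREDUCE_CCL=rccl; error handling is kept deliberately minimal.

/* rccl_allred_example.c -- hypothetical usage sketch, not from this PR */
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int count = 1 << 20;
    double *d_buf = NULL;

    /* Allocate and zero a device buffer; a GPU-aware MPICH can pass it to MPI directly. */
    hipMalloc((void **) &d_buf, count * sizeof(double));
    hipMemset(d_buf, 0, count * sizeof(double));

    /* Expected to dispatch to the RCCL Allreduce path under the CVARs listed above. */
    MPI_Allreduce(MPI_IN_PLACE, d_buf, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    hipFree(d_buf);
    MPI_Finalize();
    return 0;
}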

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@ggracelii ggracelii changed the title from "Rccl allreduce" to "RCCL Allreduce Backend Implementation" on Jul 2, 2025
@@ -3931,6 +3931,14 @@ if test $GPU_SUPPORT = "CUDA" ; then
fi
fi

if test $GPU_SUPPORT = "HIP" ; then
Author

Add build-time detection for the RCCL library. This enables the RCCL backend when RCCL is installed.

typedef struct MPIR_NCCLcomm {
ncclUniqueId id;
ncclComm_t ncclcomm;
cudaStream_t stream;
} MPIR_NCCLcomm;
#endif /*ENABLE_NCCL */

#ifdef ENABLE_RCCL
Author

Extend the internal CCL abstraction to support RCCL, mirroring the structure used for NCCL.
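For reference, a sketch of the RCCL counterpart mirroring the NCCL struct above (the exact field names in this PR may differ; RCCL ships the nccl-prefixed id/comm types, while the stream becomes a HIP stream):

#ifdef ENABLE_RCCL
typedef struct MPIR_RCCLcomm {
    ncclUniqueId id;        /* RCCL reuses the nccl-prefixed types from rccl.h */
    ncclComm_t ncclcomm;
    hipStream_t stream;     /* HIP stream in place of NCCL's CUDA stream */
} MPIR_RCCLcomm;
#endif /* ENABLE_RCCL */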

break;
#endif

#ifdef ENABLE_RCCL
Author

Add RCCL to the backend selection logic for Allreduce. If CVAR specifies rccl or if tuning logic selects it, the RCCL Allreduce function is called.
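Roughly, the selection might read as follows. This is a sketch only: constant and function names beyond those visible in this PR are assumptions, and the auto handling reflects the open question raised in the next comment.

switch (ccl) {
#ifdef ENABLE_NCCL
    case MPIR_CVAR_ALLREDUCE_CCL_nccl:
        mpi_errno = MPIR_NCCL_Allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr, errflag);
        break;
#endif
#ifdef ENABLE_RCCL
    case MPIR_CVAR_ALLREDUCE_CCL_rccl:
        mpi_errno = MPIR_RCCL_Allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr, errflag);
        break;
#endif
    case MPIR_CVAR_ALLREDUCE_CCL_auto:
        /* TODO (per review): decide how "auto" picks a CCL when both NCCL and
         * RCCL are compiled in; there can be only one auto case. */
    default:
        goto fallback;
}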

Contributor

Can you add a comment at the case MPIR_CVAR_ALLREDUCE_CCL_auto: to indicate that we need to decide how to do auto for CCL? We cannot have two CCL_auto cases if anyone tries to build MPICH with both NCCL and RCCL support.

@@ -1550,6 +1550,7 @@ cvars:
CCL to use for CCL allreduce
auto - Internal algorithm selection (use the value from the .json tuning file)
nccl - Force NCCL
rccl - Force RCCL
Author

Added rccl as a valid value for the Allreduce CCL CVAR.

@@ -150,6 +150,8 @@ int get_ccl_from_string(const char *ccl_str)
ccl = MPIR_CVAR_ALLREDUCE_CCL_auto;
else if (0 == strcmp(ccl_str, "nccl"))
ccl = MPIR_CVAR_ALLREDUCE_CCL_nccl;
else if (0 == strcmp(ccl_str, "rccl"))
Author

Parse user input for backend selection. Now recognizes rccl as a valid keyword in the environment variable string.

@@ -1361,4 +1361,4 @@ void *MPIR_Csel_search(void *csel_, MPIR_Csel_coll_sig_s coll_info)

fn_exit:
return node ? node->u.cnt.container : NULL;
}
}
Author

Didn't mean to change this, oops.

Contributor

Please update the PR to remove this change

@@ -427,6 +427,8 @@ void *MPII_Create_container(struct json_object *obj)
else if (!strcmp(ckey, "algorithm=MPIR_Allreduce_intra_k_reduce_scatter_allgather"))
cnt->id =
MPII_CSEL_CONTAINER_TYPE__ALGORITHM__MPIR_Allreduce_intra_k_reduce_scatter_allgather;
else if (!strcmp(ckey, "algorithm=MPIR_Allreduce_intra_ccl"))
Author

Support RCCL-based algorithms in JSON tuning files. Recognizes MPIR_Allreduce_intra_ccl as a selectable algorithm ID, enabling runtime tuning to choose RCCL dynamically.

@@ -6,4 +6,5 @@
mpi_core_sources += \
src/util/ccl/cclcomm.c \
src/util/ccl/ccl_impl.c \
src/util/ccl/nccl.c
src/util/ccl/nccl.c \
src/util/ccl/rccl.c
Author

Include new RCCL backend source file in the build.

@@ -42,6 +46,15 @@ int MPIR_CCLcomm_free(MPIR_Comm * comm_ptr)
}
#endif

#ifdef ENABLE_RCCL
Author

Integrate the RCCL communicator with the existing communicator setup and teardown logic.
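As an illustration, the free path might gain a branch like the following sketch; the cclcomm/rcclcomm fields and the MPIR_RCCLcomm_free helper are assumed here by analogy with the NCCL branch and may not match the actual names in the PR.

#ifdef ENABLE_RCCL
    /* Hypothetical: release the RCCL communicator alongside the NCCL one. */
    if (comm_ptr->cclcomm->rcclcomm) {
        mpi_errno = MPIR_RCCLcomm_free(comm_ptr);
        MPIR_ERR_CHECK(mpi_errno);
    }
#endif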

@@ -151,6 +151,7 @@ static int MPIR_NCCL_get_datatype(MPI_Datatype dtype, ncclDataType_t * nccl_dtyp
break;
case MPIR_FLOAT16:
*nccl_dtype = ncclFloat16;
break;
Author

Added missing break statement in NCCL backend.

Contributor

Thanks for catching this!

* Copyright (C) by Argonne National Laboratory
* See COPYRIGHT in top-level directory
*/

Author

Implement the actual RCCL-based collective logic. This file is a functional mirror of nccl.c, customized for ROCm (a sketch of the core path follows this list). It contains:

  • Initialization of HIP streams and communicators.
  • Compatibility checks for operations and datatypes.
  • Core Allreduce functionality using ncclAllReduce over HIP.
  • Cleanup and error handling logic.
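A minimal sketch of the core call path, assuming the MPIR_RCCLcomm layout sketched earlier; the datatype/op translation helpers and full MPICH-style error handling are elided.

/* Hypothetical core of the RCCL Allreduce; the real file adds op/datatype
 * compatibility checks and proper MPICH error handling. */
static int rccl_allreduce_core(const void *sendbuf, void *recvbuf, size_t count,
                               ncclDataType_t rccl_dtype, ncclRedOp_t rccl_op,
                               MPIR_RCCLcomm * rcclcomm)
{
    /* RCCL exports the nccl-prefixed API; the stream is a hipStream_t. */
    if (ncclAllReduce(sendbuf, recvbuf, count, rccl_dtype, rccl_op,
                      rcclcomm->ncclcomm, rcclcomm->stream) != ncclSuccess)
        return MPI_ERR_OTHER;

    /* Block until the collective has completed on the HIP stream. */
    if (hipStreamSynchronize(rcclcomm->stream) != hipSuccess)
        return MPI_ERR_OTHER;

    return MPI_SUCCESS;
}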

@@ -4,3 +4,8 @@ allred 4 arg=-counts=10,100 arg=-oddmemtype=device arg=-evenmemtype=device env=M
allred 7 arg=-counts=10 arg=-oddmemtype=device arg=-evenmemtype=device env=MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=ccl env=MPIR_CVAR_ALLREDUCE_CCL=nccl env=MPIR_CVAR_ENABLE_GPU=1 env=MPIR_CVAR_ODD_EVEN_CLIQUES=0 env=MPIR_CVAR_NUM_CLIQUES=1
allred2 4 arg=-oddmemtype=device arg=-evenmemtype=device env=MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=ccl env=MPIR_CVAR_ALLREDUCE_CCL=nccl env=MPIR_CVAR_ENABLE_GPU=1 env=MPIR_CVAR_ODD_EVEN_CLIQUES=0 env=MPIR_CVAR_NUM_CLIQUES=1
allred2 12 arg=-oddmemtype=device arg=-evenmemtype=device env=MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=ccl env=MPIR_CVAR_ALLREDUCE_CCL=nccl env=MPIR_CVAR_ENABLE_GPU=1 env=MPIR_CVAR_ODD_EVEN_CLIQUES=0 env=MPIR_CVAR_NUM_CLIQUES=1

Author

Extend collective tests to include RCCL.
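For example, an RCCL entry would presumably mirror the NCCL lines above with the CCL CVAR switched (hypothetical; the exact entries in this PR may differ):

allred 4 arg=-counts=10,100 arg=-oddmemtype=device arg=-evenmemtype=device env=MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=ccl env=MPIR_CVAR_ALLREDUCE_CCL=rccl env=MPIR_CVAR_ENABLE_GPU=1 env=MPIR_CVAR_ODD_EVEN_CLIQUES=0 env=MPIR_CVAR_NUM_CLIQUES=1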

Contributor

We may need some changes in Jenkins to run these new collective tests properly @yfguo

@@ -890,6 +890,14 @@ if test "$have_gpu" = "no" ; then
AC_SUBST([hip_CPPFLAGS])
AC_SUBST([hip_LDFLAGS])
AC_SUBST([hip_LIBS])

Author

Set up test environment for HIP/RCCL support.

@@ -449,6 +449,7 @@ sub LoadTests {
if ($test_opt->{gpu}) {
push @{$test_opt->{envs}}, "MPIR_CVAR_ENABLE_GPU=1";
} elsif ($_f eq "testlist.gpu" or $curdir =~ /\bcuda$/) {
# } elsif ($_f eq "testlist.gpu" or $curdir =~ /\b(?:cuda|hip)$/) {
Author

Configure runtime test environment.

Contributor

@mjwilkins18 mjwilkins18 left a comment

Great job @ggracelii! Just a few minor comments for you.

Also, please squash all of your commits into one. You can do this with an interactive rebase: https://www.freecodecamp.org/news/git-squash-commits/

fallback:
return MPIR_Allreduce_allcomm_auto(sendbuf, recvbuf, count, datatype, op, comm_ptr, errflag);
return MPIR_Allreduce_intra_recursive_doubling(sendbuf, recvbuf, count, datatype, op, comm_ptr, errflag);
Contributor

Why change the fallback from auto to recursive doubling?

Author

@ggracelii ggracelii Jul 3, 2025

MPIR_Allreduce_allcomm_auto loops back to the RCCL logic;

case MPII_CSEL_CONTAINER_TYPE__ALGORITHM__MPIR_Allreduce_intra_ccl:
           mpi_errno = MPIR_Allreduce_intra_ccl(sendbuf, recvbuf, count, datatype, op, comm_ptr,
                                                cnt->u.allreduce.intra_ccl.ccl, errflag);
           break;

If the RCCL requirements are not met, that path falls back to MPIR_Allreduce_allcomm_auto again, creating an infinite recursive loop that resulted in a segfault when I ran the benchmarks.

if (unlikely((ret) != hipSuccess)) \
goto fn_fail

#define NCCL_ERR_CHECK(ret) \
Contributor

I recommend calling this RCCL_ERR_CHECK so you do not get a redefinition error if you try to compile with both NCCL and RCCL, even though I recognize it is identical to NCCL_ERR_CHECK because RCCL reuses NCCL values like ncclSuccess.
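For illustration, the renamed macro would look like this (only the name changes, since RCCL reuses ncclSuccess):

#define RCCL_ERR_CHECK(ret) \
    if (unlikely((ret) != ncclSuccess)) \
        goto fn_fail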

ncclGetUniqueId(&(rcclcomm->id));
}

mpi_errno = MPIR_Bcast_impl(&(rcclcomm->id), sizeof(rcclcomm->id), MPIR_UINT8, 0, comm_ptr,
Contributor

I think this should just be MPIR_Bcast(). This is also wrong in the NCCL implementation.

@ggracelii ggracelii force-pushed the rccl_allreduce branch 2 times, most recently from 39fc268 to 96760f4 Compare July 3, 2025 16:10
@ggracelii ggracelii requested a review from mjwilkins18 July 3, 2025 16:43
@ggracelii ggracelii force-pushed the rccl_allreduce branch 6 times, most recently from 280463a to 38adb2b Compare July 7, 2025 18:06
configure.ac Outdated
@@ -3928,6 +3928,14 @@ if test $GPU_SUPPORT = "CUDA" ; then
fi
fi

if test $GPU_SUPPORT = "HIP" ; then
PAC_CHECK_HEADER_LIB_OPTIONAL([rccl], [rccl.h], [rccl], [ncclCommInitAll])
Contributor

This should be rcclCommInitAll instead of ncclCommInitAll, I think.

Contributor

nvm. I didn't realize that RCCL generates nccl-prefixed names.

Contributor

This line should be
PAC_CHECK_HEADER_LIB_OPTIONAL([rccl], [rccl/rccl.h], [rccl], [ncclCommInitAll],, [-D__HIP_PLATFORM_AMD__]).

@@ -890,6 +890,14 @@ if test "$have_gpu" = "no" ; then
AC_SUBST([hip_CPPFLAGS])
AC_SUBST([hip_LDFLAGS])
AC_SUBST([hip_LIBS])

if test "$pac_have_hip" = "yes" ; then
PAC_CHECK_HEADER_LIB_OPTIONAL([rccl], [rccl.h], [rccl], [ncclCommInitAll])
Contributor

As in configure.ac, this line should be PAC_CHECK_HEADER_LIB_OPTIONAL([rccl], [rccl/rccl.h], [rccl], [ncclCommInitAll],, [-D__HIP_PLATFORM_AMD__]).

@ggracelii ggracelii force-pushed the rccl_allreduce branch 3 times, most recently from 4b22c4b to fbded5b Compare July 17, 2025 20:05
@ggracelii ggracelii force-pushed the rccl_allreduce branch 4 times, most recently from 85d9ea2 to 7ad80bc Compare July 21, 2025 17:34
@ggracelii ggracelii commented Jul 21, 2025

How to Build and Run RCCL Allreduce Tests in MPICH

I tested these on 2 JLSE nodes with AMD GPUs.

Configure and Build MPICH Test Suite

cd ~/grace_mpich/test/mpi # use your MPICH directory
./autogen.sh
export HIP_PATH=/soft/compilers/rocm/rocm-6.3.2 # use your HIP path
export HIPCC=$HIP_PATH/bin/hipcc
export PATH=$HIP_PATH/bin:$PATH
./configure --with-mpi=/home/grace.li/grace_mpich/build/install  # use your install path

Build Test Programs

make -j $(nproc)

To double-check that all the binaries exist, run:

find coll/ -maxdepth 1 -type f -executable -name 'allred*'

You should see these allred executables:

coll/allred
coll/allred2
coll/allred3
coll/allred4
coll/allred5
coll/allred6
coll/allredmany
coll/allred_derived
coll/allred_float

Prepare RCCL Allreduce Test Script

cd coll

Create a file named hosts.txt with the hostnames of the nodes you're using. Sample:

amdgpu02
amdgpu03

Create a file named runtests_rccl_allred.sh with the following content:

#!/bin/bash

set -e

# path to your mpiexec
MPIEXEC=/home/grace.li/grace_mpich/build/install/bin/mpiexec

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 4 -ppn 2 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred -counts=10,100 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 7 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred -counts=10 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 4 -ppn 2 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred2 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 8 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred2 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 4 -ppn 2 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred3 -counts=1000 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 6 -ppn 3 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred4 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 8 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred5 -counts=10,1000 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 8 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred6 -counts=100000 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 6 -ppn 3 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allredmany -counts=10 -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 8 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred_derived -memtype=all

MPIR_CVAR_ENABLE_GPU=1 $MPIEXEC -f hosts.txt -n 7 -ppn 4 \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL rccl \
    ./allred_float -counts=10,10000 -memtype=all

Run Tests

Make the script executable and run it:

chmod +x runtests_rccl_allred.sh
./runtests_rccl_allred.sh

You should see "No Errors" printed in your terminal for each test if everything passes.
