Skip to content

ryanhankins/open-ofi-xccl

 
 

Repository files navigation

Open OFI XCCL

Open OFI XCCL is a plug-in which enables developers to use libfabric as a network provider while running collective communication applications based on either NVIDIA's NCCL or AMD's RCCL (ROCm Communication Collectives Library).

Overview

Machine learning frameworks running on top of NVIDIA or AMD GPUs use libraries such as NCCL and RCCL to provide standard collective communication routines for multiple GPUs across single or multiple nodes.

This project implements a plug-in that maps NCCL and RCCL's connection-oriented transport APIs to libfabric's connection-less reliable interface. This allows NCCL and RCCL applications to benefit from libfabric's transport layer services such as reliable message support and OS bypass.

Getting Started

Release version numbers that end in -aws or -xccl indicate the platform or vendor focus. While earlier versions were tested primarily on AWS EC2 instances and EFA, the current maintained version is tested and supported by HPE Slingshot.

It is recommended to check out a release tag that ends in "-xccl" rather than using a development build.

Basic Requirements

The plugin is regularly tested on Cray EX Slingshot systems.

To build the plugin, you need to have Libfabric and HWLOC installed prior to building the plugin. If you want to run the included multi-node tests, you also need an MPI Implementation installed. Each release of the plugin has a list of dependency versions in the top-level README.md file.

The plugin does not require NCCL or RCCL to be pre-installed, but obviously a NCCL or RCCL installation is required to use the plugin.

Most Libfabric providers should work with the plugin, possibly through a utility provider. The plugin generally requires Reliable datagram endpoints (FI_EP_RDM) with tagged messaging (FI_TAGGED, FI_MSG). This is similar to the requirements of most MPI implementations and a generally tested path in Libfabric. For GPUDirect RDMA support, the plugin also requires FI_HMEM support, as well as RDMA support.

Getting Help

If you have any issues building or using the package or if you think you may have found a bug, please open an issue in this repository. For additional help or support, please file a bug in the issue tracker.

Contributing

Reporting issues and sending pull requests are always welcome.

License

This library is licensed under the Apache 2.0 License.

Ownership

This project is maintained and supported by HPE Slingshot.

About

No description, website, or topics provided.

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 74.4%
  • C 16.7%
  • M4 6.8%
  • Other 2.1%