You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
* Support pipelining codegen and template specialization
* Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16)
* Remove need for FUNC_INDEX_TOTAL
* Add pipeline field to device function key construction logic
* Avoid unneeded codegen for LL/LL64 kernels
* Modify conditions and add pipeline dtypes env
* Optimize selection for both gfx942 and gfx950
* Increase pipeline bitfield width
* Use __forceinline__ for all device functions
* Realign reduceCopy with original form
* Add opt-out option to enable perf debugs
* Remove force-reduce-pipelining option from README
* Update CHANGELOG.md
---------
Co-authored-by: Jeffrey Novotny <[email protected]>
Copy file name to clipboardExpand all lines: CHANGELOG.md
+2-2Lines changed: 2 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -22,10 +22,10 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https:
22
22
* Multi-node tuning for AllGather, AllReduce, and ReduceScatter that leverages LL/LL64/LL128 protocol to use nontemporal vector load/store for tunable message size ranges.
23
23
* LL/LL128 usage ranges for AR, AG, and RS are part of the tuning models, which enable architecture-specific tuning in conjunction with the existing Rome Models scheme in RCCL.
24
24
* Two new APIs are exposed as part of an initiative to separate RCCL code. These APIs are `rcclGetAlgoInfo` and `rcclFuncMaxSendRecvCount`. However, user-level invocation requires that RCCL be built with `RCCL_EXPOSE_STATIC` enabled.
25
-
* Enabled double-buffering in `reduceCopyPacks` to trigger pipelining, especially to overlap bf16 arithmetic.
26
-
* Added `--force-reduce-pipeline` as an option that can be passed to the `install.sh` script. Passing this option will enable software-triggered pipelining `bfloat16` reductions (i.e. `all_reduce`, `reduce_scatter` and `reduce`).
25
+
* Enabled double-buffering in `reduceCopyPacks` to trigger pipelining, especially to overlap `bf16` arithmetic and bridge the gap between `fp32` performance and `bf16` for both `gfx942` and `gfx950`. Pipelining has been made tunable via `rcclSetPipelining`, similar to algorithms/protocols so that regression is avoided in certain message sizes.
27
26
* Added a direct allgather algorithm. This is enabled by default for multi-node if there are 16 nodes or fewer. The message size threshold is 4MB.
# TODO: enable bf16 pipelining by default upon having the pipelined/scalar switching feature
854
-
# if (FORCE_REDUCE_PIPELINING AND (SOURCE_FILE MATCHES "gensrc/reduce_.*" OR SOURCE_FILE MATCHES "gensrc/reduce_scatter_.*" OR SOURCE_FILE MATCHES "gensrc/all_reduce_.*"))
855
-
# message(STATUS "RCCL_ENABLE_SW_PIPELINE enabled for ${SOURCE_FILE}")
-t|--tests_build Build rccl unit tests, but do not run
65
65
--time-trace Plot the build time of RCCL (requires `ninja-build` package installed on the system)
66
66
--verbose Show compile commands
67
-
--force-reduce-pipeline Force reduce_copy sw pipeline to be used for every reduce-based collectives and datatypes
68
67
```
69
68
70
69
By default, RCCL builds for all GPU targets defined in `DEFAULT_GPUS` in `CMakeLists.txt`. To target specific GPU(s), and potentially reduce build time, use `--amdgpu_targets` as a `;` separated string listing GPU(s) to target.
perl -pi -e 's/(template<typename T, typename RedOp(?:, typename Proto)?)(, bool isNetOffload.*?)?>/\1, int USE_ACC, int COLL_UNROLL\2>/g'"$HIP_FILE"
24
+
perl -pi -e 's/(template<typename T, typename RedOp(?:, typename Proto)?)(, bool isNetOffload.*?)?>/\1, int USE_ACC, int COLL_UNROLL, int Pipeline\2>/g'"$HIP_FILE"
25
+
perl -pi -e 's/(template<typename T, typename RedOp(?:, typename Proto)?(?:, int RCCLMetadata)?)(, bool isNetOffload.*?)?>/\1, int USE_ACC, int COLL_UNROLL, int Pipeline\2>/g'"$HIP_FILE"
0 commit comments