Releases: ROCm/rccl
therock-7.11
therock release v7.11
therock-7.10
therock release v7.10
RCCL 2.27.7 for ROCm 7.2.0
Changed
- RCCL error messages have been made more verbose in several cases. RCCL now prints out fatal error messages by default. Fatal error messages can be suppressed by setting NCCL_DEBUG=NONE.
- Disabled reduceCopyPacks pipelining for gfx950.
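As an illustration of the NCCL_DEBUG=NONE setting described above, the following minimal sketch builds an environment for a child process that suppresses fatal error messages. The helper name launch_env and the application path in the usage note are hypothetical; only the variable name and the value NONE come from the release note.

```python
import os


def launch_env(suppress_fatal=False, base=None):
    """Return a copy of the environment for launching an RCCL application.

    launch_env is a hypothetical helper, not part of RCCL. When
    suppress_fatal is True it sets NCCL_DEBUG=NONE, which the release
    note says suppresses the fatal error messages printed by default.
    """
    env = dict(os.environ if base is None else base)
    if suppress_fatal:
        env["NCCL_DEBUG"] = "NONE"
    return env
```

The returned dictionary could then be passed to a launcher, for example subprocess.run(["./my_rccl_app"], env=launch_env(suppress_fatal=True)), where ./my_rccl_app is a placeholder.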
RCCL 2.27.7 for ROCm 7.1.1
Resolved issues
- Fixed a single node data corruption issue in MSCCL on the Instinct MI350X and MI355X for the LL protocol. This previously affected about 2% of the runs for single node AllReduce with inputs smaller than 512 KiB.
RCCL 2.27.7 for ROCm 7.1.0
Added
- Added RCCL_P2P_BATCH_THRESHOLD to set the message size limit for batching P2P operations. This mainly affects small message performance for alltoall at a large scale but also applies to alltoallv.
- Added RCCL_P2P_BATCH_ENABLE to enable batching P2P operations to receive performance gains for smaller messages up to 4MB for alltoall when the workload requires it. This is to avoid performance dips for larger messages.
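The two variables above could be combined along the following lines. This is a sketch under stated assumptions, not documented RCCL behavior: it assumes that a value of 1 enables batching, that the threshold is given as a byte count, and that "4MB" means 4 MiB. The helper p2p_batch_env is hypothetical.

```python
FOUR_MIB = 4 * 1024 * 1024  # assumption: the "4MB" in the release note read as 4 MiB


def p2p_batch_env(max_message_bytes, threshold_bytes=None):
    """Return environment variables for batched P2P, or {} for large messages.

    Hypothetical helper. Assumes RCCL_P2P_BATCH_ENABLE=1 turns batching on
    and RCCL_P2P_BATCH_THRESHOLD takes a size in bytes; neither value
    format is stated in the release note.
    """
    env = {}
    # Only opt in when the workload's alltoall messages are small, since the
    # release note warns of performance dips for larger messages.
    if max_message_bytes <= FOUR_MIB:
        env["RCCL_P2P_BATCH_ENABLE"] = "1"
        if threshold_bytes is not None:
            env["RCCL_P2P_BATCH_THRESHOLD"] = str(threshold_bytes)
    return env
```

Leaving batching off for workloads dominated by large messages mirrors the note's stated reason for making the feature opt-in.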
Changed
- The MSCCL++ feature is now disabled by default. The --disable-mscclpp build flag is replaced with --enable-mscclpp in the rccl/install.sh script.
- Compatibility with NCCL 2.27.7
Resolved issues
- Improved small message performance for alltoall by enabling and optimizing batched P2P operations.
Known issues
- Symmetric memory kernels are currently disabled due to ongoing CUMEM enablement work.
RCCL 2.26.6 for ROCm 7.0.2
Added
- Enabled double-buffering in reduceCopyPacks to trigger pipelining, especially to overlap bf16 arithmetic.
- Added --force-reduce-pipeline as an option that can be passed to the install.sh script. Passing this option will enable software-triggered pipelining of bfloat16 reductions (i.e. all_reduce, reduce_scatter and reduce).
rccl 2.26.6 for ROCm 7.0.1
RCCL code for ROCm 7.0.1 did not change. The library was rebuilt for the updated ROCm 7.0.1 stack.
RCCL 2.26.6 for ROCm 7.0.0
Resolved issues
- Resolved an issue when using more than 64 channels when multiple collectives are used in the same ncclGroup() call.
- Fixed unit test failures in tests ending with ManagedMem and ManagedMemGraph suffixes.
- Fixed a suboptimal algorithmic switching point for AllReduce on MI300X.
- Fixed the known issue "When splitting a communicator using ncclCommSplit in some GPU configurations, MSCCL initialization can cause a segmentation fault." with a design change to use comm instead of rank for mscclStatus. The global map for comm to mscclStatus is still not thread-safe but should be explicitly handled by mutexes for reads and writes. This is tested for correctness, but there is a plan to use a thread-safe map data structure in upcoming changes.
Added
- Added support for extended fine-grained system memory pool.
- Added new GPU target gfx950.
- Added support for unroll=1 in device-code generation to improve performance.
- Set a default of 112 channels for a single node with 8 * gfx950.
- Enabled LL128 protocol on gfx950.
- Added the ability to choose the unroll factor at runtime via RCCL_UNROLL_FACTOR. This can be set at runtime to 1, 2, or 4. This change currently increases compilation and linking time because it triples the number of kernels generated.
- Added MSCCL support for AllGather multinode gfx942/gfx950 (i.e., 16 and 32 GPUs). To enable, set the environment variable RCCL_MSCCL_FORCE_ENABLE=1. Max message size for MSCCL AllGather usage is 12292 * sizeof(datatype) * nGPUs.
- Thread thresholds for LL/LL128 are selected in Tuning Models for the MI300X. This impacts the number of channels used for AG and RS. Channel tuning model is bypassed if NCCL_THREAD_THRESHOLDS, NCCL_MIN_NCHANNELS, or NCCL_MAX_NCHANNELS are set.
- Multi-node tuning for AllGather, AllReduce, and ReduceScatter that leverages LL/LL64/LL128 protocol to use nontemporal vector load/store for tunable message size ranges.
- LL/LL128 usage ranges for AR, AG, and RS are part of the tuning models, which enable architecture-specific tuning in conjunction with the existing Rome Models scheme in RCCL.
- Two new APIs are exposed as part of an initiative to separate RCCL code. These APIs are rcclGetAlgoInfo and rcclFuncMaxSendRecvCount. However, user-level invocation requires that RCCL be built with RCCL_EXPOSE_STATIC enabled.
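Several items in the list above reduce to simple arithmetic or checks on environment variables. The sketch below works through the MSCCL AllGather size limit (12292 * sizeof(datatype) * nGPUs), the allowed RCCL_UNROLL_FACTOR values, and the conditions under which the channel tuning model is bypassed. All function names and the datatype size table are hypothetical helpers for illustration. Treating a variable as "set" when it is merely present in the environment is an assumption.

```python
# Element sizes in bytes for a few common datatypes (illustrative table).
DTYPE_BYTES = {"bfloat16": 2, "float16": 2, "float32": 4, "float64": 8}

# Variables that, per the release note, bypass the channel tuning model.
TUNING_OVERRIDES = (
    "NCCL_THREAD_THRESHOLDS",
    "NCCL_MIN_NCHANNELS",
    "NCCL_MAX_NCHANNELS",
)


def msccl_allgather_max_bytes(dtype, n_gpus):
    """Max MSCCL AllGather message size: 12292 * sizeof(datatype) * nGPUs."""
    return 12292 * DTYPE_BYTES[dtype] * n_gpus


def unroll_factor_env(factor):
    """Return the RCCL_UNROLL_FACTOR setting, rejecting unsupported values."""
    if factor not in (1, 2, 4):
        raise ValueError("RCCL_UNROLL_FACTOR must be 1, 2, or 4")
    return {"RCCL_UNROLL_FACTOR": str(factor)}


def channel_tuning_bypassed(env):
    """True if any override variable is present (assumed meaning of 'set')."""
    return any(name in env for name in TUNING_OVERRIDES)


if __name__ == "__main__":
    # float32 on 16 GPUs: 12292 * 4 * 16 = 786688 bytes
    print(msccl_allgather_max_bytes("float32", 16))
    print(channel_tuning_bypassed({"NCCL_MIN_NCHANNELS": "32"}))
```

For example, with float32 on 16 GPUs the limit works out to 786688 bytes, which is the same as bfloat16 on 32 GPUs.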
Changed
- Compatibility with NCCL 2.23.4
- Compatibility with NCCL 2.24.3
- Compatibility with NCCL 2.25.1
- Compatibility with NCCL 2.26.6
rccl 2.22.3 for ROCm 6.4.4
RCCL code for ROCm 6.4.4 did not change. The library was rebuilt for the updated ROCm 6.4.4 stack.
rccl 2.22.3 for ROCm 6.4.3
RCCL code for ROCm 6.4.3 did not change. The library was rebuilt for the updated ROCm 6.4.3 stack.