Releases: ROCm/rccl

therock-7.11

11 Feb 14:09
4b295c9

therock release v7.11

therock-7.10

11 Dec 17:31
d23d18f

therock release v7.10

RCCL 2.27.7 for ROCm 7.2.0

21 Jan 18:57
0d2c4fd

Changed

  • RCCL error messages are now more verbose in several cases, and fatal error messages are printed by default. Fatal error messages can be suppressed by setting NCCL_DEBUG=NONE.
  • Disabled reduceCopyPacks pipelining for gfx950.
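Since fatal errors are now printed by default, suppressing them is a one-line environment setting before launching the job. A minimal sketch:

```shell
# Suppress RCCL's fatal error messages (printed by default in this release).
# Any other NCCL_DEBUG level (e.g. WARN, INFO) yields more verbose output instead.
export NCCL_DEBUG=NONE
```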

RCCL 2.27.7 for ROCm 7.1.1

26 Nov 06:41
bf3ebf5

Resolved issues

  • Fixed a single-node data corruption issue in MSCCL on the Instinct MI350X and MI355X for the LL protocol. This previously affected about 2% of runs for single-node AllReduce with inputs smaller than 512 KiB.

RCCL 2.27.7 for ROCm 7.1.0

30 Oct 05:22
22e3a85

Added

  • Added RCCL_P2P_BATCH_THRESHOLD to set the message size limit for batching P2P operations. This mainly affects small message performance for alltoall at a large scale but also applies to alltoallv.
  • Added RCCL_P2P_BATCH_ENABLE to enable batched P2P operations, which improves performance for smaller messages (up to 4 MB) in alltoall when the workload requires it. Batching is opt-in to avoid performance dips for larger messages.
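A hedged sketch of wiring up these two variables before launching a job. The values are illustrative only, and interpreting the threshold in bytes is an assumption here, not something these notes state:

```shell
# Illustrative values only: enable P2P batching and set the batch threshold.
# Treating RCCL_P2P_BATCH_THRESHOLD as a byte count is an assumption.
export RCCL_P2P_BATCH_ENABLE=1
export RCCL_P2P_BATCH_THRESHOLD=$((4 * 1024 * 1024))   # 4 MiB
```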

Changed

  • The MSCCL++ feature is now disabled by default. The --disable-mscclpp build flag is replaced with --enable-mscclpp in the rccl/install.sh script.
  • Compatibility with NCCL 2.27.7
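Since MSCCL++ is now off by default, opting back in happens at build time. A minimal sketch, assuming you are in the rccl source tree (the guard makes it a no-op elsewhere):

```shell
# MSCCL++ is now disabled by default; pass --enable-mscclpp to turn it on.
# Guarded so this sketch does nothing outside the rccl source tree.
if [ -x ./install.sh ]; then
    ./install.sh --enable-mscclpp
fi
```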

Resolved issues

  • Improved small-message performance for alltoall by enabling and optimizing batched P2P operations.

Known issues

  • Symmetric memory kernels are currently disabled due to ongoing CUMEM enablement work.

RCCL 2.26.6 for ROCm 7.0.2

10 Oct 12:09
01dfdac

Added

  • Enabled double-buffering in reduceCopyPacks to trigger pipelining, especially to overlap bf16 arithmetic.
  • Added --force-reduce-pipeline as an option that can be passed to the install.sh script. Passing this option enables software-triggered pipelining of bfloat16 reductions (i.e., all_reduce, reduce_scatter, and reduce).
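A sketch of passing the pipelining option at build time, assuming the rccl source tree (guarded so it is a no-op elsewhere):

```shell
# Opt in to software-triggered pipelining of bfloat16 reductions at build time.
if [ -x ./install.sh ]; then
    ./install.sh --force-reduce-pipeline
fi
```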

RCCL 2.26.6 for ROCm 7.0.1

17 Sep 16:41
ed65777

RCCL code for ROCm 7.0.1 did not change. The library was rebuilt for the updated ROCm 7.0.1 stack.

RCCL 2.26.6 for ROCm 7.0.0

16 Sep 06:37
ed65777

Resolved issues

  • Resolved an issue with using more than 64 channels when multiple collectives are issued in the same ncclGroup() call.
  • Fixed unit test failures in tests ending with ManagedMem and ManagedMemGraph suffixes.
  • Fixed a suboptimal algorithmic switching point for AllReduce on MI300X.
  • Fixed the known issue "When splitting a communicator using ncclCommSplit in some GPU configurations, MSCCL initialization can cause a segmentation fault." The design was changed to key mscclStatus by comm instead of by rank. The global comm-to-mscclStatus map is still not thread-safe; reads and writes must be explicitly guarded by mutexes. This has been tested for correctness, and a thread-safe map data structure is planned in upcoming changes.

Added

  • Added support for extended fine-grained system memory pool.
  • Added new GPU target gfx950.
  • Added support for unroll=1 in device-code generation to improve performance.
  • Set a default of 112 channels for a single node with 8 * gfx950.
  • Enabled LL128 protocol on gfx950.
  • Added the ability to choose the unroll factor at runtime via RCCL_UNROLL_FACTOR, which can be set to 1, 2, or 4. This change currently increases compilation and linking time because it triples the number of kernels generated.
  • Added MSCCL support for AllGather multinode gfx942/gfx950 (i.e., 16 and 32 GPUs). To enable, set the environment variable RCCL_MSCCL_FORCE_ENABLE=1. Max message size for MSCCL AllGather usage is 12292 * sizeof(datatype) * nGPUs.
  • Thread thresholds for LL/LL128 are now selected by the tuning models for the MI300X. This impacts the number of channels used for AG and RS. The channel tuning model is bypassed if NCCL_THREAD_THRESHOLDS, NCCL_MIN_NCHANNELS, or NCCL_MAX_NCHANNELS is set.
  • Added multi-node tuning for AllGather, AllReduce, and ReduceScatter that leverages the LL/LL64/LL128 protocols to use nontemporal vector loads/stores for tunable message-size ranges.
  • LL/LL128 usage ranges for AR, AG, and RS are part of the tuning models, which enable architecture-specific tuning in conjunction with the existing Rome Models scheme in RCCL.
  • Two new APIs are exposed as part of an initiative to separate RCCL code. These APIs are rcclGetAlgoInfo and rcclFuncMaxSendRecvCount. However, user-level invocation requires that RCCL be built with RCCL_EXPOSE_STATIC enabled.
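The MSCCL AllGather size cap above is a simple formula, so it can be computed up front when sizing workloads. A sketch using the formula from these notes, with fp16 on 16 GPUs as an illustrative case:

```shell
# Max MSCCL AllGather message size = 12292 * sizeof(datatype) * nGPUs
# (from the release notes). Example: fp16 (2 bytes) across 16 GPUs.
DTYPE_SIZE=2
NGPUS=16
MAX_BYTES=$((12292 * DTYPE_SIZE * NGPUS))
echo "MSCCL AllGather cap: $MAX_BYTES bytes"   # 393344

# Opting in to MSCCL for multinode AllGather, per the notes above:
export RCCL_MSCCL_FORCE_ENABLE=1
```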

Changed

  • Compatibility with NCCL 2.23.4
  • Compatibility with NCCL 2.24.3
  • Compatibility with NCCL 2.25.1
  • Compatibility with NCCL 2.26.6

RCCL 2.22.3 for ROCm 6.4.4

24 Sep 14:01

RCCL code for ROCm 6.4.4 did not change. The library was rebuilt for the updated ROCm 6.4.4 stack.

RCCL 2.22.3 for ROCm 6.4.3

07 Aug 14:20

RCCL code for ROCm 6.4.3 did not change. The library was rebuilt for the updated ROCm 6.4.3 stack.