
[Performance]: Low GPUDirect RDMA Bandwidth (~15 GB/s) compared to CPU RDMA (~47 GB/s). #1459

@UnpureRationalist

Describe your performance question

Description:

Hi there,

We are benchmarking the transfer engine on two machines, each equipped with 8 GPUs and 8 RDMA NICs (400 Gbps). Our test configuration targets a 1-to-1 mapping using GPU 0 and NIC mlx5_0. We observed a significant performance discrepancy between CPU and GPU memory settings:

Test Results:

CPU RDMA (DRAM): When compiled with default options, the benchmark allocates memory in DRAM. We achieved a bandwidth of approximately 47 GB/s, which is consistent with the hardware's line rate.

GPU RDMA (VRAM): When compiled with -DUSE_CUDA=ON, the benchmark allocates VRAM on the GPU. In this case, the bandwidth drops to approximately 15 GB/s.

Environment & Configuration:

Hardware: 8x 400Gbps NICs, 8x GPUs.

Software Config: Benchmark pinned to mlx5_0 and gpu:0. All other parameters are kept at default.

Expectation: We expected GPUDirect RDMA to achieve performance closer to the CPU RDMA results (we achieve comparable results when using UCCL, or NIXL with the UCX backend).
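As a side note, one way to sanity-check whether the pinned NIC and GPU actually share a PCIe switch is to inspect the `nvidia-smi topo -m` matrix. The snippet below is only a sketch: the matrix is an illustrative sample, not our real output, and on a live host you would pipe in the real command instead.

```shell
# Parse a GPU/NIC topology matrix for the GPU0 <-> mlx5_0 link type.
# NOTE: sample_topo is an illustrative SAMPLE of `nvidia-smi topo -m`
# output, not real data; on a live host, run the command directly.
sample_topo='        GPU0    mlx5_0
GPU0     X      PIX
mlx5_0  PIX     X'

# Skip the header row (NR > 1), then take the mlx5_0 column of the GPU0 row.
link=$(printf '%s\n' "$sample_topo" | awk 'NR > 1 && $1 == "GPU0" {print $3}')
echo "GPU0 <-> mlx5_0 link: $link"

# PIX/PXB (same PCIe switch/bridge) is what GPUDirect RDMA wants;
# PHB/SYS (traffic crossing the host bridge or socket interconnect)
# commonly caps bandwidth well below line rate.
case "$link" in
  PIX|PXB) echo "OK: NIC and GPU share a PCIe switch" ;;
  *)       echo "WARN: cross-bridge path, expect reduced GDR bandwidth" ;;
esac
```

In our case the pairing of gpu:0 with mlx5_0 was chosen assuming such locality.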

Questions:

Are there specific compilation flags or runtime environment variables (e.g., UCX, NCCL, or PeerDirect settings) required to fully enable GPUDirect RDMA optimizations?

Does the transfer engine require specific PCIe topology awareness to avoid the observed 15 GB/s bottleneck?

We would appreciate any insights or advice on tuning GPU RDMA performance. Thanks!
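For context on the topology question: one frequent cause of GPUDirect RDMA landing around a fraction of line rate is PCIe ACS being enabled on a bridge, which forces peer-to-peer traffic through the root complex. The check below is a sketch; the `ACSCtl` line is an illustrative sample, and on a real host you would inspect `sudo lspci -vvv` output for the bridges between the GPU and the NIC.

```shell
# Check whether PCIe ACS is enabled on a bridge, a common cause of
# degraded GPUDirect RDMA bandwidth (P2P traffic gets redirected
# through the root complex).
# NOTE: sample_lspci is an illustrative SAMPLE of one `lspci -vvv`
# ACSCtl line, not real data; on a live host, run:
#   sudo lspci -vvv | grep -i acsctl
sample_lspci='ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-'

# A '+' after SrcValid (and the redirect capabilities) means ACS is active.
if printf '%s\n' "$sample_lspci" | grep -q 'SrcValid+'; then
  echo "ACS enabled: P2P redirection active, GDR bandwidth may suffer"
else
  echo "ACS disabled on this bridge"
fi
```

If ACS turns out to be enabled, disabling it (or enabling ATS where supported) is the usual remedy suggested in NVIDIA's GPUDirect RDMA documentation.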

Before submitting a new issue...

  • Make sure you already searched for relevant issues and read the documentation
