Describe your performance question
Description:
Hi there,
We are benchmarking the transfer engine on two machines, each equipped with 8 GPUs and 8 RDMA NICs (400 Gbps). Our test targets a 1-to-1 mapping between GPU 0 and NIC mlx5_0. We observed a significant performance discrepancy between CPU (DRAM) and GPU (VRAM) memory:
Test Results:
CPU RDMA (DRAM): When compiled with the default options, the benchmark allocates memory in DRAM. We achieved approximately 47 GB/s of bandwidth, which is consistent with the NIC's line rate (400 Gbps is roughly 50 GB/s).
GPU RDMA (VRAM): When compiled with -DUSE_CUDA=ON, the benchmark allocates VRAM on the GPU. In this case, the bandwidth drops to approximately 15 GB/s.
Environment & Configuration:
Hardware: 8x 400 Gbps NICs and 8x GPUs per machine.
Software config: the benchmark is pinned to mlx5_0 and gpu:0; all other parameters are kept at their defaults.
Expectation: We expected GPUDirect RDMA to achieve performance closer to the CPU RDMA result (we do get close to it when using UCCL or NIXL with the UCX backend).
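
For reference, below is a minimal standalone sketch (not the transfer engine's actual code path; the device name mlx5_0, GPU index 0, and the 64 MiB buffer size only mirror our test configuration) that we use to confirm GPUDirect RDMA registration works at the driver level. It simply registers a cudaMalloc'd buffer with ibverbs; if ibv_reg_mr fails here, the missing piece is usually the nvidia-peermem (or legacy nv_peer_mem) kernel module rather than the engine itself.

```cpp
// Standalone check: allocate VRAM with CUDA and register it with ibverbs.
// If ibv_reg_mr succeeds on a cudaMalloc'd buffer, GPUDirect RDMA is
// functional at the driver level. Device name, GPU index, and buffer size
// are just our test configuration, not anything required by the engine.
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <cstdio>
#include <cstring>

int main() {
    if (cudaSetDevice(0) != cudaSuccess) { fprintf(stderr, "cudaSetDevice failed\n"); return 1; }

    void*  gpu_buf = nullptr;
    size_t len     = 64ull << 20;  // 64 MiB test buffer
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) { fprintf(stderr, "cudaMalloc failed\n"); return 1; }

    // Open the NIC we pin the benchmark to (mlx5_0).
    int num = 0;
    ibv_device** devs = ibv_get_device_list(&num);
    ibv_context* ctx  = nullptr;
    for (int i = 0; i < num; ++i)
        if (strcmp(ibv_get_device_name(devs[i]), "mlx5_0") == 0)
            ctx = ibv_open_device(devs[i]);
    if (!ctx) { fprintf(stderr, "mlx5_0 not found\n"); return 1; }

    ibv_pd* pd = ibv_alloc_pd(ctx);
    ibv_mr* mr = ibv_reg_mr(pd, gpu_buf, len,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_READ |
                            IBV_ACCESS_REMOTE_WRITE);
    printf("GPU buffer registration %s\n",
           mr ? "succeeded (GPUDirect RDMA available)"
              : "failed (check that nvidia-peermem is loaded)");
    return mr ? 0 : 1;
}
```

If this registration succeeds but the engine still only reaches ~15 GB/s, the bottleneck is presumably elsewhere (e.g., topology or engine-side tuning).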
Questions:
Are there specific compilation flags or runtime environment variables (e.g., UCX, NCCL, or PeerDirect settings) required to fully enable GPUDirect RDMA optimizations?
Does the transfer engine require specific PCIe topology awareness to avoid the observed 15 GB/s bottleneck?
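
Regarding the second question, a rough standalone way to check the PCIe placement ourselves is to compare the PCI addresses of gpu:0 and mlx5_0 (nvidia-smi topo -m reports the same affinity information); if the two devices are not under the same PCIe switch or root complex, peer-to-peer traffic crossing the root complex could explain reduced GPUDirect bandwidth. A sketch:

```cpp
// Print the PCIe addresses of gpu:0 and mlx5_0 so we can confirm they sit
// under the same PCIe switch / root complex. The sysfs path assumes the NIC
// is named mlx5_0, matching our configuration.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    char gpu_bus_id[64] = {0};
    if (cudaDeviceGetPCIBusId(gpu_bus_id, sizeof(gpu_bus_id), 0) != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetPCIBusId failed\n");
        return 1;
    }
    printf("gpu:0  -> %s\n", gpu_bus_id);

    // The NIC's PCI address is the target of the sysfs symlink
    // /sys/class/infiniband/mlx5_0/device.
    printf("mlx5_0 -> ");
    fflush(stdout);
    return system("basename $(readlink -f /sys/class/infiniband/mlx5_0/device)");
}
```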
We are looking forward to your insights or any advice on tuning GPU RDMA performance. Thanks!
Before submitting a new issue...
- Make sure you already searched for relevant issues and read the documentation