Describe your performance question
Description:
Hi there,
We are benchmarking the transfer engine on two machines, each equipped with 8 GPUs and 8 RDMA NICs (400 Gbps). Our test targets a 1-to-1 mapping between GPU 0 and NIC mlx5_0. We observed a significant performance discrepancy between CPU (DRAM) and GPU (VRAM) memory:
Test Results:
CPU RDMA (DRAM): When compiled with the default options, the benchmark allocates memory in DRAM. We achieved approximately 47 GB/s of bandwidth, which is consistent with the NIC's line rate (400 Gbps is roughly 50 GB/s).
GPU RDMA (VRAM): When compiled with -DUSE_CUDA=ON, the benchmark allocates VRAM on the GPU. In this case, the bandwidth drops to approximately 15 GB/s.
Environment & Configuration:
Hardware: 8x 400 Gbps NICs and 8x GPUs per machine.
Software config: the benchmark is pinned to mlx5_0 and gpu:0; all other parameters are kept at their defaults.
Expectation: We expected GPUDirect RDMA to achieve performance closer to the CPU RDMA result (we do get close to it when using UCCL or NIXL with the UCX backend).
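
For reference, below is a minimal standalone sketch (not the transfer engine's actual code path; the device name mlx5_0, GPU index 0, and the 64 MiB buffer size only mirror our test configuration) that we use to confirm GPUDirect RDMA registration works at the driver level. It simply registers a cudaMalloc'd buffer with ibverbs; if ibv_reg_mr fails here, the missing piece is usually the nvidia-peermem (or legacy nv_peer_mem) kernel module rather than the engine itself.

```cpp
// Standalone check: allocate VRAM with CUDA and register it with ibverbs.
// If ibv_reg_mr succeeds on a cudaMalloc'd buffer, GPUDirect RDMA is
// functional at the driver level. Device name, GPU index, and buffer size
// are just our test configuration, not anything required by the engine.
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <cstdio>
#include <cstring>

int main() {
    if (cudaSetDevice(0) != cudaSuccess) { fprintf(stderr, "cudaSetDevice failed\n"); return 1; }

    void*  gpu_buf = nullptr;
    size_t len     = 64ull << 20;  // 64 MiB test buffer
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) { fprintf(stderr, "cudaMalloc failed\n"); return 1; }

    // Open the NIC we pin the benchmark to (mlx5_0).
    int num = 0;
    ibv_device** devs = ibv_get_device_list(&num);
    ibv_context* ctx  = nullptr;
    for (int i = 0; i < num; ++i)
        if (strcmp(ibv_get_device_name(devs[i]), "mlx5_0") == 0)
            ctx = ibv_open_device(devs[i]);
    if (!ctx) { fprintf(stderr, "mlx5_0 not found\n"); return 1; }

    ibv_pd* pd = ibv_alloc_pd(ctx);
    ibv_mr* mr = ibv_reg_mr(pd, gpu_buf, len,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_READ |
                            IBV_ACCESS_REMOTE_WRITE);
    printf("GPU buffer registration %s\n",
           mr ? "succeeded (GPUDirect RDMA available)"
              : "failed (check that nvidia-peermem is loaded)");
    return mr ? 0 : 1;
}
```

If this registration succeeds but the engine still only reaches ~15 GB/s, the bottleneck is presumably elsewhere (e.g., topology or engine-side tuning).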
Questions:
Are there specific compilation flags or runtime environment variables (e.g., UCX, NCCL, or PeerDirect settings) required to fully enable GPUDirect RDMA optimizations?
Does the transfer engine require specific PCIe topology awareness to avoid the observed 15 GB/s bottleneck?
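
Regarding the second question, a rough standalone way to check the PCIe placement ourselves is to compare the PCI addresses of gpu:0 and mlx5_0 (nvidia-smi topo -m reports the same affinity information); if the two devices are not under the same PCIe switch or root complex, peer-to-peer traffic crossing the root complex could explain reduced GPUDirect bandwidth. A sketch:

```cpp
// Print the PCIe addresses of gpu:0 and mlx5_0 so we can confirm they sit
// under the same PCIe switch / root complex. The sysfs path assumes the NIC
// is named mlx5_0, matching our configuration.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    char gpu_bus_id[64] = {0};
    if (cudaDeviceGetPCIBusId(gpu_bus_id, sizeof(gpu_bus_id), 0) != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetPCIBusId failed\n");
        return 1;
    }
    printf("gpu:0  -> %s\n", gpu_bus_id);

    // The NIC's PCI address is the target of the sysfs symlink
    // /sys/class/infiniband/mlx5_0/device.
    printf("mlx5_0 -> ");
    fflush(stdout);
    return system("basename $(readlink -f /sys/class/infiniband/mlx5_0/device)");
}
```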
We are looking forward to your insights or any advice on tuning GPU RDMA performance. Thanks!
Before submitting a new issue...
- Make sure you already searched for relevant issues and read the documentation