[Help Wanted] High-performance send/recv implementation for Mooncake PG #1421

@UNIDY2002

Motivation

Mooncake PG is a torch.distributed backend built on top of Mooncake TE, providing a full set of collective communication primitives. While most collectives are optimized for performance, the current implementation of the point-to-point primitives (torch.distributed.send / torch.distributed.recv) is significantly under-optimized and has become a performance bottleneck in practice.

We are looking for help to design and implement a high-performance send/recv path for Mooncake PG.

Improving send/recv performance directly unlocks better scalability and responsiveness for elastic MoE and LLM serving systems, where dynamic data movement is on the critical path.

Use Case

A representative use case is Expert Parallel Load Balancing (EPLB) in LLM serving. In this workflow, expert parameters are dynamically exchanged across ranks using torch.distributed.batch_isend_irecv, which relies heavily on efficient send/recv under the hood.

Typical characteristics:

  • Frequent point-to-point transfers
  • Large payloads (up to 100 MB per message)
  • Sensitivity to bandwidth

Example: sgl-project/sglang#12068 relies on high-performance send/recv in Mooncake.
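
For readers unfamiliar with the call pattern, here is a minimal sketch of how an EPLB-style exchange might drive send/recv through torch.distributed.batch_isend_irecv. The ring pattern and the names ring_peers and exchange_expert_weights are hypothetical and chosen only for illustration; real EPLB traffic follows whatever rebalancing plan the serving system computes. The sketch assumes the process group is already initialized and that both buffers are CUDA tensors of the same shape and dtype.

```python
"""Sketch of an EPLB-style expert exchange over torch.distributed P2P ops."""


def ring_peers(rank: int, world_size: int) -> tuple[int, int]:
    """Return (send_to, recv_from) for a simple ring exchange.

    The ring is a hypothetical traffic pattern used only for illustration.
    """
    return (rank + 1) % world_size, (rank - 1) % world_size


def exchange_expert_weights(send_buf, recv_buf):
    """Post one isend and one irecv as a single batch, then wait for both.

    Assumes an initialized process group and same-shaped CUDA tensors.
    """
    # Imported lazily so that ring_peers can be used without torch installed.
    import torch.distributed as dist

    send_to, recv_from = ring_peers(dist.get_rank(), dist.get_world_size())
    ops = [
        dist.P2POp(dist.isend, send_buf, send_to),
        dist.P2POp(dist.irecv, recv_buf, recv_from),
    ]
    # batch_isend_irecv returns one request handle per posted operation.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_buf
```

Each P2POp in the batch ends up on the backend's send/recv path, which is why the throughput of that path bounds the whole exchange.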

Constraints / Assumptions

  • Tensors are CUDA tensors
  • The transfer engine already supports GPUDirect RDMA
  • Send/recv is largely independent from other collective implementations
    • Only the transfer-engine initialization logic is actually shared

Notes for Contributors

  • Please do not limit designs to the current implementation, which is known to be inefficient
  • Alternative designs, data paths, or abstractions are all welcome

Where to Start

If you’re interested in contributing, the following entry points may be helpful:

  • MooncakeBackend::send, MooncakeBackend::recv implementations in mooncake_backend.cpp
  • PR [EP] Implement send/recv #1236, which first introduced send/recv.
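
When comparing candidate designs, a simple baseline measurement may be useful. The sketch below times blocking send/recv between ranks 0 and 1 for a 100 MB payload and reports effective bandwidth. It is not part of Mooncake; the function names are hypothetical, and it assumes an initialized process group with at least two ranks, each with its CUDA device already selected.

```python
import time


def bandwidth_gb_per_s(num_bytes: int, seconds: float) -> float:
    """Effective bandwidth in GB/s, using 1 GB = 1e9 bytes."""
    return num_bytes / seconds / 1e9


def time_send_recv(payload_bytes: int = 100 * 1024 * 1024, iters: int = 10) -> float:
    """Time blocking send/recv from rank 0 to rank 1 and return GB/s.

    Assumes an initialized process group with at least two ranks.
    Ranks other than 0 and 1 do not take part in the transfer.
    """
    # Imported lazily so that bandwidth_gb_per_s works without torch installed.
    import torch
    import torch.distributed as dist

    rank = dist.get_rank()
    buf = torch.empty(payload_bytes, dtype=torch.uint8, device="cuda")

    def transfer_once():
        if rank == 0:
            dist.send(buf, dst=1)
        elif rank == 1:
            dist.recv(buf, src=0)

    transfer_once()  # warm-up, so connection setup is not timed
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        transfer_once()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return bandwidth_gb_per_s(payload_bytes * iters, elapsed)
```

Comparing the reported figure against the link's nominal bandwidth gives a rough sense of how much headroom a new design could recover.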
