Motivation
Mooncake PG is a torch.distributed backend built on top of the Mooncake Transfer Engine (TE), providing a full set of collective communication primitives. While most collectives are performance-oriented, the current implementation of the point-to-point primitives (torch.distributed.send / torch.distributed.recv) is significantly under-optimized and has become a performance bottleneck in practice.
We are looking for help to design and implement a high-performance send/recv path for Mooncake PG.
Improving send/recv performance directly unlocks better scalability and responsiveness for elastic MoE and LLM serving systems, where dynamic data movement is on the critical path.
Use Case
A representative use case is Expert Parallel Load Balancing (EPLB) in LLM serving. In this workflow, expert parameters are dynamically exchanged across ranks using torch.distributed.batch_isend_irecv, which relies heavily on efficient send/recv under the hood.
Typical characteristics:
- Frequent point-to-point transfers
- Large payloads (can be up to 100 MB per message)
- Sensitivity to bandwidth
Example: sgl-project/sglang#12068 relies on Mooncake's high-performance send/recv.
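To illustrate the pattern, here is a minimal, self-contained sketch of a batched point-to-point exchange via torch.distributed.batch_isend_irecv. It uses the CPU gloo backend and tiny buffers so it runs anywhere; in the EPLB scenario the tensors would be CUDA tensors and the process group would be backed by Mooncake PG:

```python
import os
import multiprocessing as mp

import torch
import torch.distributed as dist

def exchange(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29511"
    # "gloo" stands in for the Mooncake PG backend in this CPU-only demo.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    peer = 1 - rank  # the single other rank in this two-rank demo
    send_buf = torch.full((4,), float(rank))
    recv_buf = torch.empty(4)
    # Batch the isend and irecv together, as EPLB-style weight exchange does.
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    for work in dist.batch_isend_irecv(ops):
        work.wait()
    assert torch.equal(recv_buf, torch.full((4,), float(peer)))
    dist.destroy_process_group()

def run_demo() -> None:
    ctx = mp.get_context("fork")  # fork: avoids re-importing this script
    procs = [ctx.Process(target=exchange, args=(r, 2)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    assert all(p.exitcode == 0 for p in procs)

if __name__ == "__main__":
    run_demo()
```

Each P2POp in the batch ultimately lowers to the backend's send/recv implementation, which is why their performance dominates this workflow.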
Constraints / Assumptions
- Tensors are CUDA tensors
- The transfer engine already supports GPUDirect RDMA
- Send/recv is largely independent of the other collective implementations
- Only the transfer-engine initialization logic is actually shared
Notes for Contributors
- Please do not limit designs to the current implementation, which is known to be inefficient
- Alternative designs, data paths, or abstractions are all welcome
Where to Start
If you’re interested in contributing, the following entry points may be helpful:
- MooncakeBackend::send / MooncakeBackend::recv implementations in mooncake_backend.cpp
- PR [EP] Implement send/recv #1236, which originally introduced send/recv
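For quick before/after comparisons, a micro-benchmark at the ~100 MB message size mentioned above may also be useful. The sketch below times a single transfer over the CPU gloo backend on localhost so it runs anywhere; for real measurements you would swap in the Mooncake PG backend and CUDA tensors:

```python
import os
import time
import multiprocessing as mp

import torch
import torch.distributed as dist

PAYLOAD_MB = 100  # representative EPLB message size (see Use Case)

def worker(rank: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29517"
    # "gloo" stands in for the Mooncake PG backend in this CPU-only demo.
    dist.init_process_group("gloo", rank=rank, world_size=2)
    numel = PAYLOAD_MB * 1024 * 1024 // 4  # float32 elements
    buf = torch.ones(numel) if rank == 0 else torch.empty(numel)
    dist.barrier()  # line up both ranks before timing
    start = time.perf_counter()
    if rank == 0:
        dist.send(buf, dst=1)
    else:
        dist.recv(buf, src=0)
    elapsed = time.perf_counter() - start
    if rank == 1:
        print(f"received {PAYLOAD_MB} MB in {elapsed * 1e3:.1f} ms "
              f"({PAYLOAD_MB / 1024 / elapsed:.2f} GB/s)")
        assert buf.eq(1.0).all()
    dist.destroy_process_group()

def run_benchmark() -> None:
    ctx = mp.get_context("fork")  # fork: avoids re-importing this script
    procs = [ctx.Process(target=worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    assert all(p.exitcode == 0 for p in procs)

if __name__ == "__main__":
    run_benchmark()
```

Averaging over repeated transfers (after a warm-up iteration) gives a steadier bandwidth number than a single timed send.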