Motivation
Mooncake PG is a torch.distributed backend built on top of the Mooncake Transfer Engine (TE), providing a full set of collective communication primitives. While most collectives are performance-oriented, the current implementation of the point-to-point primitives (torch.distributed.send / torch.distributed.recv) is significantly under-optimized and has become a performance bottleneck in practice.
We are looking for help to design and implement a high-performance send/recv path for Mooncake PG.
Improving send/recv performance directly unlocks better scalability and responsiveness for elastic MoE and LLM serving systems, where dynamic data movement is on the critical path.
Use Case
A representative use case is Expert Parallel Load Balancing (EPLB) in LLM serving. In this workflow, expert parameters are dynamically exchanged across ranks using torch.distributed.batch_isend_irecv, which relies heavily on efficient send/recv under the hood.
Typical characteristics:
- Frequent point-to-point transfers
- Large payloads (can be up to 100 MB per message)
- Sensitivity to bandwidth
Example: sgl-project/sglang#12068 relies on Mooncake's high-performance send/recv.
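To illustrate the pattern, here is a minimal, self-contained sketch of a batched point-to-point exchange via torch.distributed.batch_isend_irecv. It uses the CPU gloo backend and tiny buffers so it runs anywhere; in the EPLB scenario the tensors would be CUDA tensors and the process group would be backed by Mooncake PG:

```python
import os
import multiprocessing as mp

import torch
import torch.distributed as dist

def exchange(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29511"
    # "gloo" stands in for the Mooncake PG backend in this CPU-only demo.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    peer = 1 - rank  # the single other rank in this two-rank demo
    send_buf = torch.full((4,), float(rank))
    recv_buf = torch.empty(4)
    # Batch the isend and irecv together, as EPLB-style weight exchange does.
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    for work in dist.batch_isend_irecv(ops):
        work.wait()
    assert torch.equal(recv_buf, torch.full((4,), float(peer)))
    dist.destroy_process_group()

def run_demo() -> None:
    ctx = mp.get_context("fork")  # fork: avoids re-importing this script
    procs = [ctx.Process(target=exchange, args=(r, 2)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    assert all(p.exitcode == 0 for p in procs)

if __name__ == "__main__":
    run_demo()
```

Each P2POp in the batch ultimately lowers to the backend's send/recv implementation, which is why their performance dominates this workflow.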
Constraints / Assumptions
- Tensors are CUDA tensors
- The transfer engine already supports GPUDirect RDMA
- Send/recv is largely independent of the other collective implementations
- Only the transfer-engine initialization logic is actually shared
Notes for Contributors
- Please do not limit designs to the current implementation, which is known to be inefficient
- Alternative designs, data paths, or abstractions are all welcome
Where to Start
If you’re interested in contributing, the following entry points may be helpful:
- MooncakeBackend::send / MooncakeBackend::recv implementations in mooncake_backend.cpp
- PR [EP] Implement send/recv #1236, which originally introduced send/recv
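For quick before/after comparisons, a micro-benchmark at the ~100 MB message size mentioned above may also be useful. The sketch below times a single transfer over the CPU gloo backend on localhost so it runs anywhere; for real measurements you would swap in the Mooncake PG backend and CUDA tensors:

```python
import os
import time
import multiprocessing as mp

import torch
import torch.distributed as dist

PAYLOAD_MB = 100  # representative EPLB message size (see Use Case)

def worker(rank: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29517"
    # "gloo" stands in for the Mooncake PG backend in this CPU-only demo.
    dist.init_process_group("gloo", rank=rank, world_size=2)
    numel = PAYLOAD_MB * 1024 * 1024 // 4  # float32 elements
    buf = torch.ones(numel) if rank == 0 else torch.empty(numel)
    dist.barrier()  # line up both ranks before timing
    start = time.perf_counter()
    if rank == 0:
        dist.send(buf, dst=1)
    else:
        dist.recv(buf, src=0)
    elapsed = time.perf_counter() - start
    if rank == 1:
        print(f"received {PAYLOAD_MB} MB in {elapsed * 1e3:.1f} ms "
              f"({PAYLOAD_MB / 1024 / elapsed:.2f} GB/s)")
        assert buf.eq(1.0).all()
    dist.destroy_process_group()

def run_benchmark() -> None:
    ctx = mp.get_context("fork")  # fork: avoids re-importing this script
    procs = [ctx.Process(target=worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    assert all(p.exitcode == 0 for p in procs)

if __name__ == "__main__":
    run_benchmark()
```

Averaging over repeated transfers (after a warm-up iteration) gives a steadier bandwidth number than a single timed send.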