Add `GpuIpcMemHandle` #704

chhwang · 2025-12-13T00:51:06Z

Add `GpuIpcMemHandle`, a generic GPU memory handle that covers all existing methods for GPU memory mapping. This PR fixes cases where the importing side failed to fall back to a memory handle type that is feasible in its environment. It also consolidates the code for creating and destroying the various memory handles into a single RAII wrapper.
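To illustrate the fallback behavior described above, here is a minimal sketch. It is not the PR's code: the namespace, the bitmask layout, the function name `pickImportType`, and the preference order (Fabric, then PosixFd, then RuntimeIpc) are all assumptions made for illustration. Only the three handle type names come from the review summary in this thread. The idea is that the exporter records every handle type it was able to create, and the importer picks the first one it can actually open.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {  // hypothetical namespace, not part of mscclpp

// Bit flags for the handle types named in the review summary.
// The numeric values are an assumption.
enum HandleType : std::uint8_t {
  kNone = 0,
  kRuntimeIpc = 1 << 0,
  kPosixFd = 1 << 1,
  kFabric = 1 << 2,
};

// Intersect the types the exporter packed with the types this process can
// open, then return the first match in an assumed preference order.
inline HandleType pickImportType(std::uint8_t exported, std::uint8_t importable) {
  static const HandleType kPreference[] = {kFabric, kPosixFd, kRuntimeIpc};
  const std::uint8_t usable = static_cast<std::uint8_t>(exported & importable);
  for (HandleType t : kPreference) {
    if ((usable & t) != 0) return t;
  }
  return kNone;  // nothing feasible; a caller would report an error here
}

inline void usageExample() {
  const std::uint8_t exported = kRuntimeIpc | kPosixFd | kFabric;
  // The importer cannot open Fabric handles, so it falls back to PosixFd.
  assert(pickImportType(exported, kRuntimeIpc | kPosixFd) == kPosixFd);
  // No overlap between exporter and importer: no feasible type.
  assert(pickImportType(kFabric, kRuntimeIpc) == kNone);
}

}  // namespace sketch
```

Carrying all exported types in one handle lets the decision be made on the importing side, which is where the environment constraints are known.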

Co-authored-by: Copilot <[email protected]>

- [x] Move hash specialization and equality operator from std/global namespace to custom namespace
- [x] Update unordered_map to use custom hash and equality as template parameters
- [x] Add noexcept to equality operator
- [x] Verify the changes build correctly
- [x] Run code review and security checks

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: Binyang2014 <[email protected]>
Co-authored-by: Binyang Li <[email protected]>
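The checklist above describes passing a custom hash and equality functor to `std::unordered_map` instead of specializing `std::hash` or defining a global `operator==`. A rough sketch of that pattern follows. The key type `IpcHandleKey`, the functor names, the `HandleCache` alias, and the FNV-1a hash are hypothetical stand-ins chosen for illustration; the PR's actual types are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <utility>

namespace sketch {  // hypothetical namespace, not part of mscclpp

// Stand-in for an opaque, fixed-size IPC handle owned by a vendor library.
struct IpcHandleKey {
  char bytes[64];
};

// Hash functor living in our own namespace: FNV-1a over the raw bytes.
struct IpcHandleKeyHash {
  std::size_t operator()(const IpcHandleKey& k) const noexcept {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : k.bytes) {
      h ^= c;
      h *= 0x100000001b3ULL;  // FNV-1a 64-bit prime
    }
    return static_cast<std::size_t>(h);
  }
};

// Equality functor; noexcept because a byte comparison cannot throw.
struct IpcHandleKeyEqual {
  bool operator()(const IpcHandleKey& a, const IpcHandleKey& b) const noexcept {
    return std::memcmp(a.bytes, b.bytes, sizeof(a.bytes)) == 0;
  }
};

static_assert(noexcept(std::declval<const IpcHandleKeyEqual&>()(
                  std::declval<const IpcHandleKey&>(), std::declval<const IpcHandleKey&>())),
              "equality must be noexcept");

// Both functors are supplied as template parameters.
using HandleCache = std::unordered_map<IpcHandleKey, int, IpcHandleKeyHash, IpcHandleKeyEqual>;

// Helper for the example only: build a zero-padded key from a short tag.
inline IpcHandleKey makeKey(const char* tag) {
  IpcHandleKey k{};
  std::strncpy(k.bytes, tag, sizeof(k.bytes) - 1);
  return k;
}

inline void usageExample() {
  HandleCache cache;
  cache[makeKey("a")] = 1;
  cache[makeKey("b")] = 2;
  cache[makeKey("a")] = 3;  // same bytes, same entry
  assert(cache.size() == 2);
  assert(cache.at(makeKey("a")) == 3);
}

}  // namespace sketch
```

Supplying the functors explicitly keeps the project from adding declarations to `std` or the global namespace for a type it does not own.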

Copilot

Pull request overview

This pull request introduces a new GpuIpcMem class to encapsulate GPU IPC (Inter-Process Communication) memory management, replacing the previous inline implementation in RegisteredMemory. The refactoring simplifies memory handle management by consolidating three different handle types (RuntimeIpc, PosixFd, and Fabric) into a unified abstraction layer.

Key changes include:

- Introduction of GpuIpcMemHandle struct and GpuIpcMem class for managing GPU IPC memory handles
- Significant simplification of RegisteredMemory implementation by delegating IPC handle management to the new classes
- Migration from old debug.h logging system to new logger.hpp system with GPU subsystem support
- Test infrastructure improvements including MPI group caching and better resource cleanup
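The consolidation of per-type handle cleanup behind one RAII object can be sketched as follows. This is an illustration under stated assumptions, not the PR's `GpuIpcMem` implementation: `FakeDriver`, `MapKind`, `release`, and `openMapping` are hypothetical, and counters stand in for real driver calls so the example runs without a GPU.

```cpp
#include <cassert>
#include <functional>
#include <memory>

namespace sketch {  // hypothetical namespace, not part of mscclpp

enum class MapKind { RuntimeIpc, PosixFd, Fabric };

// Counters stand in for real driver calls so the sketch runs anywhere.
struct FakeDriver {
  int ipcClosed = 0;
  int vmmUnmapped = 0;
};

// One place that knows how each kind of mapping is torn down.
inline void release(MapKind kind, void* p, FakeDriver& drv) {
  (void)p;  // a real implementation would pass the pointer to the driver
  switch (kind) {
    case MapKind::RuntimeIpc:
      ++drv.ipcClosed;  // assumption: real code would close the runtime IPC mapping
      break;
    case MapKind::PosixFd:
    case MapKind::Fabric:
      ++drv.vmmUnmapped;  // assumption: real code would unmap and free the virtual range
      break;
  }
}

// Move-only owner: the deleter runs exactly once when the owner goes away.
using Mapping = std::unique_ptr<void, std::function<void(void*)>>;

// drv must outlive the returned Mapping because the deleter holds a reference.
inline Mapping openMapping(MapKind kind, void* base, FakeDriver& drv) {
  return Mapping(base, [kind, &drv](void* p) { release(kind, p, drv); });
}

inline void usageExample() {
  FakeDriver drv;
  int dummy = 0;
  {
    Mapping m = openMapping(MapKind::Fabric, &dummy, drv);
    assert(m.get() == &dummy);
    assert(drv.vmmUnmapped == 0);  // still mapped inside the scope
  }
  assert(drv.vmmUnmapped == 1);  // released on scope exit
  assert(drv.ipcClosed == 0);
}

}  // namespace sketch
```

With teardown bound to the owner's lifetime, callers such as a registered-memory object no longer need a per-type cleanup branch of their own.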

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/gpu_ipc_mem.cc | New file implementing GpuIpcMem class with support for RuntimeIpc, PosixFd, and Fabric handle types |
| src/include/gpu_ipc_mem.hpp | New header defining GpuIpcMemHandle struct and GpuIpcMem class interfaces |
| src/registered_memory.cc | Refactored to use GpuIpcMem abstraction, removing ~100 lines of inline IPC handle management code |
| src/include/registered_memory.hpp | Updated to use GpuIpcMemHandle in TransportInfo union and removed obsolete fields |
| src/include/logger.hpp | Added GPU subsystem to LogSubsys enum for GPU-related logging |
| test/mp_unit/executor_tests.cc | Moved test prerequisites check from test body to SetUp() method |
| python/test/mscclpp_mpi.py | Added MPI group caching and improved fixture cleanup with try-finally |
| python/test/conftest.py | New conftest for MPI initialization before test collection |
| python/mscclpp/utils.py | Added stdin=subprocess.DEVNULL to prevent subprocess hanging |
| include/mscclpp/gpu.hpp | Added CUDA_ERROR_NOT_SUPPORTED macro mapping for HIP compatibility |
| src/gpu_utils.cc | Removed unused nvlsCompatibleMemHandleType variable |

Comments suppressed due to low confidence (1)

src/registered_memory.cc:27

The WARN macro call should use mscclpp::LogSubsys::P2P enum value instead of mscclpp::P2P. The logger.hpp defines LogSubsys enum with GPU, NET, CONN, EXEC, NCCL values. P2P is not defined in LogSubsys. Consider using GPU subsystem here since this is GPU IPC related code.

      WARN("Call to " #cmd " failed, error is %s", errStr); \

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

chhwang · 2026-01-06T09:17:21Z

/azp run

azure-pipelines · 2026-01-06T09:17:42Z

Azure Pipelines successfully started running 3 pipeline(s).

Binyang2014

Are you going to update switch_channel.cc?

chhwang · 2026-01-06T12:14:40Z

/azp run mscclpp-ut

azure-pipelines · 2026-01-06T12:14:52Z

Azure Pipelines successfully started running 1 pipeline(s).

Binyang2014 and others added 8 commits December 4, 2025 19:20

add ipc cache

70c1d4d

WIP

1739f5a

Update src/registered_memory.cc

4ebe37e

Co-authored-by: Copilot <[email protected]>

WIP

2137325

fix ut

b1029b9

Merge branch 'main' into binyli/handle_cache

d97d230

Add GpuIpcMem class

fcb1ab6

chhwang changed the base branch from binyli/handle_cache to main December 16, 2025 22:51

chhwang requested a review from Copilot December 16, 2025 22:52

Copilot started reviewing on behalf of chhwang December 16, 2025 22:53

chhwang force-pushed the chhwang/new-ipc-mem branch from 5598d49 to fcb1ab6 December 16, 2025 22:53

chhwang added 2 commits December 16, 2025 14:53

Merge branch 'main' into chhwang/new-ipc-mem

73982f7

revert

ebec0ee

Copilot AI reviewed Dec 16, 2025

chhwang added 2 commits December 16, 2025 23:11

update

dc77036

update

c3d2c2b

chhwang mentioned this pull request Jan 4, 2026

[Feature] Registering cuMem address without NVLS or MNNVL support #714

Open

chhwang added 5 commits January 5, 2026 22:28

Merge branch 'main' into chhwang/new-ipc-mem

8eccca7

Lint

3a07282

tackle comments

77245e5

lint

61cc7d6

Merge branch 'main' into chhwang/new-ipc-mem

c3f467b

chhwang requested a review from Copilot January 6, 2026 07:35

Copilot started reviewing on behalf of chhwang January 6, 2026 07:35

Copilot AI reviewed Jan 6, 2026

chhwang added 3 commits January 6, 2026 08:32

add comments

61ee117

tackle comments

542800d

tackle comment

0d7f877

chhwang changed the title ~~Add GpuIpcMem class~~ Add GpuIpcMemHandle Jan 6, 2026

chhwang requested a review from Binyang2014 January 6, 2026 08:53

rocm fix

2ff8e1f

Binyang2014 reviewed Jan 6, 2026

Binyang2014 reviewed Jan 6, 2026

tackle comments

c99d344

more fix

0037490

chhwang requested a review from Binyang2014 January 7, 2026 06:38

Merge branch 'main' into chhwang/new-ipc-mem

2abc702

3 participants