Skip to content

Conversation

@chhwang
Copy link
Contributor

@chhwang chhwang commented Dec 13, 2025

Add GpuIpcMemHandle that is a generic GPU memory handle that covers all existing methods for GPU memory mapping. This PR fixes issues that fail to properly fallback to a feasible type of memory handle on the importing environment. It also consolidates code for creating or destroying various memory handles into a single RAII wrapper.

Binyang2014 and others added 8 commits December 4, 2025 19:20
- [x] Move hash specialization and equality operator from std/global
namespace to custom namespace
- [x] Update unordered_map to use custom hash and equality as template
parameters
- [x] Add noexcept to equality operator
- [x] Verify the changes build correctly
- [x] Run code review and security checks

<!-- START COPILOT CODING AGENT TIPS -->
---

✨ Let Copilot coding agent [set things up for
you](https://github.com/microsoft/mscclpp/issues/new?title=✨+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot)
— coding agent works faster and does higher quality work when set up for
your repo.

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: Binyang2014 <[email protected]>
Co-authored-by: Binyang Li <[email protected]>
@chhwang chhwang changed the base branch from binyli/handle_cache to main December 16, 2025 22:51
@chhwang chhwang requested a review from Copilot December 16, 2025 22:52
@chhwang chhwang force-pushed the chhwang/new-ipc-mem branch from 5598d49 to fcb1ab6 Compare December 16, 2025 22:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request introduces a new GpuIpcMem class to encapsulate GPU IPC (Inter-Process Communication) memory management, replacing the previous inline implementation in RegisteredMemory. The refactoring simplifies memory handle management by consolidating three different handle types (RuntimeIpc, PosixFd, and Fabric) into a unified abstraction layer.

Key changes include:

  • Introduction of GpuIpcMemHandle struct and GpuIpcMem class for managing GPU IPC memory handles
  • Significant simplification of RegisteredMemory implementation by delegating IPC handle management to the new classes
  • Migration from old debug.h logging system to new logger.hpp system with GPU subsystem support
  • Test infrastructure improvements including MPI group caching and better resource cleanup

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/gpu_ipc_mem.cc New file implementing GpuIpcMem class with support for RuntimeIpc, PosixFd, and Fabric handle types
src/include/gpu_ipc_mem.hpp New header defining GpuIpcMemHandle struct and GpuIpcMem class interfaces
src/registered_memory.cc Refactored to use GpuIpcMem abstraction, removing ~100 lines of inline IPC handle management code
src/include/registered_memory.hpp Updated to use GpuIpcMemHandle in TransportInfo union and removed obsolete fields
src/include/logger.hpp Added GPU subsystem to LogSubsys enum for GPU-related logging
test/mp_unit/executor_tests.cc Moved test prerequisites check from test body to SetUp() method
python/test/mscclpp_mpi.py Added MPI group caching and improved fixture cleanup with try-finally
python/test/conftest.py New conftest for MPI initialization before test collection
python/mscclpp/utils.py Added stdin=subprocess.DEVNULL to prevent subprocess hanging
include/mscclpp/gpu.hpp Added CUDA_ERROR_NOT_SUPPORTED macro mapping for HIP compatibility
src/gpu_utils.cc Removed unused nvlsCompatibleMemHandleType variable
Comments suppressed due to low confidence (1)

src/registered_memory.cc:27

  • The WARN macro call should use mscclpp::LogSubsys::P2P enum value instead of mscclpp::P2P. The logger.hpp defines LogSubsys enum with GPU, NET, CONN, EXEC, NCCL values. P2P is not defined in LogSubsys. Consider using GPU subsystem here since this is GPU IPC related code.
      WARN("Call to " #cmd " failed, error is %s", errStr); \

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

@chhwang chhwang changed the title Add GpuIpcMem class Add GpuIpcMemHandle Jan 6, 2026
@chhwang chhwang requested a review from Binyang2014 January 6, 2026 08:53
@chhwang
Copy link
Contributor Author

chhwang commented Jan 6, 2026

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

Copy link
Contributor

@Binyang2014 Binyang2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to update switch_channel.cc?

@chhwang
Copy link
Contributor Author

chhwang commented Jan 6, 2026

/azp run mscclpp-ut

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@chhwang chhwang requested a review from Binyang2014 January 7, 2026 06:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants