-
Notifications
You must be signed in to change notification settings - Fork 77
Add GpuIpcMemHandle
#704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add GpuIpcMemHandle
#704
Conversation
Co-authored-by: Copilot <[email protected]>
- [x] Move hash specialization and equality operator from std/global namespace to custom namespace - [x] Update unordered_map to use custom hash and equality as template parameters - [x] Add noexcept to equality operator - [x] Verify the changes build correctly - [x] Run code review and security checks <!-- START COPILOT CODING AGENT TIPS --> --- ✨ Let Copilot coding agent [set things up for you](https://github.com/microsoft/mscclpp/issues/new?title=✨+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot) — coding agent works faster and does higher quality work when set up for your repo. --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: Binyang2014 <[email protected]> Co-authored-by: Binyang Li <[email protected]>
5598d49 to
fcb1ab6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request introduces a new GpuIpcMem class to encapsulate GPU IPC (Inter-Process Communication) memory management, replacing the previous inline implementation in RegisteredMemory. The refactoring simplifies memory handle management by consolidating three different handle types (RuntimeIpc, PosixFd, and Fabric) into a unified abstraction layer.
Key changes include:
- Introduction of
GpuIpcMemHandlestruct andGpuIpcMemclass for managing GPU IPC memory handles - Significant simplification of
RegisteredMemoryimplementation by delegating IPC handle management to the new classes - Migration from old
debug.hlogging system to newlogger.hppsystem with GPU subsystem support - Test infrastructure improvements including MPI group caching and better resource cleanup
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/gpu_ipc_mem.cc | New file implementing GpuIpcMem class with support for RuntimeIpc, PosixFd, and Fabric handle types |
| src/include/gpu_ipc_mem.hpp | New header defining GpuIpcMemHandle struct and GpuIpcMem class interfaces |
| src/registered_memory.cc | Refactored to use GpuIpcMem abstraction, removing ~100 lines of inline IPC handle management code |
| src/include/registered_memory.hpp | Updated to use GpuIpcMemHandle in TransportInfo union and removed obsolete fields |
| src/include/logger.hpp | Added GPU subsystem to LogSubsys enum for GPU-related logging |
| test/mp_unit/executor_tests.cc | Moved test prerequisites check from test body to SetUp() method |
| python/test/mscclpp_mpi.py | Added MPI group caching and improved fixture cleanup with try-finally |
| python/test/conftest.py | New conftest for MPI initialization before test collection |
| python/mscclpp/utils.py | Added stdin=subprocess.DEVNULL to prevent subprocess hanging |
| include/mscclpp/gpu.hpp | Added CUDA_ERROR_NOT_SUPPORTED macro mapping for HIP compatibility |
| src/gpu_utils.cc | Removed unused nvlsCompatibleMemHandleType variable |
Comments suppressed due to low confidence (1)
src/registered_memory.cc:27
- The WARN macro call should use mscclpp::LogSubsys::P2P enum value instead of mscclpp::P2P. The logger.hpp defines LogSubsys enum with GPU, NET, CONN, EXEC, NCCL values. P2P is not defined in LogSubsys. Consider using GPU subsystem here since this is GPU IPC related code.
WARN("Call to " #cmd " failed, error is %s", errStr); \
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
|
/azp run |
|
Azure Pipelines successfully started running 3 pipeline(s). |
Binyang2014
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you going to update switch_channel.cc?
|
/azp run mscclpp-ut |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Add
GpuIpcMemHandlethat is a generic GPU memory handle that covers all existing methods for GPU memory mapping. This PR fixes issues that fail to properly fallback to a feasible type of memory handle on the importing environment. It also consolidates code for creating or destroying various memory handles into a single RAII wrapper.