-
Notifications
You must be signed in to change notification settings - Fork 18
update nccl tests image to support gb200 #981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds support for configurable NCCL tests image versions to enable GB200 architecture support. The changes introduce a new --nccl_tests_version parameter that allows users to specify architecture-specific image tags (e.g., "amd64" or "arm64") instead of using a hardcoded "latest" tag. However, several critical issues have been identified.
Key changes include:
- Addition of
nccl_tests_versionparameter throughout the execution pipeline - Migration from hardcoded "latest" image tags to templated version tags in all Kubernetes job manifests
- Significant modifications to Azure hostpath job configuration, including NCCL environment variables optimized for newer GPU architectures
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| modules/python/gpu/main.py | Adds nccl_tests_version parameter to the execute function with default value "amd64", propagates it through the replacements dictionary to YAML templates |
| modules/python/gpu/README.md | Documents the new NCCL_TESTS_VERSION environment variable with usage examples |
| modules/python/gpu/cfg/mpi/azure-job-topology.yaml | Updates image tags from hardcoded path with "latest" to templated path with {{nccl_tests_version}}, removes "azure/" prefix from image repository path |
| modules/python/gpu/cfg/mpi/azure-job-sriov.yaml | Updates image tags from hardcoded path with "latest" to templated path with {{nccl_tests_version}}, removes "azure/" prefix from image repository path |
| modules/python/gpu/cfg/mpi/azure-job-hostpath.yaml | Updates image tags, removes dynamic GPU/port calculation logic, replaces UCX-based InfiniBand configuration with NVLS/MNNVL settings for GB200 support |
| modules/python/gpu/cfg/mpi/aws-job.yaml | Updates image tags from "latest" to {{nccl_tests_version}} while maintaining the same repository path |
| containers: | ||
| - name: nccl-launcher | ||
| image: telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest | ||
| image: telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}} |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The container image path has been changed from "telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest" to "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}", removing the "azure/" prefix. This is a breaking change that could cause deployment failures if the image doesn't exist at the new path. Please verify that the image repository at "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests" exists and contains the expected architecture-specific tags (e.g., "amd64", "arm64").
| containers: | ||
| - name: nccl-launcher | ||
| image: telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest | ||
| image: telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}} |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The container image path has been changed from "telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest" to "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}", removing the "azure/" prefix. This is a breaking change that could cause deployment failures if the image doesn't exist at the new path. Please verify that the image repository at "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests" exists and contains the expected architecture-specific tags (e.g., "amd64", "arm64").
| containers: | ||
| - name: nccl-worker | ||
| image: telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest | ||
| image: telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}} |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The container image path has been changed from "telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest" to "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}", removing the "azure/" prefix. This is a breaking change that could cause deployment failures if the image doesn't exist at the new path. Please verify that the image repository at "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests" exists and contains the expected architecture-specific tags (e.g., "amd64", "arm64").
| containers: | ||
| - name: nccl-worker | ||
| image: telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest | ||
| image: telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}} |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The container image path has been changed from "telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest" to "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}", removing the "azure/" prefix. This is a breaking change that could cause deployment failures if the image doesn't exist at the new path. Please verify that the image repository at "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests" exists and contains the expected architecture-specific tags (e.g., "amd64", "arm64").
This pull request introduces support for specifying the NCCL tests image version (such as "amd64" or "arm64") throughout the GPU benchmarking workflow. The change allows users to select the appropriate image tag for their architecture, propagates this option from the command line through the Python code, and updates all relevant Kubernetes job templates to use the specified image version instead of a hardcoded "latest" tag. Additionally, some environment variable handling for Azure jobs has been streamlined.
Key changes include:
Support for configurable NCCL tests image version:
--nccl_tests_versioncommand-line argument (default: "amd64") to theexecutecommand inmain.py, and updated theexecutefunction and its documentation to accept and propagate this parameter. [1] [2] [3] [4]README.mdto include the newNCCL_TESTS_VERSIONenvironment variable and command-line argument. [1] [2]Kubernetes job template improvements:
{{nccl_tests_version}}variable for the container image tag instead of the hardcodedlatesttag, ensuring the correct image is used for the specified architecture. [1] [2] [3] [4] [5] [6] [7] [8]Azure job environment and command cleanup:
These changes make the benchmarking workflow more flexible and portable across different hardware architectures.