Skip to content

Conversation

@anson627
Copy link
Collaborator

This pull request introduces support for specifying the NCCL tests image version (such as "amd64" or "arm64") throughout the GPU benchmarking workflow. The change allows users to select the appropriate image tag for their architecture, propagates this option from the command line through the Python code, and updates all relevant Kubernetes job templates to use the specified image version instead of a hardcoded "latest" tag. Additionally, some environment variable handling for Azure jobs has been streamlined.

Key changes include:

Support for configurable NCCL tests image version:

  • Added a new --nccl_tests_version command-line argument (default: "amd64") to the execute command in main.py, and updated the execute function and its documentation to accept and propagate this parameter. [1] [2] [3] [4]
  • Updated the example usage in README.md to include the new NCCL_TESTS_VERSION environment variable and command-line argument. [1] [2]

Kubernetes job template improvements:

  • Modified all AWS and Azure NCCL job YAML templates to use the {{nccl_tests_version}} variable for the container image tag instead of the hardcoded latest tag, ensuring the correct image is used for the specified architecture. [1] [2] [3] [4] [5] [6] [7] [8]

Azure job environment and command cleanup:

  • Simplified the Azure job hostpath YAML by removing unused environment variables and command logic, and updated NCCL-related environment variables for clarity and correctness.

These changes make the benchmarking workflow more flexible and portable across different hardware architectures.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for configurable NCCL tests image versions to enable GB200 architecture support. The changes introduce a new --nccl_tests_version parameter that allows users to specify architecture-specific image tags (e.g., "amd64" or "arm64") instead of using a hardcoded "latest" tag. However, several critical issues have been identified.

Key changes include:

  • Addition of nccl_tests_version parameter throughout the execution pipeline
  • Migration from hardcoded "latest" image tags to templated version tags in all Kubernetes job manifests
  • Significant modifications to Azure hostpath job configuration, including NCCL environment variables optimized for newer GPU architectures

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
modules/python/gpu/main.py Adds nccl_tests_version parameter to the execute function with default value "amd64", propagates it through the replacements dictionary to YAML templates
modules/python/gpu/README.md Documents the new NCCL_TESTS_VERSION environment variable with usage examples
modules/python/gpu/cfg/mpi/azure-job-topology.yaml Updates image tags from hardcoded path with "latest" to templated path with {{nccl_tests_version}}, removes "azure/" prefix from image repository path
modules/python/gpu/cfg/mpi/azure-job-sriov.yaml Updates image tags from hardcoded path with "latest" to templated path with {{nccl_tests_version}}, removes "azure/" prefix from image repository path
modules/python/gpu/cfg/mpi/azure-job-hostpath.yaml Updates image tags, removes dynamic GPU/port calculation logic, replaces UCX-based InfiniBand configuration with NVLS/MNNVL settings for GB200 support
modules/python/gpu/cfg/mpi/aws-job.yaml Updates image tags from "latest" to {{nccl_tests_version}} while maintaining the same repository path

containers:
- name: nccl-launcher
image: telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest
image: telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container image path has been changed from "telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest" to "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}", removing the "azure/" prefix. This is a breaking change that could cause deployment failures if the image doesn't exist at the new path. Please verify that the image repository at "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests" exists and contains the expected architecture-specific tags (e.g., "amd64", "arm64").

Copilot uses AI. Check for mistakes.
containers:
- name: nccl-launcher
image: telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest
image: telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container image path has been changed from "telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest" to "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}", removing the "azure/" prefix. This is a breaking change that could cause deployment failures if the image doesn't exist at the new path. Please verify that the image repository at "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests" exists and contains the expected architecture-specific tags (e.g., "amd64", "arm64").

Copilot uses AI. Check for mistakes.
containers:
- name: nccl-worker
image: telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest
image: telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container image path has been changed from "telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest" to "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}", removing the "azure/" prefix. This is a breaking change that could cause deployment failures if the image doesn't exist at the new path. Please verify that the image repository at "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests" exists and contains the expected architecture-specific tags (e.g., "amd64", "arm64").

Copilot uses AI. Check for mistakes.
containers:
- name: nccl-worker
image: telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest
image: telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container image path has been changed from "telescope.azurecr.io/azure/aks-rdma-infiniband/nccl-tests:latest" to "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests:{{nccl_tests_version}}", removing the "azure/" prefix. This is a breaking change that could cause deployment failures if the image doesn't exist at the new path. Please verify that the image repository at "telescope.azurecr.io/aks-rdma-infiniband/nccl-tests" exists and contains the expected architecture-specific tags (e.g., "amd64", "arm64").

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants