@tohtana commented Jan 18, 2026

The full unit test workflow has been disabled for a while. This PR migrates the full test to our AWS test infra.
To make the tests pass, we need to merge these PRs:

In addition to requiring those PRs to be merged, this PR makes the following changes to the full test workflow and test harness:

  • Ignore flags for some known issues:
    • nvme: Requires an actual NVMe device. Our CI currently doesn't have NVMe storage configured.
    • GDS: Requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers. CI instances don't have this configured.
    • Zenflow: (1) Stage 3 bugs: the ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes. (2) CUDA/fork incompatibility: test_zf_torch_adam.py uses torch.optim.AdamW, whose CUDA graph capture checks fail in forked processes (the --forked flag); we can just move this test to the sequential tests.
  • /mnt/aio mount for async I/O tests
  • CUTLASS installation for Evoformer tests
  • Add DS_DISABLE_REUSE_DIST_ENV to the test harness to prevent worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or at scheduled times.
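As a rough illustration of how the ignore flags and the parallel/sequential split could translate into pytest invocations, here is a minimal sketch. The helper `build_pytest_args` and the ignored paths are hypothetical placeholders and are not taken from the workflow file; only the flags themselves (`--ignore` from pytest, `-n` from pytest-xdist, `--forked` from pytest-forked, `-m` for marker selection) are standard options.

```python
from typing import List, Sequence


def build_pytest_args(
    test_root: str,
    ignore_paths: Sequence[str],
    workers: int = 0,
    marker: str = "",
    forked: bool = False,
) -> List[str]:
    """Assemble a pytest command line (hypothetical helper, for illustration only)."""
    args = ["pytest"]
    if forked:
        # --forked (pytest-forked) runs each test in a forked subprocess.
        args.append("--forked")
    if workers > 0:
        # -n (pytest-xdist) distributes tests across worker processes.
        args += ["-n", str(workers)]
    if marker:
        # -m selects tests by marker expression.
        args += ["-m", marker]
    # One --ignore flag per path that should be skipped during collection.
    args += [f"--ignore={path}" for path in ignore_paths]
    args.append(test_root)
    return args


# Placeholder paths: the real workflow's ignore list may differ.
KNOWN_ISSUE_PATHS = ["unit/placeholder_nvme", "unit/placeholder_gds"]

parallel_cmd = build_pytest_args("unit/", KNOWN_ISSUE_PATHS, workers=8)
sequential_cmd = build_pytest_args("unit/", KNOWN_ISSUE_PATHS, marker="sequential")
```

Keeping the ignore list in one place means the parallel and sequential runs skip the same known-issue tests.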

Add a new workflow file for running the full DeepSpeed unit test suite
on AWS L40S runners. This includes:
- CUTLASS installation for Evoformer tests
- Full dependencies (transformers, pytest-timeout, etc.)
- DS_DISABLE_REUSE_DIST_ENV to prevent worker cleanup hangs
- /mnt/aio mount for async I/O tests
- Parallel tests (-n 8) and sequential tests
- Ignore flags for known issues (nvme, GDS, zenflow, etc.)

This workflow is separate from aws-torch-latest.yml, which runs only V1 tests.

Signed-off-by: Masahiro Tanaka <[email protected]>

Allow disabling reuse_dist_env via environment variable. This is useful
for CI full test runs where reusing the distributed environment can cause
pool worker cleanup to hang after tests complete.

Signed-off-by: Masahiro Tanaka <[email protected]>
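
The environment-variable override described in this commit could look roughly like the sketch below. It is not the actual harness code: the helper name `resolve_reuse_dist_env` and the set of values treated as "true" are assumptions made for illustration. Only the variable name `DS_DISABLE_REUSE_DIST_ENV` and the `reuse_dist_env` setting come from the PR.

```python
import os
from typing import Mapping

# Assumption: which strings count as "enabled" is a guess, not the harness's rule.
_TRUTHY = {"1", "true", "yes", "on"}


def resolve_reuse_dist_env(default: bool, environ: Mapping[str, str] = os.environ) -> bool:
    """Return the effective reuse_dist_env value (hypothetical helper).

    If DS_DISABLE_REUSE_DIST_ENV is set to a truthy value, reuse is forced
    off; otherwise the test's own default is kept.
    """
    raw = environ.get("DS_DISABLE_REUSE_DIST_ENV", "")
    if raw.strip().lower() in _TRUTHY:
        return False
    return default


# A test class that normally reuses the distributed environment would
# stop doing so when the variable is exported in the CI job.
effective = resolve_reuse_dist_env(True, {"DS_DISABLE_REUSE_DIST_ENV": "1"})
```

With an override of this kind, the CI job can turn off environment reuse for the whole run without editing individual tests.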