Conversation

@tohtana tohtana commented Jan 17, 2026

When using `bf16=True` with `zero_optimization.stage=0`, the optimizer state is not saved or loaded during checkpointing. The optimizer's step counter and other states (`exp_avg`, `exp_avg_sq`) are lost after loading a checkpoint.
This PR fixes the flag that reflects this config combination so the optimizer state is checkpointed, and adds a test argument to cover the previously untested case.
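The failure mode can be pictured with a toy checkpointing sketch (illustrative only, not DeepSpeed's actual code; the flag name and state fields are stand-ins): when the flag derived from the config wrongly reports that there is no optimizer state to persist, the step counter and moment estimates are silently reset on load.

```python
# Toy illustration of the failure mode (NOT DeepSpeed code).
# An Adam-style optimizer keeps a step counter and per-parameter
# moment estimates (exp_avg, exp_avg_sq).

def make_optimizer_state():
    return {"step": 3, "exp_avg": [0.1, 0.2], "exp_avg_sq": [0.01, 0.04]}

def make_fresh_state():
    return {"step": 0, "exp_avg": [0.0, 0.0], "exp_avg_sq": [0.0, 0.0]}

def save_checkpoint(opt_state, has_optimizer_state):
    # If the config-derived flag is wrong, optimizer state is dropped here.
    ckpt = {"model": {"w": [1.0, 2.0]}}
    if has_optimizer_state:
        ckpt["optimizer"] = opt_state
    return ckpt

def load_checkpoint(ckpt):
    # A missing "optimizer" entry silently falls back to a fresh state.
    return ckpt.get("optimizer", make_fresh_state())

state = make_optimizer_state()

# Buggy path: flag wrongly False -> step/exp_avg/exp_avg_sq lost on reload.
lost = load_checkpoint(save_checkpoint(state, has_optimizer_state=False))
# Fixed path: flag correctly True -> state survives the round-trip.
kept = load_checkpoint(save_checkpoint(state, has_optimizer_state=True))

print(lost["step"], kept["step"])  # 0 3
```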

@tohtana tohtana enabled auto-merge (squash) January 17, 2026 18:39
@tohtana tohtana merged commit 991ebd7 into deepspeedai:master Jan 19, 2026
11 checks passed
tohtana added a commit that referenced this pull request Jan 20, 2026
The full unit test workflow has been disabled for a while. This PR
migrates the full test suite to our AWS test infrastructure.
To make the tests pass, we need to merge these PRs:
- #7786
- #7788
- #7789
- #7790
- #7793
- #7794

In addition to merging those PRs, this PR makes the following changes
to the full test workflow and test harness:
- Ignore flags for some known issues:
  - nvme: requires an actual NVMe device; our CI currently doesn't have
    NVMe storage configured.
  - GDS: requires special kernel drivers and NVIDIA Magnum IO to enable
    direct GPU-to-storage transfers; CI instances don't have this
    configured.
  - ZenFlow: (1) Stage 3 bugs: the ZenFlow + ZeRO Stage 3 implementation
    has pre-existing bugs that cause internal pytest errors and worker
    crashes. (2) CUDA/fork incompatibility: test_zf_torch_adam.py uses
    torch.optim.AdamW, whose CUDA graph capture checks fail in forked
    processes (the --forked flag; we can simply move it to the
    sequential tests).
- Add a `/mnt/aio` mount for async I/O tests
- Install CUTLASS for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent
  worker-cleanup hangs
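As a sketch of how a harness typically consumes such an environment flag (the variable name comes from this PR, but the parsing helper below is an assumption for illustration, not the actual harness code):

```python
import os

def env_flag(name, default=False):
    # Hypothetical helper: treat "1", "true", "yes" (any case) as enabled;
    # anything else, or an unset variable, falls back to the default.
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes")

# Simulate the CI environment setting the flag from this PR.
os.environ["DS_DISABLE_REUSE_DIST_ENV"] = "1"
disable_reuse = env_flag("DS_DISABLE_REUSE_DIST_ENV")
print(disable_reuse)  # True
```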

Once we merge this PR, we will be able to run the full test manually or
at scheduled times.
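A minimal trigger sketch for "manual or scheduled" runs, as a GitHub Actions fragment (the workflow file name and cron schedule are assumptions; the actual workflow in the repo may differ):

```yaml
# .github/workflows/nv-full.yml (hypothetical name)
name: full-tests
on:
  workflow_dispatch:      # manual trigger from the Actions tab
  schedule:
    - cron: "0 6 * * 0"   # example schedule: weekly, Sunday 06:00 UTC
```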

---------

Signed-off-by: Masahiro Tanaka <[email protected]>