Training torchtitan #191
Merged
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- torchtitan_training_lib.py: TorchTitanTrainingJob class for single-node
and multi-node distributed training using torchrun + TOML config
- Test files: Llama 3.1 8B and 70B x single-node and multi-node
- Config templates: single-node and distributed JSON with model params
for mi300x and mi325; single-node nnodes hardcoded to 1
- TORCHTITAN_TEST_GUIDE.md: full reference for running and extending tests
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- CLAUDE.md: restructure for editing-context (Conventions, workload-class
lifecycle, env vars, "Where to look when..."); drop generic defaults
already enforced by ruff; remove TORCHTITAN_TEST_GUIDE reference.
- torchtitan_training_lib.py: simplify tt_config to {module}_{model_size};
use self.master_address in single-node torchrun rdzv_endpoint;
collapse duplicate single-node docker exec loop.
- Delete TORCHTITAN_TEST_GUIDE.md.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Not required for pytest collection (megatron/ and jax/ siblings work
without it) nor for sdist packaging (MANIFEST.in uses recursive-include).
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- Add config-driven TOML generation: use_generated_config=True writes a
TOML from model_params to {scripts_dir}/run_config.toml; False falls
back to the canned TOML shipped with TorchTitan (path now includes
.toml extension).
- Parameterize torchtitan_root (default /workspace/torchtitan); 7 new
per-model fields (hf_assets_path, converters, dataset, lr,
warmup_steps, enable_async_tensor_parallel,
precompute_float8_dynamic_scale_for_fsdp) populate the generated TOML.
- Inline config_file_path into torchrun instead of exporting
\$CONFIG_FILE: the wrapper-script echo was expanding \$CONFIG_FILE
against the SSH session env at echo-time, silently overriding our
export with a stale /tmp path.
- Wrap download + sleep + torchrun chain in nohup sh -c '...' & disown
with stdio detached so it survives docker exec returning.
- Truncate training.log before download/sleep window so the poller
doesn't scan stale prior-run output during the launch gap; bump
pre-poll wait from 80s to 300s to clear download + sleep 200.
- Align with megatron reference: pattern-table-driven regex parsing
(TRAINING_RESULT_PATTERNS, TRAINING_PROGRESS_PATTERNS_TEMPLATE,
TRAINING_NAN_PATTERNS) + helpers, configurable hca_id_pattern with
re.escape per segment, fuller docstrings on lifecycle methods.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
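The pattern-table-driven parsing mentioned in the last bullet above could look roughly like the sketch below. The table names (TRAINING_RESULT_PATTERNS, TRAINING_PROGRESS_PATTERNS_TEMPLATE, TRAINING_NAN_PATTERNS) come from the commit message; the regexes, the helper names, and the sample log format are illustrative assumptions and are not taken from torchtitan_training_lib.py.

```python
import re

# Hypothetical pattern tables. Only the table names come from the commit
# message; every regex and the log line format are assumptions.
TRAINING_RESULT_PATTERNS = {
    "loss": re.compile(r"loss:\s*([0-9]+(?:\.[0-9]+)?)"),
    "tps": re.compile(r"tps:\s*([0-9][0-9,]*)"),
}
TRAINING_PROGRESS_PATTERNS_TEMPLATE = {
    "reached_step": r"step:\s*{step}\b",
}
TRAINING_NAN_PATTERNS = [
    re.compile(r"loss:\s*(?:nan|inf)\b", re.IGNORECASE),
]


def parse_last_results(log_text, patterns=TRAINING_RESULT_PATTERNS):
    """Return the last captured value per named pattern (None if absent)."""
    results = {}
    for name, pattern in patterns.items():
        matches = pattern.findall(log_text)
        # Strip thousands separators so callers can convert to a number.
        results[name] = matches[-1].replace(",", "") if matches else None
    return results


def reached_step(log_text, step):
    """True if the log contains a line reporting exactly this step."""
    template = TRAINING_PROGRESS_PATTERNS_TEMPLATE["reached_step"]
    return re.search(template.format(step=int(step)), log_text) is not None


def has_nan(log_text, patterns=TRAINING_NAN_PATTERNS):
    """True if any NaN/Inf pattern appears anywhere in the log."""
    return any(p.search(log_text) for p in patterns)


# Made-up log excerpt, only for exercising the helpers.
sample_log = (
    "step:  1  loss: 12.2541  tps: 1,234\n"
    "step:  2  loss: 11.9032  tps: 5,678\n"
)
last_results = parse_last_results(sample_log)
```

Keeping the regexes in tables means a log-format change touches one dictionary entry rather than the polling logic, which is presumably why the megatron reference is organized this way.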
Switch the shipped templates to the canned-TOML path so customers get
the upstream TorchTitan preset out of the box. Customers who want full
JSON-driven config can flip this to "True".
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
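A minimal sketch of how the use_generated_config switch might select between the canned TOML and a generated one. The flag name, the per-model field names, and the run_config.toml file name come from the commit messages in this PR. The TOML section names in FIELD_SECTIONS, the helper names, and the sample values are assumptions and have not been checked against TorchTitan's config schema.

```python
import tempfile
from pathlib import Path

# Hypothetical field -> TOML section mapping. The section names are guesses.
FIELD_SECTIONS = {
    "hf_assets_path": "model",
    "converters": "model",
    "dataset": "training",
    "lr": "optimizer",
    "warmup_steps": "lr_scheduler",
    "enable_async_tensor_parallel": "parallelism",
    "precompute_float8_dynamic_scale_for_fsdp": "float8",
}


def as_bool(value):
    """Accept real booleans as well as JSON strings such as "True"."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def toml_value(value):
    """Render a scalar or flat list as a TOML literal."""
    if isinstance(value, bool):  # checked first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(toml_value(v) for v in value) + "]"
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_toml(model_params):
    """Group the known fields by section and emit TOML text."""
    sections = {}
    for field, section in FIELD_SECTIONS.items():
        if field in model_params:
            sections.setdefault(section, {})[field] = model_params[field]
    lines = []
    for section, fields in sections.items():
        lines.append(f"[{section}]")
        lines.extend(f"{key} = {toml_value(val)}" for key, val in fields.items())
        lines.append("")
    return "\n".join(lines)


def resolve_config_path(model_params, scripts_dir, canned_toml_path):
    """Return the canned TOML path, or write and return a generated one."""
    if not as_bool(model_params.get("use_generated_config", "False")):
        return Path(canned_toml_path)
    generated = Path(scripts_dir) / "run_config.toml"
    generated.write_text(render_toml(model_params))
    return generated


# Illustrative values only.
params = {"use_generated_config": "True", "dataset": "c4", "lr": 0.0003,
          "warmup_steps": 200, "enable_async_tensor_parallel": False}
with tempfile.TemporaryDirectory() as scripts_dir:
    generated_path = resolve_config_path(params, scripts_dir, "canned.toml")
    generated_text = generated_path.read_text()
canned_path = resolve_config_path({"use_generated_config": "False"}, "unused", "canned.toml")
```

Because the templates store the flag as a quoted string, the sketch normalizes it with as_bool before branching; a bare truthiness check would treat the string "False" as true.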
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
cijohnson approved these changes on Jun 19, 2026
cijohnson (Collaborator) left a comment:
If CLAUDE.md was accidentally added to the repo, please remove it.
sarachoi-amd pushed a commit that referenced this pull request on Jun 26, 2026
* Add CLAUDE.md with codebase guidance for Claude Code
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* Update CLAUDE.md with project structure and full lib/ listing
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* Add TorchTitan training tests, lib, config templates, and guide
- torchtitan_training_lib.py: TorchTitanTrainingJob class for single-node
and multi-node distributed training using torchrun + TOML config
- Test files: Llama 3.1 8B and 70B x single-node and multi-node
- Config templates: single-node and distributed JSON with model params
for mi300x and mi325; single-node nnodes hardcoded to 1
- TORCHTITAN_TEST_GUIDE.md: full reference for running and extending tests
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* Rewrite CLAUDE.md as Claude-focused guidance; minor torchtitan cleanup
- CLAUDE.md: restructure for editing-context (Conventions, workload-class
lifecycle, env vars, "Where to look when..."); drop generic defaults
already enforced by ruff; remove TORCHTITAN_TEST_GUIDE reference.
- torchtitan_training_lib.py: simplify tt_config to {module}_{model_size};
use self.master_address in single-node torchrun rdzv_endpoint;
collapse duplicate single-node docker exec loop.
- Delete TORCHTITAN_TEST_GUIDE.md.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Remove empty __init__.py from tests/training/torchtitan
Not required for pytest collection (megatron/ and jax/ siblings work
without it) nor for sdist packaging (MANIFEST.in uses recursive-include).
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* torchtitan: generated TOML, config-driven root, robust launcher
- Add config-driven TOML generation: use_generated_config=True writes a
TOML from model_params to {scripts_dir}/run_config.toml; False falls
back to the canned TOML shipped with TorchTitan (path now includes
.toml extension).
- Parameterize torchtitan_root (default /workspace/torchtitan); 7 new
per-model fields (hf_assets_path, converters, dataset, lr,
warmup_steps, enable_async_tensor_parallel,
precompute_float8_dynamic_scale_for_fsdp) populate the generated TOML.
- Inline config_file_path into torchrun instead of exporting
\$CONFIG_FILE: the wrapper-script echo was expanding \$CONFIG_FILE
against the SSH session env at echo-time, silently overriding our
export with a stale /tmp path.
- Wrap download + sleep + torchrun chain in nohup sh -c '...' & disown
with stdio detached so it survives docker exec returning.
- Truncate training.log before download/sleep window so the poller
doesn't scan stale prior-run output during the launch gap; bump
pre-poll wait from 80s to 300s to clear download + sleep 200.
- Align with megatron reference: pattern-table-driven regex parsing
(TRAINING_RESULT_PATTERNS, TRAINING_PROGRESS_PATTERNS_TEMPLATE,
TRAINING_NAN_PATTERNS) + helpers, configurable hca_id_pattern with
re.escape per segment, fuller docstrings on lifecycle methods.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* torchtitan configs: default use_generated_config to False
Switch the shipped templates to the canned-TOML path so customers get
the upstream TorchTitan preset out of the box. Customers who want full
JSON-driven config can flip this to "True".
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fmt: auto-format torchtitan_training_lib.py
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* remove unused TorchTitanLlamaTrainingJob alias
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* make fmt changes
* torchtitan: add qwen3_32b tests, sync HF download, log-format fixes
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* adding torchtitan deepseek_16b models test files and config files
* removed CLAUDE.md file
---------
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
Co-authored-by: Rajesh Thummala <rthummal@amd.com>
Co-authored-by: Rajesh Thummala <rthummal_amdeng>
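The launcher changes described in the "torchtitan: generated TOML, config-driven root, robust launcher" commit (inlined config path, truncated log, detached nohup chain) can be sketched as a command builder. This is an illustration under stated assumptions, not the code from torchtitan_training_lib.py: the function name, the paths, the placeholder download command, and the use of `--job.config_file` as the TorchTitan config flag are all assumptions, and `disown` presumes the outer shell is bash.

```python
import shlex


def build_launch_command(config_file_path, log_path, download_cmd,
                         torchrun_args, settle_seconds=200):
    """Build one shell line that starts training detached from the caller."""
    log = shlex.quote(log_path)
    # Inline the config path as a quoted literal. Nothing like "$CONFIG_FILE"
    # is left for an intermediate shell (or an echo into a wrapper script)
    # to expand against the wrong environment.
    train = " ".join(
        ["torchrun"]
        + [shlex.quote(arg) for arg in torchrun_args]
        + ["--job.config_file", shlex.quote(config_file_path)]
    )
    # ": > log" truncates the log first, so a poller never reads output left
    # over from a previous run during the download + sleep window.
    chain = (
        f": > {log} && {download_cmd} && sleep {int(settle_seconds)} "
        f"&& {train} >> {log} 2>&1"
    )
    # nohup plus detached stdin/stdout/stderr lets the chain keep running
    # after the invoking `docker exec` returns.
    return f"nohup sh -c {shlex.quote(chain)} > /dev/null 2>&1 < /dev/null & disown"


# Illustrative arguments only; none of these values come from the PR.
cmd = build_launch_command(
    config_file_path="/workspace/scripts/run_config.toml",
    log_path="/workspace/scripts/training.log",
    download_cmd="echo 'download placeholder'",
    torchrun_args=["--nnodes", "1", "--nproc_per_node", "8",
                   "--rdzv_endpoint", "127.0.0.1:29500",
                   "-m", "torchtitan.train"],
)
tokens = shlex.split(cmd)
```

With a 200 second sleep inside the chain, a poller that starts reading the log too early sees only the truncated file, which is consistent with the commit raising the pre-poll wait to 300s.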