Training torchtitan #191

Merged
amd-rthummal merged 15 commits into main from training_torchtitan on Jun 23, 2026

Conversation

@sukesh-amd

Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

sukesh-amd and others added 8 commits May 28, 2026 13:49
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- torchtitan_training_lib.py: TorchTitanTrainingJob class for single-node
  and multi-node distributed training using torchrun + TOML config
- Test files: Llama 3.1 8B and 70B x single-node and multi-node
- Config templates: single-node and distributed JSON with model params
  for mi300x and mi325; single-node nnodes hardcoded to 1
- TORCHTITAN_TEST_GUIDE.md: full reference for running and extending tests

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- CLAUDE.md: restructure for editing-context (Conventions, workload-class
  lifecycle, env vars, "Where to look when..."); drop generic defaults
  already enforced by ruff; remove TORCHTITAN_TEST_GUIDE reference.
- torchtitan_training_lib.py: simplify tt_config to {module}_{model_size};
  use self.master_address in single-node torchrun rdzv_endpoint;
  collapse duplicate single-node docker exec loop.
- Delete TORCHTITAN_TEST_GUIDE.md.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Not required for pytest collection (megatron/ and jax/ siblings work
without it) nor for sdist packaging (MANIFEST.in uses recursive-include).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- Add config-driven TOML generation: use_generated_config=True writes a
  TOML from model_params to {scripts_dir}/run_config.toml; False falls
  back to the canned TOML shipped with TorchTitan (path now includes
  .toml extension).
- Parameterize torchtitan_root (default /workspace/torchtitan); 7 new
  per-model fields (hf_assets_path, converters, dataset, lr,
  warmup_steps, enable_async_tensor_parallel,
  precompute_float8_dynamic_scale_for_fsdp) populate the generated TOML.
- Inline config_file_path into torchrun instead of exporting
  $CONFIG_FILE: the wrapper-script echo was expanding $CONFIG_FILE
  against the SSH session env at echo-time, silently overriding our
  export with a stale /tmp path.
- Wrap download + sleep + torchrun chain in nohup sh -c '...' & disown
  with stdio detached so it survives docker exec returning.
- Truncate training.log before download/sleep window so the poller
  doesn't scan stale prior-run output during the launch gap; bump
  pre-poll wait from 80s to 300s to clear download + sleep 200.
- Align with megatron reference: pattern-table-driven regex parsing
  (TRAINING_RESULT_PATTERNS, TRAINING_PROGRESS_PATTERNS_TEMPLATE,
  TRAINING_NAN_PATTERNS) + helpers, configurable hca_id_pattern with
  re.escape per segment, fuller docstrings on lifecycle methods.
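
The launcher items in the list above (inlined config path, detached `nohup sh -c` chain, log truncation before the wait window) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in torchtitan_training_lib.py: `build_launch_command`, its parameters, and `download_cmd` are hypothetical names, and the `-m torchtitan.train --job.config_file` invocation is assumed from TorchTitan's launcher script and may differ between TorchTitan versions.

```python
import shlex


def build_launch_command(
    config_file_path: str,
    log_path: str,
    download_cmd: str,
    master_address: str,
    master_port: int = 29500,
    nnodes: int = 1,
    nproc_per_node: int = 8,
    settle_seconds: int = 200,
) -> str:
    """Return one shell line that starts training detached from the caller.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    endpoint = shlex.quote(f"{master_address}:{master_port}")
    torchrun = " ".join(
        [
            "torchrun",
            f"--nnodes={nnodes}",
            f"--nproc_per_node={nproc_per_node}",
            "--rdzv_backend=c10d",
            f"--rdzv_endpoint={endpoint}",
            "-m torchtitan.train",  # assumed entry point
            # The path is written into the command as a literal, so no shell
            # layer can re-expand a CONFIG_FILE variable to a stale value.
            f"--job.config_file {shlex.quote(config_file_path)}",
        ]
    )
    # Download, wait, then train, all inside one child shell.
    chain = f"{download_cmd} && sleep {settle_seconds} && {torchrun}"
    quoted_log = shlex.quote(log_path)
    return (
        # Truncate the log first so a poller never reads a previous run.
        f": > {quoted_log}; "
        # Detach stdio so the chain outlives the calling session.
        # `disown` is a bash builtin, so the outer shell is assumed to be bash.
        f"nohup sh -c {shlex.quote(chain)} >> {quoted_log} 2>&1 < /dev/null & disown"
    )
```

Quoting the whole chain with `shlex.quote` keeps the outer shell from expanding anything inside it, which is the same failure mode the inlined path is meant to avoid.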

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Switch the shipped templates to the canned-TOML path so customers get
the upstream TorchTitan preset out of the box. Customers who want full
JSON-driven config can flip this to "True".
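
For readers unfamiliar with the two config paths this flag selects, here is a minimal sketch of how such a switch might work. Only `use_generated_config`, `run_config.toml`, `scripts_dir`, and the `{module}_{model_size}` naming come from this PR; `render_toml`, `as_bool`, `resolve_config_path`, `canned_dir`, the example value `llama3_8b`, and the grouping of fields into TOML sections are assumptions made for illustration.

```python
from pathlib import Path


def _toml_value(value) -> str:
    """Format a Python scalar or flat list as a TOML value."""
    if isinstance(value, bool):  # check bool before int: True is also an int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(_toml_value(v) for v in value) + "]"
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_toml(sections: dict) -> str:
    """Render {section: {key: value}} as flat TOML tables."""
    lines = []
    for section, values in sections.items():
        lines.append(f"[{section}]")
        lines.extend(f"{key} = {_toml_value(val)}" for key, val in values.items())
        lines.append("")
    return "\n".join(lines)


def as_bool(flag) -> bool:
    """Accept real booleans or JSON strings such as "True"."""
    return flag if isinstance(flag, bool) else str(flag).strip().lower() == "true"


def resolve_config_path(use_generated_config, sections, scripts_dir, canned_dir, tt_config) -> str:
    """Write a generated TOML, or point at a canned one (hypothetical layout)."""
    if as_bool(use_generated_config):
        path = Path(scripts_dir) / "run_config.toml"
        path.write_text(render_toml(sections))
        return str(path)
    # Canned path: tt_config is {module}_{model_size}, with the .toml extension.
    return str(Path(canned_dir) / f"{tt_config}.toml")
```

A call such as `resolve_config_path("True", {"optimizer": {"lr": 0.0003}}, scripts_dir, canned_dir, "llama3_8b")` would write the generated file, while `"False"` would return the canned path without touching disk.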

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@sukesh-amd sukesh-amd requested a review from cijohnson May 29, 2026 14:17
@sukesh-amd sukesh-amd requested review from amd-droy and solaiys June 10, 2026 15:54

@cijohnson cijohnson left a comment

Collaborator


If CLAUDE.md was accidentally added to the repo, please remove it.

@amd-rthummal amd-rthummal merged commit 58b618a into main Jun 23, 2026
2 checks passed
sarachoi-amd pushed a commit that referenced this pull request Jun 26, 2026
* Add CLAUDE.md with codebase guidance for Claude Code

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Update CLAUDE.md with project structure and full lib/ listing

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Add TorchTitan training tests, lib, config templates, and guide

- torchtitan_training_lib.py: TorchTitanTrainingJob class for single-node
  and multi-node distributed training using torchrun + TOML config
- Test files: Llama 3.1 8B and 70B x single-node and multi-node
- Config templates: single-node and distributed JSON with model params
  for mi300x and mi325; single-node nnodes hardcoded to 1
- TORCHTITAN_TEST_GUIDE.md: full reference for running and extending tests

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Rewrite CLAUDE.md as Claude-focused guidance; minor torchtitan cleanup

- CLAUDE.md: restructure for editing-context (Conventions, workload-class
  lifecycle, env vars, "Where to look when..."); drop generic defaults
  already enforced by ruff; remove TORCHTITAN_TEST_GUIDE reference.
- torchtitan_training_lib.py: simplify tt_config to {module}_{model_size};
  use self.master_address in single-node torchrun rdzv_endpoint;
  collapse duplicate single-node docker exec loop.
- Delete TORCHTITAN_TEST_GUIDE.md.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* Remove empty __init__.py from tests/training/torchtitan

Not required for pytest collection (megatron/ and jax/ siblings work
without it) nor for sdist packaging (MANIFEST.in uses recursive-include).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* torchtitan: generated TOML, config-driven root, robust launcher

- Add config-driven TOML generation: use_generated_config=True writes a
  TOML from model_params to {scripts_dir}/run_config.toml; False falls
  back to the canned TOML shipped with TorchTitan (path now includes
  .toml extension).
- Parameterize torchtitan_root (default /workspace/torchtitan); 7 new
  per-model fields (hf_assets_path, converters, dataset, lr,
  warmup_steps, enable_async_tensor_parallel,
  precompute_float8_dynamic_scale_for_fsdp) populate the generated TOML.
- Inline config_file_path into torchrun instead of exporting
  $CONFIG_FILE: the wrapper-script echo was expanding $CONFIG_FILE
  against the SSH session env at echo-time, silently overriding our
  export with a stale /tmp path.
- Wrap download + sleep + torchrun chain in nohup sh -c '...' & disown
  with stdio detached so it survives docker exec returning.
- Truncate training.log before download/sleep window so the poller
  doesn't scan stale prior-run output during the launch gap; bump
  pre-poll wait from 80s to 300s to clear download + sleep 200.
- Align with megatron reference: pattern-table-driven regex parsing
  (TRAINING_RESULT_PATTERNS, TRAINING_PROGRESS_PATTERNS_TEMPLATE,
  TRAINING_NAN_PATTERNS) + helpers, configurable hca_id_pattern with
  re.escape per segment, fuller docstrings on lifecycle methods.
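
The pattern-table approach named in the last item above might look roughly like the sketch below. The table names `TRAINING_RESULT_PATTERNS` and `TRAINING_NAN_PATTERNS` are from the commit message, but the regexes, the sample log format, and the helpers `parse_results`, `has_nan`, and `build_hca_regex` are assumptions. In particular, treating `hca_id_pattern` as a `*`-separated glob is only one possible reading of "re.escape per segment".

```python
import re

# Table names come from the commit message; the regexes are illustrative guesses.
TRAINING_RESULT_PATTERNS = {
    "loss": r"loss:\s*([0-9]+(?:\.[0-9]+)?)",
    "tps": r"tps:\s*([0-9][0-9,]*)",
    "mfu": r"mfu:\s*([0-9]+(?:\.[0-9]+)?)%",
}
TRAINING_NAN_PATTERNS = [r"loss:\s*nan\b", r"loss:\s*[+-]?inf\b"]


def parse_results(log_text: str) -> dict:
    """Return the last match of each result pattern found in the log."""
    results = {}
    for name, pattern in TRAINING_RESULT_PATTERNS.items():
        matches = re.findall(pattern, log_text)
        if matches:
            results[name] = float(matches[-1].replace(",", ""))
    return results


def has_nan(log_text: str) -> bool:
    """Report whether any NaN/inf pattern appears anywhere in the log."""
    return any(re.search(p, log_text, re.IGNORECASE) for p in TRAINING_NAN_PATTERNS)


def build_hca_regex(hca_id_pattern: str) -> re.Pattern:
    """Turn a glob-like pattern such as 'mlx5_*' into an anchored regex.

    Each literal segment goes through re.escape, so only '*' acts as a wildcard.
    """
    segments = hca_id_pattern.split("*")
    return re.compile("^" + ".*".join(re.escape(s) for s in segments) + "$")
```

Keeping the regexes in tables means a log-format change upstream only requires editing a pattern entry, not the parsing logic.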

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* torchtitan configs: default use_generated_config to False

Switch the shipped templates to the canned-TOML path so customers get
the upstream TorchTitan preset out of the box. Customers who want full
JSON-driven config can flip this to "True".

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fmt: auto-format torchtitan_training_lib.py

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* remove unused TorchTitanLlamaTrainingJob alias

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* make fmt changes

* torchtitan: add qwen3_32b tests, sync HF download, log-format fixes

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* adding torchtitan deepseek_16b models test files and config files

* removed CLAUDE.md file

---------

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
Co-authored-by: Rajesh Thummala <rthummal@amd.com>
Co-authored-by: Rajesh Thummala <rthummal_amdeng>
3 participants