Training torchtitan by sukesh-amd · Pull Request #191 · ROCm/cvs

sukesh-amd · 2026-05-29T14:17:47Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

- torchtitan_training_lib.py: TorchTitanTrainingJob class for single-node and multi-node distributed training using torchrun + TOML config
- Test files: Llama 3.1 8B and 70B x single-node and multi-node
- Config templates: single-node and distributed JSON with model params for mi300x and mi325; single-node nnodes hardcoded to 1
- TORCHTITAN_TEST_GUIDE.md: full reference for running and extending tests

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

- CLAUDE.md: restructure for editing-context (Conventions, workload-class lifecycle, env vars, "Where to look when..."); drop generic defaults already enforced by ruff; remove TORCHTITAN_TEST_GUIDE reference.
- torchtitan_training_lib.py: simplify tt_config to {module}_{model_size}; use self.master_address in single-node torchrun rdzv_endpoint; collapse duplicate single-node docker exec loop.
- Delete TORCHTITAN_TEST_GUIDE.md.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
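As a rough illustration of the two torchtitan_training_lib.py changes above, the sketch below derives a config name as {module}_{model_size} and points the single-node rendezvous endpoint at the master address instead of a fixed host. The function names, the port 29500, and the `torchtitan.train` entry module are assumptions made for this example, not code taken from the PR.

```python
from typing import List


def tt_config_name(module: str, model_size: str) -> str:
    """Build the config name in the simplified {module}_{model_size} form."""
    return f"{module}_{model_size}"


def build_single_node_torchrun(
    master_address: str,
    nproc_per_node: int,
    config_file_path: str,
    master_port: int = 29500,  # assumed port; the PR does not state one
) -> List[str]:
    """Return a torchrun argv list for a single-node run.

    The rendezvous endpoint uses the configured master address, so the
    single-node path resolves its endpoint the same way a multi-node run would.
    """
    return [
        "torchrun",
        "--nnodes=1",  # single-node templates hardcode nnodes to 1
        f"--nproc_per_node={nproc_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={master_address}:{master_port}",
        "-m",
        "torchtitan.train",  # assumed entry module
        "--job.config_file",
        config_file_path,
    ]
```

For example, `tt_config_name("llama3", "8b")` yields `llama3_8b`, and the argv list can be joined with `shlex.join` before being passed to a remote shell.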

Not required for pytest collection (megatron/ and jax/ siblings work without it) nor for sdist packaging (MANIFEST.in uses recursive-include).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

- Add config-driven TOML generation: use_generated_config=True writes a TOML from model_params to {scripts_dir}/run_config.toml; False falls back to the canned TOML shipped with TorchTitan (path now includes .toml extension).
- Parameterize torchtitan_root (default /workspace/torchtitan); 7 new per-model fields (hf_assets_path, converters, dataset, lr, warmup_steps, enable_async_tensor_parallel, precompute_float8_dynamic_scale_for_fsdp) populate the generated TOML.
- Inline config_file_path into torchrun instead of exporting $CONFIG_FILE: the wrapper-script echo was expanding $CONFIG_FILE against the SSH session env at echo-time, silently overriding our export with a stale /tmp path.
- Wrap download + sleep + torchrun chain in nohup sh -c '...' & disown with stdio detached so it survives docker exec returning.
- Truncate training.log before download/sleep window so the poller doesn't scan stale prior-run output during the launch gap; bump pre-poll wait from 80s to 300s to clear download + sleep 200.
- Align with megatron reference: pattern-table-driven regex parsing (TRAINING_RESULT_PATTERNS, TRAINING_PROGRESS_PATTERNS_TEMPLATE, TRAINING_NAN_PATTERNS) + helpers, configurable hca_id_pattern with re.escape per segment, fuller docstrings on lifecycle methods.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
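The launcher items above (inlined config path, detached nohup chain, log truncation) can be sketched as a single command builder. This is a minimal sketch under stated assumptions: `build_detached_launch` and its parameters are hypothetical names, and the outer shell is assumed to be bash because `disown` is a bash builtin. The actual implementation in torchtitan_training_lib.py may differ.

```python
import shlex


def build_detached_launch(
    download_cmd: str,
    torchrun_cmd: str,
    config_file_path: str,
    log_path: str,
    settle_seconds: int = 200,
) -> str:
    """Compose a shell command that outlives the `docker exec` that starts it."""
    # The config path is written into the command as a literal. No shell
    # variable is involved, so an intermediate echo or SSH session cannot
    # expand it to a stale value.
    inner = " && ".join(
        [
            download_cmd,
            f"sleep {int(settle_seconds)}",
            f"{torchrun_cmd} --job.config_file {shlex.quote(config_file_path)}",
        ]
    )
    log = shlex.quote(log_path)
    return (
        # Truncate the log first so a poller never reads prior-run output.
        f": > {log}; "
        # Detach stdin/stdout/stderr so the chain survives the parent exiting.
        f"nohup sh -c {shlex.quote(inner)} >> {log} 2>&1 < /dev/null & disown"
    )
```

Because the whole chain waits for the download plus the sleep before torchrun starts, any poller reading the log has to wait at least that long before expecting output, which is consistent with the pre-poll wait being raised to 300s.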

Switch the shipped templates to the canned-TOML path so customers get the upstream TorchTitan preset out of the box. Customers who want full JSON-driven config can flip this to "True".

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
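A sketch of how the use_generated_config toggle could select between the two paths is shown below. It assumes the JSON templates carry the flag as the string "True" or "False", as the wording above suggests. The helper names, the `canned_dir` and `tt_config` keys, and the TOML section layout are hypothetical and chosen only for illustration; they are not taken from the PR.

```python
import json
from pathlib import PurePosixPath
from typing import Dict, Optional, Tuple


def as_bool(value) -> bool:
    """Accept real booleans as well as the strings "True" / "False"."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def toml_value(value) -> str:
    """Format a scalar as a TOML literal (booleans, numbers, strings only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    return json.dumps(str(value))  # double-quoted, escaped string


def render_toml(sections: Dict[str, Dict[str, object]]) -> str:
    """Render {section: {key: value}} as TOML text."""
    lines = []
    for name, table in sections.items():
        lines.append(f"[{name}]")
        lines.extend(f"{key} = {toml_value(val)}" for key, val in table.items())
        lines.append("")
    return "\n".join(lines)


def resolve_config(
    params: dict,
    scripts_dir: str,
    torchtitan_root: str = "/workspace/torchtitan",
) -> Tuple[str, Optional[str]]:
    """Return (config_path, toml_text); toml_text is None for the canned path."""
    if as_bool(params.get("use_generated_config", False)):
        path = PurePosixPath(scripts_dir) / "run_config.toml"
        text = render_toml(
            {
                # Section names are assumptions about the TOML layout.
                "training": {"dataset": params["dataset"]},
                "optimizer": {"lr": params["lr"]},
                "lr_scheduler": {"warmup_steps": params["warmup_steps"]},
            }
        )
        return str(path), text
    # Canned preset: hypothetical directory key plus {tt_config}.toml.
    canned = PurePosixPath(torchtitan_root) / params["canned_dir"] / (
        params["tt_config"] + ".toml"
    )
    return str(canned), None
```

With the shipped default of "False", `resolve_config` returns only a path under torchtitan_root and leaves the upstream preset untouched; flipping the flag to "True" returns the text to write to {scripts_dir}/run_config.toml.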

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

cijohnson

If CLAUDE.md was accidentally added to the repo, please remove it.

* Add CLAUDE.md with codebase guidance for Claude Code

  Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Update CLAUDE.md with project structure and full lib/ listing

  Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Add TorchTitan training tests, lib, config templates, and guide

  - torchtitan_training_lib.py: TorchTitanTrainingJob class for single-node and multi-node distributed training using torchrun + TOML config
  - Test files: Llama 3.1 8B and 70B x single-node and multi-node
  - Config templates: single-node and distributed JSON with model params for mi300x and mi325; single-node nnodes hardcoded to 1
  - TORCHTITAN_TEST_GUIDE.md: full reference for running and extending tests

  Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Rewrite CLAUDE.md as Claude-focused guidance; minor torchtitan cleanup

  - CLAUDE.md: restructure for editing-context (Conventions, workload-class lifecycle, env vars, "Where to look when..."); drop generic defaults already enforced by ruff; remove TORCHTITAN_TEST_GUIDE reference.
  - torchtitan_training_lib.py: simplify tt_config to {module}_{model_size}; use self.master_address in single-node torchrun rdzv_endpoint; collapse duplicate single-node docker exec loop.
  - Delete TORCHTITAN_TEST_GUIDE.md.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* Remove empty __init__.py from tests/training/torchtitan

  Not required for pytest collection (megatron/ and jax/ siblings work without it) nor for sdist packaging (MANIFEST.in uses recursive-include).

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* torchtitan: generated TOML, config-driven root, robust launcher

  - Add config-driven TOML generation: use_generated_config=True writes a TOML from model_params to {scripts_dir}/run_config.toml; False falls back to the canned TOML shipped with TorchTitan (path now includes .toml extension).
  - Parameterize torchtitan_root (default /workspace/torchtitan); 7 new per-model fields (hf_assets_path, converters, dataset, lr, warmup_steps, enable_async_tensor_parallel, precompute_float8_dynamic_scale_for_fsdp) populate the generated TOML.
  - Inline config_file_path into torchrun instead of exporting $CONFIG_FILE: the wrapper-script echo was expanding $CONFIG_FILE against the SSH session env at echo-time, silently overriding our export with a stale /tmp path.
  - Wrap download + sleep + torchrun chain in nohup sh -c '...' & disown with stdio detached so it survives docker exec returning.
  - Truncate training.log before download/sleep window so the poller doesn't scan stale prior-run output during the launch gap; bump pre-poll wait from 80s to 300s to clear download + sleep 200.
  - Align with megatron reference: pattern-table-driven regex parsing (TRAINING_RESULT_PATTERNS, TRAINING_PROGRESS_PATTERNS_TEMPLATE, TRAINING_NAN_PATTERNS) + helpers, configurable hca_id_pattern with re.escape per segment, fuller docstrings on lifecycle methods.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* torchtitan configs: default use_generated_config to False

  Switch the shipped templates to the canned-TOML path so customers get the upstream TorchTitan preset out of the box. Customers who want full JSON-driven config can flip this to "True".

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fmt: auto-format torchtitan_training_lib.py

  Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* remove unused TorchTitanLlamaTrainingJob alias

  Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* make fmt changes

* torchtitan: add qwen3_32b tests, sync HF download, log-format fixes

  Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* adding torchtitan deepseek_16b models test files and config files

* removed CLAUDE.md file

---------

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
Co-authored-by: Rajesh Thummala <rthummal@amd.com>
Co-authored-by: Rajesh Thummala <rthummal_amdeng>

sukesh-amd and others added 8 commits May 28, 2026 13:49

Add CLAUDE.md with codebase guidance for Claude Code

5515827

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Update CLAUDE.md with project structure and full lib/ listing

f9ce690

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Remove empty __init__.py from tests/training/torchtitan

e5610cd

Not required for pytest collection (megatron/ and jax/ siblings work without it) nor for sdist packaging (MANIFEST.in uses recursive-include).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Merge branch 'main' of https://github.com/ROCm/cvs into claude-doc

0be7c88

sukesh-amd requested a review from cijohnson May 29, 2026 14:17

sukesh-amd and others added 6 commits May 29, 2026 20:32

fmt: auto-format torchtitan_training_lib.py

9590560

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

remove unused TorchTitanLlamaTrainingJob alias

4098d0f

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

make fmt changes

092db21

torchtitan: add qwen3_32b tests, sync HF download, log-format fixes

ca06e52

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into training_torchtitan

569986d

adding torchtitan deepseek_16b models test files and config files

0e870c7

sukesh-amd requested review from amd-droy and solaiys June 10, 2026 15:54

cijohnson approved these changes Jun 19, 2026

removed CLAUDE.md file

f3ca051

amd-rthummal merged commit 58b618a into main Jun 23, 2026
2 checks passed

amd-rthummal merged 15 commits into main from training_torchtitan

3 participants
