Update OG2 README convergence metrics and image, fix DATASET.md path example #1534

Status: Open

savitha-eng wants to merge 6 commits into main
Conversation
Update OG2 README convergence metrics and image, fix DATASET.md path example

Update convergence benchmarks table with final training results (train loss 0.9444, test CE loss 0.9204, test perplexity 2.51), replace convergence plot with updated curve, and add missing `data_files: null` to DATASET.md custom sharded parquet config example.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
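The reported perplexity follows directly from the cross-entropy loss, since perplexity = exp(CE loss). A quick sanity check of the numbers in the updated table:

```python
import math

# Test CE loss reported in the updated convergence benchmarks table
test_ce_loss = 0.9204

# Perplexity is the exponential of the cross-entropy loss
perplexity = math.exp(test_ce_loss)

print(round(perplexity, 2))  # 2.51, matching the reported test perplexity
```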
Signed-off-by: savitha-eng <savithas@nvidia.com>
Metric names are `train/loss` and `train/learning_rate` (not `train_loss`). Added graceful handling for empty history and missing runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
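A minimal sketch of the kind of graceful handling this commit describes, assuming history rows are dicts keyed by metric name; the helper name and data shapes here are illustrative, not the actual implementation:

```python
def last_metric(history, key, default=None):
    """Return the most recent value of `key` from a run's history.

    Tolerates an empty history (missing run) and rows that lack the
    metric, instead of raising KeyError/IndexError.
    """
    values = [row[key] for row in history if key in row]
    return values[-1] if values else default

# An empty history (e.g. a run that never logged) yields the default
assert last_metric([], "train/loss") is None

# Note the slash-separated names: "train/loss", not "train_loss"
history = [
    {"train/loss": 1.21, "train/learning_rate": 3e-4},
    {"train/loss": 0.94},
]
assert last_metric(history, "train/loss") == 0.94
```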
Replace broken `fp8_first_last_bf16` mechanism with `resolve_layer_precision()` from PR #1500. The old approach set config attributes that were never read by the forward pass, causing all layers to default to FP8 regardless of setting.

Key changes:
- Delete `fp8_debugging.py`; add `quantization.py` with `resolve_layer_precision()` and `initialize_quant_stats_logging()`
- Add `set_recipes()`/`get_layer_autocast()` to the OG2 model (from the lepton branch); the model now handles per-layer autocast internally
- Model constructor accepts `fp8_recipe`/`fp4_recipe`; `set_recipes()` is called after FSDP wrapping since recipes aren't serializable
- Remove the outer `te.autocast()` from the training loop (the model handles it)
- Rename `fp8_stats_config` -> `quant_stats_config` throughout
- Add `_parse_layers_cfg()` for CLI string support
- Add `og2_7b_fp8_fl1_pq2.yaml` with explicit `fp8_layers=[2..31]`
- Expand `fp8_debugging_stats.yaml` with all layer types + `LogTensorStats`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
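A sketch of what a per-layer precision resolver in the spirit of `resolve_layer_precision()` could look like; the signature and semantics here are assumptions inferred from the commit message, not the code from PR #1500:

```python
def resolve_layer_precision(layer_idx, fp8_layers, default="bf16"):
    """Return the compute precision for a 1-indexed transformer layer.

    Unlike the old fp8_first_last_bf16 flag (whose config attributes the
    forward pass never read), the decision here is explicit per layer:
    a layer runs in FP8 only if it is listed in `fp8_layers`.
    """
    return "fp8" if layer_idx in fp8_layers else default

# og2_7b_fp8_fl1_pq2 declares fp8_layers=[2..31], assuming a 32-layer model
fp8_layers = set(range(2, 32))
assert resolve_layer_precision(1, fp8_layers) == "bf16"   # first layer stays BF16
assert resolve_layer_precision(16, fp8_layers) == "fp8"   # interior layers in FP8
assert resolve_layer_precision(32, fp8_layers) == "bf16"  # last layer stays BF16
```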
- `og2_7b_fp8_fl2_pq2`: layers 1, 2, 31, 32 in BF16; layers 3-30 in FP8
- `og2_7b_fp8_mid1_pq2`: only layer 16 in FP8, all others BF16
- Fix `og2_7b_fp8_fl1_pq2`: add parquet2 dataset override (`data_files=null`)

All configs use `path=/data/opengenome2/parquet2` with `data_files=null`, unique checkpoint dirs, and unique wandb names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
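The three layer layouts above can be checked mechanically. The config names are from the commit; the mapping, helper, and assumed 32-layer depth are purely illustrative:

```python
NUM_LAYERS = 32  # assumed model depth, consistent with "layers 1,2,31,32" above

# FP8 layer sets per config, 1-indexed
CONFIG_FP8_LAYERS = {
    "og2_7b_fp8_fl2_pq2": set(range(3, 31)),   # first/last two layers stay BF16
    "og2_7b_fp8_mid1_pq2": {16},               # only the middle layer in FP8
    "og2_7b_fp8_fl1_pq2": set(range(2, 32)),   # first/last single layer stays BF16
}

def bf16_layers(config_name):
    """Layers left in BF16: the complement of the config's FP8 set."""
    fp8 = CONFIG_FP8_LAYERS[config_name]
    return sorted(set(range(1, NUM_LAYERS + 1)) - fp8)

assert bf16_layers("og2_7b_fp8_fl2_pq2") == [1, 2, 31, 32]
assert bf16_layers("og2_7b_fp8_fl1_pq2") == [1, 32]
assert 16 not in bf16_layers("og2_7b_fp8_mid1_pq2")
```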
Description
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources. Pull requests from authorized users will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123). Otherwise, a maintainer can leave an /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering CodeRabbit AI Review
To trigger a code review from CodeRabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist