
Update OG2 README convergence metrics and image, fix DATASET.md path …#1534

Open
savitha-eng wants to merge 6 commits into main from savitha/update-og2-readme-metrics

Conversation

@savitha-eng (Collaborator)

…example

Update the convergence benchmarks table with final training results (train loss 0.9444, test CE loss 0.9204, test perplexity 2.51), replace the convergence plot with the updated curve, and add the missing data_files: null to the custom sharded parquet config example in DATASET.md.
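The shape of the DATASET.md fix can be sketched as a minimal config fragment. Only path and the previously missing data_files: null are taken from this PR; the surrounding keys are illustrative assumptions, not the repository's actual schema:

```yaml
# Hypothetical sketch of the custom sharded parquet dataset config.
# Only `path` and the previously missing `data_files: null` come from this PR.
dataset:
  path: /data/opengenome2/parquet2   # directory containing the parquet shards
  data_files: null                   # let the loader discover shards instead of expecting an explicit file list
```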

Description

Usage

TODO: Add code snippet

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebook execution tests for bionemo2
  • ciflow:slow - Run slow single-GPU integration tests marked @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. Use this label to force running all bionemo2 tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). Use this label to force running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from CodeRabbit, comment on a pull request with one of the supported review commands.

See https://docs.coderabbit.ai/reference/review-commands for the full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

…example


Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
copy-pr-bot bot commented Mar 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Mar 23, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: df2fc2ea-be47-43d1-a913-6e883bf2f253

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: savitha-eng <savithas@nvidia.com>
savitha-eng and others added 4 commits March 23, 2026 23:33
Metric names are train/loss and train/learning_rate (not train_loss).
Added graceful handling for empty history and missing runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
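The graceful-handling behavior described in this commit can be sketched in a few lines of Python. The helper name and the list-of-dicts history shape are assumptions for illustration, not the PR's actual code; the metric names (slash form, e.g. "train/loss" rather than "train_loss") are from the commit message:

```python
def last_metric(history, key="train/loss"):
    """Return the most recent logged value for `key`, or None if unavailable.

    `history` is a list of per-step metric dicts, as a wandb-style history
    scan might yield. Names use the slash form ("train/loss",
    "train/learning_rate"), not "train_loss". An empty history or a
    never-logged metric returns None instead of raising.
    """
    for row in reversed(history):
        if key in row and row[key] is not None:
            return row[key]
    return None  # empty history, or the metric was never logged
```

A run that logged two steps returns the latest value; an empty or missing run degrades to None rather than crashing the report script.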
Replace broken fp8_first_last_bf16 mechanism with resolve_layer_precision()
from PR #1500. The old approach set config attributes that were never read by
the forward pass, causing all layers to default to FP8 regardless of setting.

Key changes:
- Delete fp8_debugging.py, add quantization.py with resolve_layer_precision()
  and initialize_quant_stats_logging()
- Add set_recipes()/get_layer_autocast() to OG2 model (from lepton branch),
  model now handles per-layer autocast internally
- Model constructor accepts fp8_recipe/fp4_recipe, set_recipes() called after
  FSDP wrapping since recipes aren't serializable
- Remove outer te.autocast() from training loop (model handles it)
- Rename fp8_stats_config -> quant_stats_config throughout
- Add _parse_layers_cfg() for CLI string support
- Add og2_7b_fp8_fl1_pq2.yaml with explicit fp8_layers=[2..31]
- Expand fp8_debugging_stats.yaml with all layer types + LogTensorStats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
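As a rough illustration of the per-layer precision idea in this commit, the two helpers named above might look like the following. The signatures and return values here are guesses for exposition, not the code from PR #1500:

```python
def parse_layers_cfg(spec: str) -> list[int]:
    """Parse a CLI-style layer spec such as "2..31" or "1,2,31,32" into indices."""
    if ".." in spec:
        lo, hi = spec.split("..")
        return list(range(int(lo), int(hi) + 1))
    return [int(s) for s in spec.split(",") if s]

def resolve_layer_precision(layer_idx: int, fp8_layers: list[int]) -> str:
    """Layers listed in fp8_layers run in FP8; everything else falls back to BF16."""
    return "fp8" if layer_idx in fp8_layers else "bf16"

# fp8_layers=[2..31] keeps the first and last layers of a 32-layer model in BF16
fp8_layers = parse_layers_cfg("2..31")
print(resolve_layer_precision(1, fp8_layers))   # bf16
print(resolve_layer_precision(16, fp8_layers))  # fp8
```

The point of resolving precision per layer up front, rather than via config attributes, is that the forward pass can consult one explicit answer per layer instead of silently defaulting to FP8.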
- og2_7b_fp8_fl2_pq2: layers 1,2,31,32 in BF16, layers 3-30 in FP8
- og2_7b_fp8_mid1_pq2: only layer 16 in FP8, all others BF16
- Fix og2_7b_fp8_fl1_pq2: add parquet2 dataset override (data_files=null)

All configs use path=/data/opengenome2/parquet2 with data_files=null,
unique checkpoint dirs, and unique wandb names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
