
Update OG2 README convergence metrics and image, fix DATASET.md path …#1534

Open
savitha-eng wants to merge 6 commits into main from savitha/update-og2-readme-metrics

Conversation

@savitha-eng (Collaborator)

…example

Update the convergence benchmarks table with final training results (train loss 0.9444, test CE loss 0.9204, test perplexity 2.51), replace the convergence plot with the updated curve, and add the missing data_files: null to the custom sharded parquet config example in DATASET.md.
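The shape of the DATASET.md fix can be sketched as a minimal config fragment. Only path and the previously missing data_files: null are taken from this PR; the surrounding keys are illustrative assumptions, not the repository's actual schema:

```yaml
# Hypothetical sketch of the custom sharded parquet dataset config.
# Only `path` and the previously missing `data_files: null` come from this PR.
dataset:
  path: /data/opengenome2/parquet2   # directory containing the parquet shards
  data_files: null                   # let the loader discover shards instead of expecting an explicit file list
```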

Description

Usage

TODO: Add code snippet

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebook execution tests for bionemo2
  • ciflow:slow - Run slow single-GPU integration tests marked @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. Use this label to force running all bionemo2 tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). Use this label to force running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from CodeRabbit, comment on a pull request with one of the supported review commands.

See https://docs.coderabbit.ai/reference/review-commands for the full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

…example


Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
copy-pr-bot bot commented Mar 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Mar 23, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: df2fc2ea-be47-43d1-a913-6e883bf2f253

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: savitha-eng <savithas@nvidia.com>
savitha-eng and others added 4 commits March 23, 2026 23:33
Metric names are train/loss and train/learning_rate (not train_loss).
Added graceful handling for empty history and missing runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
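The graceful-handling behavior described in this commit can be sketched in a few lines of Python. The helper name and the list-of-dicts history shape are assumptions for illustration, not the PR's actual code; the metric names (slash form, e.g. "train/loss" rather than "train_loss") are from the commit message:

```python
def last_metric(history, key="train/loss"):
    """Return the most recent logged value for `key`, or None if unavailable.

    `history` is a list of per-step metric dicts, as a wandb-style history
    scan might yield. Names use the slash form ("train/loss",
    "train/learning_rate"), not "train_loss". An empty history or a
    never-logged metric returns None instead of raising.
    """
    for row in reversed(history):
        if key in row and row[key] is not None:
            return row[key]
    return None  # empty history, or the metric was never logged
```

A run that logged two steps returns the latest value; an empty or missing run degrades to None rather than crashing the report script.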
Replace broken fp8_first_last_bf16 mechanism with resolve_layer_precision()
from PR #1500. The old approach set config attributes that were never read by
the forward pass, causing all layers to default to FP8 regardless of setting.

Key changes:
- Delete fp8_debugging.py, add quantization.py with resolve_layer_precision()
  and initialize_quant_stats_logging()
- Add set_recipes()/get_layer_autocast() to OG2 model (from lepton branch),
  model now handles per-layer autocast internally
- Model constructor accepts fp8_recipe/fp4_recipe, set_recipes() called after
  FSDP wrapping since recipes aren't serializable
- Remove outer te.autocast() from training loop (model handles it)
- Rename fp8_stats_config -> quant_stats_config throughout
- Add _parse_layers_cfg() for CLI string support
- Add og2_7b_fp8_fl1_pq2.yaml with explicit fp8_layers=[2..31]
- Expand fp8_debugging_stats.yaml with all layer types + LogTensorStats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
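As a rough illustration of the per-layer precision idea in this commit, the two helpers named above might look like the following. The signatures and return values here are guesses for exposition, not the code from PR #1500:

```python
def parse_layers_cfg(spec: str) -> list[int]:
    """Parse a CLI-style layer spec such as "2..31" or "1,2,31,32" into indices."""
    if ".." in spec:
        lo, hi = spec.split("..")
        return list(range(int(lo), int(hi) + 1))
    return [int(s) for s in spec.split(",") if s]

def resolve_layer_precision(layer_idx: int, fp8_layers: list[int]) -> str:
    """Layers listed in fp8_layers run in FP8; everything else falls back to BF16."""
    return "fp8" if layer_idx in fp8_layers else "bf16"

# fp8_layers=[2..31] keeps the first and last layers of a 32-layer model in BF16
fp8_layers = parse_layers_cfg("2..31")
print(resolve_layer_precision(1, fp8_layers))   # bf16
print(resolve_layer_precision(16, fp8_layers))  # fp8
```

The point of resolving precision per layer up front, rather than via config attributes, is that the forward pass can consult one explicit answer per layer instead of silently defaulting to FP8.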
- og2_7b_fp8_fl2_pq2: layers 1,2,31,32 in BF16, layers 3-30 in FP8
- og2_7b_fp8_mid1_pq2: only layer 16 in FP8, all others BF16
- Fix og2_7b_fp8_fl1_pq2: add parquet2 dataset override (data_files=null)

All configs use path=/data/opengenome2/parquet2 with data_files=null,
unique checkpoint dirs, and unique wandb names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
