Regression detection for RLHF/finetuning pipelines. Automatically catch performance, stability, and correctness regressions before they waste researcher time.
No local GPU? No problem. Try RLHF Canary with a free T4 GPU in Google Colab (~15 min):
- Run a DPO canary training job
- Save metrics as a baseline
- Compare runs and detect regressions
| Notebook | Description |
|---|---|
| 01_quickstart | Core workflow for detecting regressions |
| 02_profiler_deep_dive | PyTorch profiler for GPU bottlenecks |
| 03_stability_monitoring | NaN/loss divergence detection |
| 04_root_cause_analysis | Heuristics for debugging failures |
| 05_ppo_canary | PPO-specific training validation |
| 06_sft_canary | SFT training validation |
| 07_ci_cd_integration | GitHub Actions & PR gating |
| 08_configuration_and_thresholds | Custom thresholds & test matrices |
| 09_quantization_and_memory | 4-bit/8-bit quantization & baselines |
| vscode_dev | Local VS Code + Colab development |
- Performance regression detection: Track tokens/sec, step time, GPU utilization, memory usage
- Stability monitoring: Detect NaN/Inf values, loss divergence, gradient explosion
- Root cause analysis: Heuristic-based diagnosis of regression causes
- CI/CD integration: GitHub Actions workflow with PR gating
- Flexible configuration: YAML-based configs for smoke, perf, and nightly tests
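In essence, the stability checks scan the loss series for non-finite values and gross divergence. A minimal sketch of that idea (hypothetical helper, not the shipped `canary` API; the divergence rule here simply compares early and late loss windows):

```python
import math

def check_stability(losses, window=20, divergence_factor=3.0):
    """Scan a loss series for NaN/Inf steps and gross divergence.

    Illustrative only. Divergence is flagged when the mean of the last
    `window` losses exceeds `divergence_factor` times the mean of the
    first `window` losses.
    """
    issues = []
    nan_steps = [i for i, x in enumerate(losses) if math.isnan(x)]
    inf_steps = [i for i, x in enumerate(losses) if math.isinf(x)]
    if nan_steps:
        issues.append(f"NaN loss at steps {nan_steps}")
    if inf_steps:
        issues.append(f"Inf loss at steps {inf_steps}")
    finite = [x for x in losses if math.isfinite(x)]
    if len(finite) >= 2 * window:
        early = sum(finite[:window]) / window
        late = sum(finite[-window:]) / window
        if early > 0 and late > divergence_factor * early:
            issues.append(f"loss diverged: {early:.3f} -> {late:.3f}")
    return issues
```

A healthy run returns an empty list; any entry is grounds for failing the canary immediately rather than waiting out the full step budget.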
```bash
# Clone the repository
git clone https://github.com/yourusername/rlhf-canary.git
cd rlhf-canary

# Install with pip
pip install -e ".[dev]"

# Or install dependencies directly
pip install -r requirements.txt
```

```bash
# Show environment info
canary env

# Run DPO smoke test
canary run configs/dpo_smoke.yaml
# The output will be saved to ./canary_output/<run_id>/metrics.json

# Save current run as baseline
canary save-baseline ./canary_output/<run_id>/metrics.json ./baselines/dpo_smoke.json

# Run again and compare
canary run configs/dpo_smoke.yaml
canary compare ./canary_output/<new_run_id>/metrics.json ./baselines/dpo_smoke.json
```

```bash
canary --help                      # Show all commands
canary env                         # Show environment fingerprint
canary run <config>                # Run a canary job
canary compare <current> <base>    # Compare metrics to baseline
canary save-baseline <src> <dst>   # Save metrics as baseline
canary init-config <path>          # Generate sample config
```

Canary jobs are configured via YAML files:
```yaml
# configs/dpo_smoke.yaml
name: dpo_smoke
description: DPO smoke test

# Model
model_name: EleutherAI/pythia-70m
use_peft: true
lora_r: 16

# Training
training_type: dpo
max_steps: 100
batch_size: 2
gradient_accumulation_steps: 4

# DPO-specific
beta: 0.1

# Dataset
dataset_name: Anthropic/hh-rlhf
dataset_size: 512
```

| Tier | Steps | Duration | Use Case |
|---|---|---|---|
| smoke | 100 | ~5 min | PR gating |
| perf | 500 | ~20 min | Detailed perf analysis |
| nightly | 2000 | ~2 hr | Comprehensive soak |
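Because the config shown above is flat `key: value` YAML, it maps naturally onto a small dataclass. A dependency-free sketch of such a loader (the real loader presumably uses PyYAML; `CanaryConfig`, its defaults, and the omission of boolean fields like `use_peft` are all simplifications for illustration):

```python
from dataclasses import dataclass, fields

@dataclass
class CanaryConfig:
    # Field names mirror the YAML keys shown above (subset only).
    name: str = "canary"
    model_name: str = "EleutherAI/pythia-70m"
    training_type: str = "dpo"
    max_steps: int = 100
    batch_size: int = 2
    gradient_accumulation_steps: int = 4
    beta: float = 0.1

def load_flat_yaml(text):
    """Parse flat `key: value` lines, skipping comments and unknown keys."""
    typed = {f.name: f.type for f in fields(CanaryConfig)}
    kwargs = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, _, value = (part.strip() for part in line.partition(":"))
        if key in typed:
            # Coerce the string value using the field's annotated type.
            kwargs[key] = typed[key](value)
    return CanaryConfig(**kwargs)
```

Typing the fields on the dataclass lets the loader coerce `max_steps: 100` to an `int` and `beta: 0.1` to a `float` without per-key code; note that the effective batch size is `batch_size * gradient_accumulation_steps` (8 in the sample config).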
| Category | Metric | Default Threshold |
|---|---|---|
| Performance | Step time increase | 10% |
| Performance | Tokens/sec drop | 8% |
| Performance | Memory increase | 500MB or 20% |
| Stability | NaN steps | 0 allowed |
| Stability | Inf steps | 0 allowed |
| Stability | Loss divergence | Auto-detected |
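Applying the default thresholds from the table amounts to a handful of percent-change comparisons. A hypothetical sketch (the metric dict keys are illustrative, not the actual `metrics.json` schema, and the "500MB or 20%" memory rule is given one plausible reading):

```python
def pct_change(baseline, current):
    """Signed percent change from baseline to current."""
    return (current - baseline) / baseline * 100.0

def check_regressions(baseline, current):
    """Apply the default thresholds from the table above; return failures."""
    failures = []
    if current["nan_steps"] > 0:
        failures.append("nan_steps > 0")
    if pct_change(baseline["step_time_mean"], current["step_time_mean"]) > 10.0:
        failures.append("step time increased > 10%")
    if pct_change(baseline["tokens_per_sec"], current["tokens_per_sec"]) < -8.0:
        failures.append("tokens/sec dropped > 8%")
    mem_delta = current["max_memory_mb"] - baseline["max_memory_mb"]
    mem_pct = pct_change(baseline["max_memory_mb"], current["max_memory_mb"])
    # One reading of "500MB or 20%": either limit being exceeded fails the check.
    if mem_delta > 500 or mem_pct > 20.0:
        failures.append("memory increased > 500MB or > 20%")
    return failures
```

An empty return corresponds to a PASS status; each entry maps to a failed row in the comparison report.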
An example comparison report:

```markdown
# RLHF Canary Report ✅

## Summary

**Status:** PASS
**Baseline Run:** `dpo_smoke_1234_abcd`
**Current Run:** `dpo_smoke_5678_efgh`

## Regression Checks

| Check | Status | Baseline | Current | Change | Threshold |
|-------|--------|----------|---------|--------|-----------|
| nan_steps | ✅ | 0 | 0 | - | 0 |
| step_time_mean | ✅ | 0.4523s | 0.4601s | +1.7% | 10% |
| tokens_per_sec | ✅ | 1847 | 1820 | -1.5% | 8% |
| max_memory | ✅ | 2341MB | 2356MB | +0.6% | 500MB |
```
When regressions are detected, the canary provides heuristic analysis:
```markdown
## Root Cause Analysis

**Summary:** Most likely cause: dataloader (Dataloader or preprocessing bottleneck)

### #1 Dataloader (████████░░ 70%)

Dataloader or preprocessing bottleneck

**Evidence:**
- Step time increased by 25.0%
- Large increases often indicate CPU-side bottlenecks

**Suggested Actions:**
- Check dataloader num_workers configuration
- Profile CPU utilization during training
- Check for tokenization changes
```

Add the workflow to your repo:
```yaml
# .github/workflows/canary.yml
name: RLHF Canary

on:
  pull_request:
    branches: [main]

jobs:
  canary:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - run: canary run configs/dpo_smoke.yaml
      - run: canary compare ./canary_output/*/metrics.json ./baselines/baseline.json
```

Project layout:

```
rlhf-canary/
├── canary/
│   ├── cli.py               # CLI interface
│   ├── runner/              # Job execution
│   │   ├── base.py          # Base runner
│   │   └── local.py         # Local runner
│   ├── collect/             # Metrics collection
│   │   ├── metrics.py       # Training callbacks
│   │   ├── profiler.py      # PyTorch profiler
│   │   └── env_fingerprint.py
│   ├── compare/             # Regression detection
│   │   ├── stats.py         # Statistical comparison
│   │   ├── thresholds.py    # Configurable thresholds
│   │   └── heuristics.py    # Root cause analysis
│   └── report/              # Output generation
│       ├── markdown.py      # Markdown reports
│       └── github.py        # GitHub integration
├── configs/                 # Sample configurations
├── tests/                   # Unit tests
└── workflows/               # CI/CD workflows
```
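The logic in `compare/heuristics.py` could plausibly work by scoring candidate causes against observed evidence and sorting by confidence. A purely hypothetical sketch (cause names, weights, and cutoffs are invented for illustration):

```python
def rank_causes(baseline_step_time, current_step_time, gpu_util):
    """Rank likely causes of a step-time regression by a crude score.

    Returns (cause, confidence, description) tuples, highest score first.
    """
    increase = (current_step_time - baseline_step_time) / baseline_step_time
    candidates = []
    if increase > 0.15 and gpu_util < 0.6:
        # Large slowdown while the GPU sits idle points at CPU-side work.
        candidates.append(("dataloader", 0.7, "Dataloader or preprocessing bottleneck"))
    if increase > 0.05 and gpu_util >= 0.6:
        # Slowdown with a busy GPU suggests slower kernels (driver/library change).
        candidates.append(("kernel", 0.5, "Slower GPU kernels"))
    if not candidates:
        candidates.append(("unknown", 0.2, "No strong signal"))
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```

The 25% step-time increase with low GPU utilization from the earlier example would land in the first branch, matching the "dataloader at 70%" diagnosis shown in the sample report.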
```bash
# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=canary --cov-report=html

# Lint
ruff check canary/
```

Run smoke tests on every PR to catch obvious regressions:

```bash
canary run configs/dpo_smoke.yaml
canary compare ./current/metrics.json ./baselines/main.json --threshold-tier smoke
```

Run longer tests overnight to catch "slowdown after N steps" regressions:

```bash
canary run configs/dpo_perf.yaml
canary compare ./current/metrics.json ./baselines/main.json --threshold-tier nightly
```

When changing model architecture, validate training stability:

```bash
canary run configs/dpo_perf.yaml
# Check for NaN, loss divergence, memory changes
```

License: MIT