This folder contains examples of reward model (RM) training and Direct Preference
Optimization (DPO) training on the Anthropic/hh-rlhf dataset. Both approaches
consume the same chosen/rejected preference pairs but differ in how they use the
preference signal.
| Method | Output | Use Case |
|---|---|---|
| RM | A scalar reward model | Scores responses; used as reward in PPO/GRPO/RL |
| DPO | A preference-aligned policy | Directly aligned model, no separate RM needed |
Reward modeling is a crucial step in aligning language models with human preferences. The goal is to train a model that scores responses based on human preferences, which can then guide policy optimization in reinforcement learning.
The Bradley-Terry reward modeling loss can be expressed as

$$ \mathcal{L}_{\mathrm{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right] $$

where $r_\phi(x, y)$ is the scalar reward the model assigns to response $y$ for prompt $x$, $y_w$ and $y_l$ are the chosen and rejected responses, and $\sigma$ is the sigmoid function.
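To make the objective concrete, here is a minimal plain-Python sketch of the Bradley-Terry loss on a batch of scalar rewards (illustrative only; the training script computes these rewards from model forward passes and batches the loss in the framework's tensors):

```python
import math

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean negative log-sigmoid of the reward margin over a batch
    of (chosen, rejected) scalar reward pairs."""
    losses = [
        -math.log(1.0 / (1.0 + math.exp(-(rc - rr))))
        for rc, rr in zip(chosen_rewards, rejected_rewards)
    ]
    return sum(losses) / len(losses)

# With zero margin the loss is ln 2; a positive margin drives it toward 0.
print(bradley_terry_loss([0.0, 1.0], [0.0, 1.0]))  # → 0.6931...
```

The loss only depends on the margin between the two rewards, which is why a trained RM's absolute scores are meaningful only relative to each other.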
Launch the reward model training process using the provided configuration.
```bash
python3 examples/alignment/hhrlhf_rw.py \
  --config examples/alignment/hhrlhf_rw.yaml \
  experiment_name=hhrlhf-rw \
  trial_name=trial1 \
  actor.path=Qwen/Qwen2.5-7B \
  train_dataset.path=Anthropic/hh-rlhf \
  valid_dataset.path=Anthropic/hh-rlhf \
  scheduler.type=local \
  stats_logger.wandb.mode=online # Set to 'disabled' if you don't use Weights & Biases
```

Direct Preference Optimization (Rafailov et al., 2023) aligns a language model to human
preferences without training a separate reward model. Instead, it directly optimizes
the policy to assign higher probability to human-preferred (chosen) responses than to
dispreferred (rejected) ones, using a closed-form contrastive loss on log-probability
ratios between the trainable policy and a frozen reference model.
The DPO loss is defined as
$$ \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right] $$
where $(x, y_w, y_l)$ is a preference triple of prompt, chosen response, and rejected response, $\pi_\theta$ is the trainable policy, $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\beta$ controls the strength of the implicit KL penalty, and $\sigma$ is the sigmoid function.

AReaL's DPO implementation computes reference log-probabilities on the fly with a frozen reference model (the ref: field in YAML), mirroring the ref-model pattern used by PPO and GRPO. No pre-computed reference logprobs need to be stored on disk.
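The per-pair objective can be sketched in plain Python on summed sequence log-probabilities (a minimal sketch; the actual implementation computes these quantities from batched model forward passes):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probs.

    Returns (loss, chosen_reward, rejected_reward); the "rewards" are the
    beta-scaled log-ratios typically logged as training metrics."""
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return loss, chosen_reward, rejected_reward

# At initialization the policy equals the reference, so both rewards are
# zero and the loss is ln 2 ≈ 0.693.
loss, _, _ = dpo_loss(-42.0, -57.0, -42.0, -57.0)
print(round(loss, 4))  # → 0.6931
```

Note that the loss never needs an explicit reward model: the beta-scaled log-ratio plays the role of the reward.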
Launch DPO training using the provided configuration:

```bash
python3 examples/alignment/hhrlhf_dpo.py \
  --config examples/alignment/hhrlhf_dpo.yaml \
  experiment_name=hhrlhf-dpo \
  trial_name=trial1 \
  actor.path=Qwen/Qwen2.5-7B \
  ref.path=Qwen/Qwen2.5-7B \
  train_dataset.path=Anthropic/hh-rlhf \
  valid_dataset.path=Anthropic/hh-rlhf \
  scheduler.type=local \
  stats_logger.wandb.mode=online # Set to 'disabled' if you don't use Weights & Biases
```

Supported loss_type values: sigmoid (original DPO, Rafailov et al. 2023) and ipo
(Azar et al. 2023, per-token-averaged squared-loss variant).
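The two variants differ only in the function applied to the policy/reference log-ratio difference. A minimal sketch, assuming h denotes the unscaled chosen-minus-rejected log-ratio difference and that the per-token averaging for ipo is handled upstream:

```python
import math

def preference_loss(h, loss_type="sigmoid", beta=0.1):
    """Pairwise preference loss on h, the (unscaled) difference of
    policy/reference log-ratios between chosen and rejected responses."""
    if loss_type == "sigmoid":  # original DPO: -log sigmoid(beta * h)
        return -math.log(1.0 / (1.0 + math.exp(-beta * h)))
    if loss_type == "ipo":      # IPO: squared distance of h from 1/(2*beta)
        return (h - 1.0 / (2.0 * beta)) ** 2
    raise ValueError(f"unknown loss_type: {loss_type}")

print(round(preference_loss(0.0, "sigmoid"), 4))  # → 0.6931
print(preference_loss(5.0, "ipo"))                # → 0.0 (h at the IPO target)
```

The practical difference: the sigmoid loss keeps pushing the margin toward infinity, while ipo regresses it to a fixed target of 1/(2*beta), which limits over-optimization on the preference data.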
For best alignment quality, the recommended pipeline is Base → SFT → DPO, using the SFT checkpoint for both the actor and the reference model. The example above trains DPO directly from the base model to keep the verification setup minimal.
Training Qwen2.5-7B-Base on Anthropic/hh-rlhf for one epoch (no SFT warmup) reproduces
the canonical DPO signature from the original paper: loss starts near ln 2 ≈ 0.693,
reward_accuracy rises from 0.50 to ~0.70, reward_margin grows monotonically, and
rejected_reward decreases faster than chosen_reward.


