Official implementation of PCPO, a novel reinforcement learning approach for aligning diffusion/flow models with human preferences.
PCPO (Proportionate Credit Policy Optimization) improves upon GRPO in two ways: (1) a log-hinge loss and (2) proportionate credit assignment.
This repository contains implementations for Stable Diffusion (SD1.5, SD3.5-M) and FLUX.
PCPO builds upon and extends several foundational works: DDPO, DanceGRPO, and Flow-GRPO.
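
For context, the GRPO baseline that PCPO modifies scores each sample relative to the other samples generated for the same prompt. The sketch below shows only that group-relative standardization; it does not implement PCPO's log-hinge loss or proportionate credit assignment, which are defined in the paper and in this repository's code. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration (implementations differ on these details).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples drawn for the same prompt.

    Hypothetical helper: `eps` guards against a zero standard deviation when
    every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not taken from this repo
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four images generated for one prompt, each scored by a reward model.
    advantages = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
    print([round(a, 3) for a in advantages])
```

Samples that score above the group mean get a positive advantage and samples below it get a negative one, so the policy update depends on relative rather than absolute reward.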
- ✅ Various Backbones: Train SD1.5 (ddpo, dancegrpo), FLUX (dancegrpo), and SD3.5-M (flowgrpo)
- ✅ Various Reward Models: Support for Aesthetic Score & BERTScore (ddpo), HPSv2.1, CLIPScore (dancegrpo), PickScore, OCR (flowgrpo)
- ✅ Efficient Training: Preprocess SD3.5-M embeddings beforehand, so that training can run on GPUs with 24GB VRAM
- Python 3.10 (recommended)
- CUDA 12.6+ (recommended)
- GPUs with 40GB+ VRAM (full fine-tuning) or 24GB+ VRAM (LoRA fine-tuning)
  - dancegrpo requires 8 x 40GB GPUs for full fine-tuning (SD1.x, FLUX), or 8 x 24GB GPUs for LoRA fine-tuning (FLUX).
  - ddpo and flowgrpo can run on 1 x 24GB GPU.
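
The hardware requirements above can be encoded as a small lookup to check which implementations fit a given machine. This is a minimal sketch: the `REQUIREMENTS` table and `runnable_setups` function are hypothetical helpers that are not part of this repository, and the numbers are transcribed from the list above.

```python
# (implementation, setup, minimum GPU count, minimum VRAM per GPU in GB),
# transcribed from the requirements list. A setup of None means the list
# does not name a specific fine-tuning mode for that implementation.
REQUIREMENTS = [
    ("dancegrpo", "full fine-tuning (SD1.x, FLUX)", 8, 40),
    ("dancegrpo", "LoRA fine-tuning (FLUX)", 8, 24),
    ("ddpo", None, 1, 24),
    ("flowgrpo", None, 1, 24),
]


def runnable_setups(num_gpus, vram_gb):
    """Return the (implementation, setup) pairs that fit the given hardware."""
    return [
        (impl, setup)
        for impl, setup, min_gpus, min_vram in REQUIREMENTS
        if num_gpus >= min_gpus and vram_gb >= min_vram
    ]


if __name__ == "__main__":
    # Example: a single 24GB GPU.
    for impl, setup in runnable_setups(num_gpus=1, vram_gb=24):
        print(impl, setup or "")
```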
Clone the repository:
git clone https://github.com/jaylee2000/pcpo.git
cd pcpo

Each implementation has its own configuration and training procedure. Refer to the respective README files for detailed instructions:
- SD1.5 / FLUX (DanceGRPO): See `dancegrpo/README.md` for configuration, checkpoint downloads, and training scripts
- SD3.5-M (Flow-GRPO): See `flowgrpo/README.md` for embedding preprocessing and training with PCPO/GRPO
- DDPO Baseline: See `ddpo/ddpo-main/README.md` for DDPO training and configuration
pcpo/
├── dancegrpo/                    # DanceGRPO-based (SD1.x & FLUX) implementations
│   ├── fastvideo/                # Core training and inference code
│   ├── scripts/                  # Training and preprocessing scripts
│   └── assets/                   # Prompts and datasets
├── flowgrpo/                     # FlowGRPO-based (SD3.5) implementation
│   ├── flow_grpo/                # Core implementation
│   ├── config/                   # Training configurations
│   ├── scripts/                  # Training and preprocessing scripts
│   └── dataset/                  # Dataset utilities
└── ddpo/
    ├── ddpo-main/                # DDPO-based (SD1.x) implementation
    │   ├── diffusion/            # Core implementation
    │   ├── diffusion_doublerg/   # Implicit Reward Guidance (IRG) implementation (Appendix F)
    │   ├── configs/              # Training and inference configurations
    │   └── utils/                # Helper functions, prompts & rewards
    └── qwen-server/              # Server-based BERTScore reward model
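
After cloning, a quick check that the checkout matches the layout above can catch an incomplete download before training starts. The sketch below is a hypothetical helper that is not part of this repository; the path lists cover only a subset of the tree and the README locations named earlier.

```python
from pathlib import Path

# Directories and READMEs read off the tree and the per-implementation
# instructions above (a subset, chosen for illustration).
EXPECTED_DIRS = [
    "dancegrpo/fastvideo",
    "dancegrpo/scripts",
    "flowgrpo/flow_grpo",
    "flowgrpo/config",
    "ddpo/ddpo-main",
]
EXPECTED_READMES = [
    "dancegrpo/README.md",
    "flowgrpo/README.md",
    "ddpo/ddpo-main/README.md",
]


def missing_paths(repo_root):
    """Return the expected directories and READMEs absent under repo_root."""
    root = Path(repo_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_READMES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Run from the repository root; an empty list means nothing is missing.
    print(missing_paths("."))
```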
If you find this work useful for your research, please cite:
@article{pcpo2025,
title={PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models},
author={Lee, Jeongjae and Ye, Jong Chul},
journal={arXiv preprint arXiv:2509.25774},
year={2025}
}

@article{black2023ddpo,
title={Training Diffusion Models with Reinforcement Learning},
author={Black, Kevin and Janner, Michael and Du, Yilun and Kostrikov, Ilya and Levine, Sergey},
journal={arXiv preprint arXiv:2305.13301},
year={2023}
}
@article{xue2025dancegrpo,
title={DanceGRPO: Unleashing GRPO on Visual Generation},
author={Xue, Zeyue and Wu, Jie and Gao, Yu and Kong, Fangyuan and Zhu, Lingting and Chen, Mengzhao and Liu, Zhiheng and Liu, Wei and Guo, Qiushan and Huang, Weilin and others},
journal={arXiv preprint arXiv:2505.07818},
year={2025}
}
@article{liu2025flow,
title={Flow-GRPO: Training Flow Matching Models via Online RL},
author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},
journal={arXiv preprint arXiv:2505.05470},
year={2025}
}

This project is licensed under the Apache License 2.0.
This work builds upon several excellent open-source projects: DDPO, DanceGRPO, and Flow-GRPO.
For questions and discussions, please open an issue or contact jaysquirrel2000@gmail.com.
- [2025.12.06]: 🔥 Code made public!
- [2025.11.24]: 🔥 Code released on GitHub!
- [2025.09.30]: 🔥 Paper released on arXiv!