arxiv:2606.11025

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Published on Jun 9

Submitted by Tianyu Pang on Jun 10

Tencent-Hunyuan-Multimodal-RL



Abstract

Flow-DPPO replaces ratio clipping with divergence proximal constraints in flow matching models, improving training stability and multi-objective optimization through exact KL divergence computation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
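The two ingredients named in the abstract, a closed-form Gaussian KL and an asymmetric divergence mask, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the old and new per-step policies are isotropic Gaussians sharing the same standard deviation `sigma`, and that "moving away from the trusted region" means the advantage sign pushes the probability ratio further from 1. All function names and the threshold `delta` are hypothetical.

```python
import numpy as np

def gaussian_kl_shared_std(mu_old, mu_new, sigma):
    """Exact KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)) per sample.

    Assumption: both policies share the same isotropic std `sigma`,
    so the KL reduces to ||mu_old - mu_new||^2 / (2 sigma^2).
    """
    diff = np.asarray(mu_old, dtype=float) - np.asarray(mu_new, dtype=float)
    return np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2)

def gaussian_log_prob(x, mu, sigma):
    """Log-density of N(mu, sigma^2 I) at x, summed over the last axis."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    d = x.shape[-1]
    sq = np.sum((x - mu) ** 2, axis=-1) / (2.0 * sigma ** 2)
    return -sq - d * np.log(sigma) - 0.5 * d * np.log(2.0 * np.pi)

def asymmetric_divergence_mask(log_ratio, advantage, kl, delta):
    """Return 0 where the update is blocked, 1 elsewhere.

    Hypothetical reading of the abstract: block a sample only if
    (a) the update would push the ratio further from 1, and
    (b) the exact KL already exceeds the threshold `delta`.
    """
    moving_away = ((advantage > 0) & (log_ratio > 0)) | (
        (advantage < 0) & (log_ratio < 0)
    )
    blocked = moving_away & (kl > delta)
    return (~blocked).astype(float)

def masked_surrogate_loss(x, mu_old, mu_new, sigma, advantage, delta):
    """Policy-gradient surrogate with the divergence mask applied.

    In an autodiff framework the mask would be treated as a constant
    (no gradient flows through it); NumPy is used here only to show
    the arithmetic.
    """
    log_ratio = gaussian_log_prob(x, mu_new, sigma) - gaussian_log_prob(
        x, mu_old, sigma
    )
    kl = gaussian_kl_shared_std(mu_old, mu_new, sigma)
    mask = asymmetric_divergence_mask(log_ratio, advantage, kl, delta)
    loss = -np.mean(mask * np.exp(log_ratio) * advantage)
    return loss, mask, kl

# Toy batch of three samples in two dimensions (made-up numbers).
x = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
mu_old = np.zeros((3, 2))
mu_new = np.array([[1.0, 0.0], [1.0, 0.0], [0.1, 0.0]])
advantage = np.array([1.0, -1.0, 1.0])
loss, mask, kl = masked_surrogate_loss(
    x, mu_old, mu_new, sigma=1.0, advantage=advantage, delta=0.1
)
```

In this toy batch only the first sample is masked: its ratio is already above 1, its advantage is positive, and its KL exceeds `delta`. The second sample has the same KL but a negative advantage, so its update moves the ratio back toward 1 and is left unblocked. That is the asymmetry a symmetric KL cutoff would lack.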


Community

P2333 (paper submitter), about 10 hours ago, edited about 10 hours ago

smithcohn12, about 6 hours ago

I've experimented with PPO-based flow model training before, and replacing noisy ratio clipping with exact KL-based divergence constraints seems like a much more stable and efficient approach, especially for multi-objective optimization and longer training runs.
https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO