SD3.5 GenEval2 Multi-Reward (Flow-DPPO)

A LoRA adapter for stabilityai/stable-diffusion-3.5-medium, fine-tuned with Flow-DPPO on GenEval2 in the multi-reward setting, where multiple rewards are aggregated (GDPO with equal reward weights) to improve compositional alignment while mitigating reward hacking and catastrophic forgetting.
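As a rough illustration of what equal-weight multi-reward aggregation can look like, the sketch below normalizes each reward separately within a group of samples and then averages the normalized values. This is only one reading of GDPO-style decoupled normalization; the function names, the `1/n` weighting, and the `eps` constant are assumptions for illustration and are not taken from the Flow-DPPO code.

```python
from statistics import mean, pstdev

def group_normalize(values, eps=1e-6):
    """Standardize one reward over a group of samples (zero mean, unit std)."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / (s + eps) for v in values]

def aggregate_advantages(rewards_per_objective, weights=None):
    """Combine several rewards into one advantage per sample.

    rewards_per_objective: one inner list per reward, all over the same samples.
    weights: hypothetical per-reward weights; defaults to equal weights (assumption).
    """
    n = len(rewards_per_objective)
    if weights is None:
        weights = [1.0 / n] * n
    # Normalize each reward on its own so that differences in scale
    # do not let one objective dominate the combined signal.
    normalized = [group_normalize(r) for r in rewards_per_objective]
    group_size = len(normalized[0])
    return [
        sum(w * adv[i] for w, adv in zip(weights, normalized))
        for i in range(group_size)
    ]

# Two rewards on very different scales contribute equally after normalization.
advantages = aggregate_advantages([[0.0, 1.0], [0.0, 10.0]])
```

Normalizing before summing is what keeps a large-scale reward from swamping a small-scale one, which is the usual motivation for this style of aggregation.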

Flow-DPPO

Flow-DPPO (Flow Divergence Proximal Policy Optimization) is an online reinforcement learning method for flow-matching image and video generators. Existing methods such as Flow-GRPO and Flow-CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. Flow-DPPO argues that ratio clipping is a noisy, single-sample proxy for the true policy divergence, and that it therefore over-constrains some parts of the trajectory while under-constraining others.
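For reference, the PPO-style ratio clipping mentioned above can be sketched for a single sample as follows. This is the standard PPO clipped surrogate, not code from Flow-GRPO or Flow-CPS; the function names and the default `eps=0.2` are illustrative assumptions.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def clip_blocks_gradient(ratio, advantage, eps=0.2):
    """True when the clipped branch is selected, so the sample gives no gradient."""
    return (advantage > 0 and ratio > 1.0 + eps) or (
        advantage < 0 and ratio < 1.0 - eps
    )

# The decision depends only on the likelihood ratio of one sampled action,
# which is why it acts as a single-sample proxy for the policy divergence.
blocked = clip_blocks_gradient(ratio=1.5, advantage=1.0)
```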

Because the per-step policy of a flow model is Gaussian, the KL divergence between the old and new policies can be computed exactly and cheaply. Flow-DPPO replaces ratio clipping with a divergence-proximal constraint, implemented as an asymmetric divergence mask: a gradient update is blocked only when (1) the advantage and ratio indicate the update is moving the policy away from the old policy, and (2) the exact KL already exceeds a threshold. Updates that move back toward the old policy are never blocked, accelerating recovery from overshooting.
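The two conditions above can be sketched per sample as shown below. This is a minimal illustration under stated assumptions, not the released implementation: it assumes the old and new per-step policies are isotropic Gaussians that share the same standard deviation `sigma` (so the KL reduces to a scaled squared distance between means), and the names `gaussian_kl`, `divergence_mask`, `masked_surrogate`, and `kl_threshold` are hypothetical.

```python
def gaussian_kl(mu_old, mu_new, sigma):
    """KL(old || new) for two isotropic Gaussians with the same std sigma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)

def divergence_mask(ratio, advantage, kl, kl_threshold):
    """Return 0.0 to block the update and 1.0 to keep it."""
    # Condition (1): the update pushes the ratio further from 1,
    # that is, away from the old policy.
    moving_away = (advantage > 0 and ratio > 1.0) or (
        advantage < 0 and ratio < 1.0
    )
    # Condition (2): the exact KL already exceeds the threshold.
    too_far = kl > kl_threshold
    return 0.0 if (moving_away and too_far) else 1.0

def masked_surrogate(ratio, advantage, kl, kl_threshold):
    """Policy-gradient surrogate with the asymmetric mask applied."""
    return divergence_mask(ratio, advantage, kl, kl_threshold) * ratio * advantage

kl = gaussian_kl(mu_old=[0.0, 0.0], mu_new=[1.0, 1.0], sigma=1.0)
blocked = divergence_mask(ratio=1.3, advantage=1.0, kl=kl, kl_threshold=0.1)
allowed = divergence_mask(ratio=0.7, advantage=1.0, kl=kl, kl_threshold=0.1)
```

In the last line the KL is above the threshold, but a positive advantage with a ratio below 1 pushes the policy back toward the old one, so the update is kept. That asymmetry is what distinguishes the mask from a symmetric KL cutoff.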

According to the paper, this yields higher reward, better KL-proximal efficiency, stronger robustness to catastrophic forgetting, more balanced multi-objective optimization, and stable multi-epoch training.

Paper: https://huggingface.co/papers/2606.11025
Code: https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO
Trained with the Flow-Factory RL framework.

Usage

import torch
from diffusers import StableDiffusion3Pipeline
from peft import PeftModel

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.bfloat16,
)

# Load the Flow-DPPO LoRA adapter
pipe.transformer = PeftModel.from_pretrained(
    pipe.transformer,
    "Tencent-Hunyuan-Multimodal-RL/SD3.5-GenEval2-Multi-Reward",
    torch_dtype=torch.bfloat16,
)

pipe.enable_model_cpu_offload()  # remove and call pipe.to("cuda") if you have enough VRAM

prompt = "four white cats are behind a red bagel"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=4.5,
    num_inference_steps=40,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("output.png")

Citation

@article{ping2026flowdppo,
  title={Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models},
  author={Ping, Bowen and Zhou, Xiangxin and Qi, Penghui and Luo, Minnan and Bo, Liefeng and Pang, Tianyu},
  journal={arXiv preprint arXiv:2606.11025},
  year={2026}
}