How to use from the Diffusers library
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-3.5-medium", dtype=torch.bfloat16, device_map="cuda")
pipe.load_lora_weights("Tencent-Hunyuan-Multimodal-RL/SD3.5-GenEval2-Multi-Reward")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]


SD3.5 GenEval2 Multi-Reward (Flow-DPPO)

A LoRA adapter for stabilityai/stable-diffusion-3.5-medium, fine-tuned with Flow-DPPO on GenEval2 in the multi-reward setting, where multiple rewards are aggregated (GDPO with equal reward weights) to improve compositional alignment while mitigating reward hacking and catastrophic forgetting.

Flow-DPPO

Flow-DPPO (Flow Divergence Proximal Policy Optimization) is an online reinforcement learning method for flow-matching image/video generators. Methods such as Flow-GRPO and Flow-CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. Flow-DPPO argues that ratio clipping is a noisy, single-sample proxy for the true policy divergence, which over-constrains some parts of the trajectory and under-constrains others.
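For contrast, the ratio clipping that PPO-style methods rely on can be sketched as follows. This is the textbook PPO clipped surrogate for a single sample, not code taken from Flow-GRPO or Flow-CPS; the function name and the clip range `eps=0.2` are illustrative assumptions.

```python
def ppo_clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Textbook PPO clipped objective for one sample (a quantity to maximize).

    ratio     : pi_new(a|s) / pi_old(a|s) evaluated at the single sampled action
    advantage : advantage estimate for that sample
    eps       : clip range (hypothetical default, not from the model card)
    """
    # Restrict the ratio to the interval [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes any incentive to push the ratio further
    # outside the interval in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)
```

The trust region here depends only on the ratio at one sampled action, so the same clip range is applied no matter how far the two distributions have actually drifted apart at that step.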

Because the per-step policy of a flow model is Gaussian, the KL divergence between the old and new policies can be computed exactly and cheaply. Flow-DPPO replaces ratio clipping with a divergence-proximal constraint, implemented as an asymmetric divergence mask: a gradient update is blocked only when (1) the advantage and ratio indicate the update is moving the policy away from the old policy, and (2) the exact KL already exceeds a threshold. Updates that move back toward the old policy are never blocked, accelerating recovery from overshooting.
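The two conditions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the old and new per-step policies are isotropic Gaussians sharing one standard deviation `sigma`, it computes KL(old || new), and it treats the mask as a hard 0/1 gate. The function names and the `kl_threshold` argument are hypothetical.

```python
from typing import Sequence


def gaussian_kl(mu_old: Sequence[float], mu_new: Sequence[float], sigma: float) -> float:
    """Exact KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)).

    With a shared isotropic covariance (an assumption of this sketch) the KL
    reduces to the squared distance between the means over 2 * sigma^2.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def divergence_mask(advantage: float, ratio: float, kl: float, kl_threshold: float) -> float:
    """Return 0.0 to block the gradient for this sample, 1.0 to keep it."""
    # Condition (1): following the gradient would push the ratio further
    # from 1, i.e. further away from the old policy.
    moving_away = (advantage > 0 and ratio > 1.0) or (advantage < 0 and ratio < 1.0)
    # Condition (2): the exact KL is already above the threshold.
    too_far = kl > kl_threshold
    # Updates that move back toward the old policy are never blocked.
    return 0.0 if (moving_away and too_far) else 1.0


def masked_surrogate(advantage: float, ratio: float, kl: float, kl_threshold: float) -> float:
    """Unclipped policy-gradient surrogate, gated by the divergence mask."""
    return divergence_mask(advantage, ratio, kl, kl_threshold) * ratio * advantage
```

In a real training loop the mask would multiply per-sample loss terms inside an autograd framework so that blocked samples contribute no gradient; the scalar form here only shows the gating logic.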

This yields higher reward, better KL-proximal efficiency, stronger robustness to catastrophic forgetting, balanced multi-objective optimization, and stable multi-epoch training.

Usage

import torch
from diffusers import StableDiffusion3Pipeline
from peft import PeftModel

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.bfloat16,
)

# Load the Flow-DPPO LoRA adapter
pipe.transformer = PeftModel.from_pretrained(
    pipe.transformer,
    "Tencent-Hunyuan-Multimodal-RL/SD3.5-GenEval2-Multi-Reward",
    torch_dtype=torch.bfloat16,
)

pipe.enable_model_cpu_offload()  # remove and call pipe.to("cuda") if you have enough VRAM

prompt = "four white cats are behind a red bagel"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=4.5,
    num_inference_steps=40,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("output.png")

Citation

@article{ping2026flowdppo,
  title={Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models},
  author={Ping, Bowen and Zhou, Xiangxin and Qi, Penghui and Luo, Minnan and Bo, Liefeng and Pang, Tianyu},
  journal={arXiv preprint arXiv:2606.11025},
  year={2026}
}

Framework versions

  • PEFT 0.19.1