SD3.5 GenEval2 Multi-Reward (Flow-DPPO)
A LoRA adapter for stabilityai/stable-diffusion-3.5-medium, fine-tuned with Flow-DPPO on GenEval2 in the multi-reward setting. Multiple rewards are aggregated using GDPO with equal reward weights, with the goal of improving compositional alignment while mitigating reward hacking and catastrophic forgetting.
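To make the equal-weight aggregation concrete, here is a minimal sketch. It assumes that GDPO refers to normalizing each reward separately within a sampling group before combining them; this card does not spell out the procedure, so treat that reading as an assumption. The reward names and the helper functions are hypothetical and are not taken from the Flow-DPPO code.

```python
from statistics import mean, pstdev


def normalize_within_group(values, eps=1e-6):
    """Standardize one reward across the samples of a single prompt group."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / (s + eps) for v in values]


def aggregate_advantages(rewards_per_objective, weights=None):
    """Combine per-objective rewards into one advantage per sample.

    Assumption: each reward is normalized on its own first, so a reward with
    a large raw scale cannot dominate the weighted sum.
    """
    names = list(rewards_per_objective)
    if weights is None:
        # Equal reward weights, as stated for this adapter.
        weights = {name: 1.0 / len(names) for name in names}
    normalized = {n: normalize_within_group(rewards_per_objective[n]) for n in names}
    group_size = len(normalized[names[0]])
    return [sum(weights[n] * normalized[n][i] for n in names) for i in range(group_size)]


# Hypothetical rewards for a group of two samples; the scales differ by 20x.
advantages = aggregate_advantages({
    "alignment_score": [0.0, 1.0],
    "aesthetic_score": [10.0, 30.0],
})
```

In this toy group both rewards prefer the second sample, and after per-reward normalization each contributes about equally to the combined advantage despite the difference in raw scale.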
Flow-DPPO
Flow-DPPO (Flow Divergence Proximal Policy Optimization) is an online reinforcement learning method for flow-matching image/video generators. Methods such as Flow-GRPO and Flow-CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. Flow-DPPO argues that ratio clipping is a noisy, single-sample proxy for the true policy divergence, which over-constrains some parts of the trajectory and under-constrains others.
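For reference, the ratio clipping mentioned above can be sketched with the standard PPO clipped surrogate. This is the textbook form, not necessarily the exact objective used by Flow-GRPO or Flow-CPS; the function name and the `clip_eps` default are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s), evaluated on a single sampled action.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, ratio above 1 + clip_eps: the clipped term is active,
# so this sample contributes no gradient with respect to the ratio.
capped = clipped_surrogate(1.5, 1.0)

# Positive advantage, ratio below 1: the unclipped term is active.
uncapped = clipped_surrogate(0.5, 1.0)
```

The decision to cut the gradient depends only on the ratio of one sampled action, which is the sense in which the text calls clipping a single-sample proxy for the policy divergence.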
Because the per-step policy of a flow model is Gaussian, the KL divergence between the old and new policies can be computed exactly and cheaply. Flow-DPPO replaces ratio clipping with a divergence-proximal constraint, implemented as an asymmetric divergence mask: a gradient update is blocked only when (1) the advantage and ratio indicate the update is moving the policy away from the old policy, and (2) the exact KL already exceeds a threshold. Updates that move back toward the old policy are never blocked, accelerating recovery from overshooting.
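The two conditions of the mask can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes the old and new per-step policies are Gaussians sharing the same isotropic standard deviation `sigma`, and it reads "moving away from the old policy" as a positive advantage with ratio above 1, or a negative advantage with ratio below 1. All function names and the threshold value are hypothetical.

```python
def gaussian_kl(mu_old, mu_new, sigma):
    """Exact KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)).

    With a shared isotropic std this reduces to ||mu_old - mu_new||^2 / (2 sigma^2).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def update_is_blocked(advantage, ratio, kl, kl_threshold):
    """Asymmetric mask: block only updates that push further from the old policy
    while the exact KL already exceeds the threshold (assumed interpretation)."""
    moving_away = (advantage > 0 and ratio > 1.0) or (advantage < 0 and ratio < 1.0)
    return moving_away and kl > kl_threshold


def masked_surrogate(advantage, ratio, kl, kl_threshold):
    """Policy-gradient surrogate for one step; zero when the mask blocks it."""
    if update_is_blocked(advantage, ratio, kl, kl_threshold):
        return 0.0
    return ratio * advantage


kl = gaussian_kl([0.0, 0.0], [1.0, 1.0], sigma=1.0)  # squared distance 2 -> KL 1.0

# Same KL, same positive advantage: only the update that moves away is blocked.
away = masked_surrogate(advantage=1.0, ratio=1.2, kl=kl, kl_threshold=0.5)
back = masked_surrogate(advantage=1.0, ratio=0.8, kl=kl, kl_threshold=0.5)
```

Here `away` is masked to zero, while `back` keeps its gradient even though the KL is above the threshold, which corresponds to the recovery behaviour described above.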
This yields higher reward, better KL-proximal efficiency, stronger robustness to catastrophic forgetting, balanced multi-objective optimization, and stable multi-epoch training.
- Paper: https://huggingface.co/papers/2606.11025
- Code: https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO
- Trained with the Flow-Factory RL framework.
Usage
import torch
from diffusers import StableDiffusion3Pipeline
from peft import PeftModel

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.bfloat16,
)

# Load the Flow-DPPO LoRA adapter
pipe.transformer = PeftModel.from_pretrained(
    pipe.transformer,
    "Tencent-Hunyuan-Multimodal-RL/SD3.5-GenEval2-Multi-Reward",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # remove and call pipe.to("cuda") if you have enough VRAM

prompt = "four white cats are behind a red bagel"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=4.5,
    num_inference_steps=40,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("output.png")
Citation
@article{ping2026flowdppo,
  title={Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models},
  author={Ping, Bowen and Zhou, Xiangxin and Qi, Penghui and Luo, Minnan and Bo, Liefeng and Pang, Tianyu},
  journal={arXiv preprint arXiv:2606.11025},
  year={2026}
}
Framework versions
- PEFT 0.19.1