Reinforcing Few-step Generators via Reward-Tilted Distribution Matching
Reward-Tilted DMD · Ambient-Consistent Distillation · Hybrid Policy Gradient
Yushi Huang1,2,*,†, Xiangxin Zhou2,*, Ruoyu Wang2,3,*,†, Chi Zhang3, Jun Zhang1, Tianyu Pang2,‡
1The Hong Kong University of Science and Technology 2Tencent Hunyuan 3Westlake University
* Equal contribution · † Work done during internship at Tencent Hunyuan · ‡ Corresponding author
Abstract
We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution-matching distillation with reward-guided RL for few-step flow generators. Minimizing the KL divergence to a reward-tilted teacher distribution decomposes naturally into a distribution-matching term and a reward-maximization term, instantiated as Ambient-Consistent DMD (AC-DMD) for the cold start and a hybrid policy gradient (SubGRPO + final-step reward back-propagation) for the RL stage. With 4 NFE, RTDMD reaches new SOTA on SD3-M / SD3.5-M / FLUX.2 4B; the distilled FLUX.2 4B even beats the full FLUX.2 9B teacher (50 NFE) on most rewards.
4-step samples from RTDMD-distilled FLUX.2 4B (no classifier-free guidance).
Qualitative comparison for few-step diffusion models (4 NFE).
Method Overview
RTDMD overview. Det. = deterministic final step, Stoc. = stochastic intermediate steps. Trajectories: teacher (blue), few-step generator (green), fake score (yellow).
For the generator $G_\theta$, the reward-tilted KL objective decomposes into a distribution-matching term and a reward-maximization term.
The two terms map directly to the two trainers exposed by the CLI:
| Stage | Trainer | Key knobs |
|---|---|---|
| 1. AC-DMD cold start | ACDMDTrainer (--trainer ac_dmd) | sub-interval renoising, consistency weight γ, CPS sampler η = 0.9 |
| 2. RTDMD RL fine-tune | RTDMDTrainer (--trainer rtdmd) | SubGRPO + final-step BP + AC-DMD |
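The decomposition behind these two stages follows from the definition of a reward-tilted distribution. The sketch below writes it out under assumed notation: $p_\theta$ for the sample distribution of $G_\theta$, $p_{\mathrm{T}}$ for the teacher, $r$ for the reward, $\beta > 0$ for a tilt temperature, and $Z$ for the normalizer. None of these symbols come from this README, and the paper's exact formulation may differ.

```latex
p^{\star}(x) = \frac{1}{Z}\, p_{\mathrm{T}}(x)\, \exp\!\big(r(x)/\beta\big),
\qquad
Z = \mathbb{E}_{x \sim p_{\mathrm{T}}}\!\big[\exp\!\big(r(x)/\beta\big)\big]

\mathrm{KL}\big(p_\theta \,\|\, p^{\star}\big)
= \underbrace{\mathrm{KL}\big(p_\theta \,\|\, p_{\mathrm{T}}\big)}_{\text{distribution matching}}
\;-\; \frac{1}{\beta}\,
  \underbrace{\mathbb{E}_{x \sim p_\theta}\big[r(x)\big]}_{\text{reward maximization}}
\;+\; \log Z
```

Because $\log Z$ does not depend on $\theta$, minimizing the left-hand side trades off staying close to the teacher against raising the expected reward.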
Contents
This repository hosts the 4-NFE LoRA checkpoints distilled from Stable Diffusion 3.5 Medium with RTDMD.
.
├── cold_start/
│   └── generator_ema.pt   # Stage-1 AC-DMD LoRA (4 NFE base)
└── rtdmd/
    └── generator_ema.pt   # Stage-2 RTDMD LoRA (stacked on top of cold_start)
Each generator_ema.pt is a torch.save-d state_dict containing only LoRA
adapter keys (lora_A / lora_B, rank 32, alpha 64). The two adapters
are designed to be stacked: the cold-start LoRA distills SD3.5-M down to
4 NFE, and the RTDMD LoRA further fine-tunes that distilled model with
reward-tilted RL.
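To make the rank and alpha values concrete, here is a minimal sketch of the weight update a single LoRA adapter contributes under the standard LoRA convention, where the update is scaled by alpha / rank (64 / 32 = 2 here). The function name `lora_delta` and the toy matrices are illustrative assumptions, not part of the RTDMD code; real adapters have inner dimension 32, and the toy example uses 1 only to keep the arithmetic readable.

```python
def lora_delta(lora_a, lora_b, rank=32, alpha=64):
    """Return (alpha / rank) * B @ A as nested lists.

    lora_a has shape (inner, in_features), lora_b has shape (out_features, inner).
    """
    scale = alpha / rank  # 2.0 for the checkpoints described above
    inner = len(lora_a)
    cols = len(lora_a[0])
    return [
        [scale * sum(row_b[k] * lora_a[k][j] for k in range(inner)) for j in range(cols)]
        for row_b in lora_b
    ]


# Toy matrices (hypothetical values): B is 2x1, A is 1x2.
toy_b = [[1.0], [2.0]]
toy_a = [[3.0, 4.0]]
delta = lora_delta(toy_a, toy_b)
print(delta)
```

The effective layer weight is the frozen base weight plus this delta; how the cold-start and RTDMD updates combine in practice is determined by the RTDMD loader, which this sketch does not model.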
Usage
Option 1: RTDMD inference CLI (recommended)
The simplest path is to clone the RTDMD repo and let it stack both LoRAs and run the CPS sampler for you:
git clone https://github.com/Harahan/RTDMD.git && cd RTDMD
pip install -r requirements.txt && pip install -e .
# Download this repo
huggingface-cli download Harahan/SD35M-RTDMD --local-dir ./ckpts/sd35m
# Run 4-NFE inference (single GPU)
python inference.py configs/inference/sd35m.yaml \
--override lora_paths='["./ckpts/sd35m/cold_start/generator_ema.pt","./ckpts/sd35m/rtdmd/generator_ema.pt"]' \
--override eval_reward=false \
--prompt "a cute cat sitting on a windowsill"
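The nested quoting in the `lora_paths` override is easy to get wrong when scripting many runs. As a hedged convenience, the sketch below assembles the same command from Python; `build_inference_command` is a hypothetical helper, not part of the RTDMD repo, and it uses only the flags shown above.

```python
import json
import shlex


def build_inference_command(config, lora_paths, prompt, eval_reward=False):
    """Build the argv list for inference.py using only the flags shown above."""
    # Compact JSON (no spaces) matches the list literal used in the command above.
    lora_arg = "lora_paths=" + json.dumps(list(lora_paths), separators=(",", ":"))
    reward_arg = "eval_reward=" + ("true" if eval_reward else "false")
    return [
        "python", "inference.py", config,
        "--override", lora_arg,
        "--override", reward_arg,
        "--prompt", prompt,
    ]


argv = build_inference_command(
    "configs/inference/sd35m.yaml",
    [
        "./ckpts/sd35m/cold_start/generator_ema.pt",
        "./ckpts/sd35m/rtdmd/generator_ema.pt",
    ],
    "a cute cat sitting on a windowsill",
)
# shlex.join adds the shell quoting needed to paste the command into a terminal.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` directly avoids shell quoting altogether.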
Option 2: Plain diffusers
import torch
from diffusers import StableDiffusion3Pipeline
from peft import LoraConfig
from huggingface_hub import hf_hub_download

base = "stabilityai/stable-diffusion-3.5-medium"
pipe = StableDiffusion3Pipeline.from_pretrained(base, torch_dtype=torch.bfloat16).to("cuda")

# Inject LoRA adapters with the rank/alpha used during training
TARGETS = [
    "to_q", "to_k", "to_v", "to_out.0",
    "add_q_proj", "add_k_proj", "add_v_proj", "to_add_out",
]
pipe.transformer.add_adapter(
    LoraConfig(r=32, lora_alpha=64, target_modules=TARGETS, init_lora_weights="gaussian")
)

# Sequentially load cold-start then RTDMD weights into the same adapter
for ckpt in ["cold_start/generator_ema.pt", "rtdmd/generator_ema.pt"]:
    path = hf_hub_download("Harahan/SD35M-RTDMD", ckpt)
    state = torch.load(path, map_location="cpu", weights_only=False)
    # strict=False because the checkpoints hold only LoRA keys, not base weights
    pipe.transformer.load_state_dict(state, strict=False)

# 4-step sampling without classifier-free guidance; this runs the pipeline's
# default scheduler, not the CPS sampler
pipe(prompt="a cute cat sitting on a windowsill",
     num_inference_steps=4, guidance_scale=1.0).images[0].save("out.png")
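Because `load_state_dict(..., strict=False)` silently skips any checkpoint key the model does not recognize, it can be worth confirming that the LoRA keys actually matched before trusting the output. The sketch below shows one way to do that comparison in plain Python; `split_checkpoint_keys` is a hypothetical helper and the key names are made up for illustration.

```python
def split_checkpoint_keys(model_keys, checkpoint_keys):
    """Partition checkpoint keys by whether the model has a parameter of that name."""
    known = set(model_keys)
    matched = [k for k in checkpoint_keys if k in known]
    ignored = [k for k in checkpoint_keys if k not in known]
    return matched, ignored


# Hypothetical key names for illustration only. In practice, pass
# pipe.transformer.state_dict().keys() and state.keys().
model_keys = [
    "blocks.0.attn.to_q.weight",
    "blocks.0.attn.to_q.lora_A.default.weight",
    "blocks.0.attn.to_q.lora_B.default.weight",
]
checkpoint_keys = [
    "blocks.0.attn.to_q.lora_A.default.weight",
    "blocks.0.attn.to_q.lora_B.weight",  # naming mismatch: would be skipped
]

matched, ignored = split_checkpoint_keys(model_keys, checkpoint_keys)
print(f"matched={len(matched)} ignored={len(ignored)}")
for key in ignored:
    print("not loaded:", key)
```

PyTorch's `load_state_dict` also returns an object with `missing_keys` and `unexpected_keys` fields, which carries the same information without a separate pass.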
Note: RTDMD is trained with the CPS (Coefficients-Preserving Sampling) scheduler with η = 0.9. Using the default Flow-Matching Euler scheduler will still produce reasonable samples at 4 NFE, but the RTDMD inference CLI is the only entry point that reproduces the paper numbers exactly.
Citation
@misc{huang2026reinforcingfewstepgeneratorsrewardtilted,
title={Reinforcing Few-step Generators via Reward-Tilted Distribution Matching},
author={Yushi Huang and Xiangxin Zhou and Ruoyu Wang and Chi Zhang and Jun Zhang and Tianyu Pang},
year={2026},
eprint={2605.26108},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.26108},
}
License
Apache 2.0, same as the upstream RTDMD repo. The base model stabilityai/stable-diffusion-3.5-medium is governed by its own license; please review and comply with it separately.