[中文文档](./README_CN.md)

# HY-Video-PRFL

<div align="center">
  <img src="assets/logo.svg"  height=100>

# ⚡ HY-Video-PRFL: Video Generation Models Are Good Latent Reward Models

</div>

Video generation models can both create and evaluate — we enable 14B models to complete full 720P×81-frame post-training within 67GB of VRAM, achieving a 1.5× speedup and a 56% improvement in motion quality over traditional methods.


<div align="center">
  <a href="https://github.com/Tencent-Hunyuan/HY-Video-PRFL"><img src="https://img.shields.io/static/v1?label=HY-Video-PRFL%20Code&message=Github&color=blue"></a> &ensp;
  <a href="https://hy-video-prfl.github.io/HY-VIDEO-PRFL/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Web&color=green"></a> &ensp;
  <a href="https://arxiv.org/pdf/2511.21541"><img src="https://img.shields.io/badge/ArXiv-2511.21541-red"></a> &ensp;
  <a href="https://huggingface.co/tencent/HY-Video-PRFL"><img src="https://img.shields.io/badge/๐Ÿค—%20HuggingFace-Model-yellow"></a>
</div>

<br>

![image](assets/teaser.jpg)

> [**HY-Video-PRFL: Video Generation Models Are Good Latent Reward Models**](https://arxiv.org/pdf/2511.21541)

## 🔥🔥🔥 News!!

* **Dec 07, 2025**: 👋 We release the training and inference code of HY-Video-PRFL.
* **Nov 26, 2025**: 👋 We release the paper and project page. [[Paper](https://arxiv.org/pdf/2511.21541)] [[Project Page](https://hy-video-prfl.github.io/HY-VIDEO-PRFL/)]
## 📑 Open-source Plan

- HY-Video-PRFL
  - [x] Training and inference code for PAVRM
  - [x] Training and inference code for PRFL
  
## 📋 Table of Contents

- [🔥🔥🔥 News!!](#-news)
- [📑 Open-source Plan](#-open-source-plan)
- [📖 Abstract](#-abstract)
- [🏗️ Model Architecture](#-model-architecture)
- [📊 Performance](#-performance)
- [🎬 Case Show](#-case-show)
- [📜 Requirements](#-requirements)
- [🛠️ Installation](#-installation)
- [🧱 Download Models](#-download-models)
- [🎓 Training](#-training)
- [🚀 Inference](#-inference)
- [📝 Citation](#-citation)
- [🙏 Acknowledgements](#-acknowledgements)



## 📖 Abstract

Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding.

**HY-Video-PRFL** introduces **Process Reward Feedback Learning (PRFL)**, a framework that conducts preference optimization entirely in latent space. We demonstrate that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding.

**Key advantages:**
- ✅ Efficient latent-space optimization
- ✅ Significant memory savings
- ✅ 1.4× faster training compared to RGB ReFL
- ✅ Better alignment with human preferences


## ๐Ÿ—๏ธ Model Architecture

![image](assets/method.png)

**Traditional RGB ReFL** relies on vision-language models designed for pixel-space inputs, requiring expensive VAE decoding and confining optimization to late-stage denoising steps.

**Our PRFL approach** leverages pre-trained video generation models as reward models in the noisy latent space. This enables:
- Full-chain gradient backpropagation without VAE decoding
- Early-stage supervision for motion dynamics and structure coherence
- Substantial reductions in memory consumption and training time
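As a control-flow illustration only (every function name here is a hypothetical stand-in, not the repository's API), the difference between the two regimes can be sketched as:

```python
# Toy sketch contrasting RGB ReFL with PRFL (all names hypothetical).
# The point is control flow: PRFL scores noisy latents directly, so the
# expensive VAE decode never runs, and every denoising step can receive
# reward feedback rather than only the final near-clean steps.

calls = []  # records which stages were executed

def denoise_step(latent, t):
    calls.append(f"denoise@{t}")
    return latent  # identity stand-in for one DiT denoising step

def vae_decode(latent):
    calls.append("vae_decode")  # expensive pixel-space decode
    return latent

def pixel_reward(frames):
    calls.append("pixel_reward")  # VLM-based reward on decoded frames
    return 0.0

def latent_reward(latent, t):
    calls.append(f"latent_reward@{t}")  # reward scored on the noisy latent
    return 0.0

def rgb_refl_step(latent, timesteps):
    # Traditional ReFL: finish denoising, decode to pixels, then score once.
    for t in timesteps:
        latent = denoise_step(latent, t)
    return pixel_reward(vae_decode(latent))

def prfl_step(latent, timesteps):
    # PRFL: reward feedback at every step of the chain; no decode at all.
    total = 0.0
    for t in timesteps:
        latent = denoise_step(latent, t)
        total += latent_reward(latent, t)
    return total
```

Running `prfl_step` never touches `vae_decode`, which is what makes full-chain backpropagation affordable in memory and time.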



## 📊 Performance

### Quantitative Results

Our experiments demonstrate that PRFL achieves substantial motion-quality improvements (+56.00 in dynamic degree, +21.52 in human anatomy, and superior alignment with human preferences) as well as significant efficiency gains (at least 1.4× faster training and notable memory savings).

#### Text-to-Video Results
![image](assets/T2V_exp.png)

#### Image-to-Video Results
![image](assets/I2V_exp.png)

#### Efficiency Comparison
<img src="assets/efficiency.png" width="50%">

## 🎬 Case Show

### Text-to-Video Generation

|480P Resolution|720P Resolution|
|---|---|
|<video src="https://github.com/user-attachments/assets/eed8d875-4b0d-43ec-b013-f0d10c2e107a" width="600" controls autoplay loop></video> <details><summary>📋 Show prompt</summary>```Two shirtless men with short dark hair are sparring in a dimly lit room. They are both wearing boxing gloves, one red and one black. One man is wearing white shorts while the other is wearing black shorts. There are several screens on the wall displaying images of buildings and people.```</details>|<video src="https://github.com/user-attachments/assets/4871a9f9-9b15-4065-8680-1c8059242707" width="600" controls autoplay loop></video> <details><summary>📋 Show prompt</summary>```A woman with fair skin, dark hair tied back, and wearing a light green t-shirt is visible against a gray background. She uses both hands to apply a white substance from below her eyes upward onto her face. Her mouth is slightly open as she spreads the cream.```</details>|
|<video src="https://github.com/user-attachments/assets/27296444-973b-4815-b103-5a2ee06404db" width="600" controls autoplay loop></video> <details><summary>📋 Show prompt</summary>```The woman has dark eyes and is holding a black smartphone to her ear with her right hand. She is typing on the keyboard of an open silver laptop computer with her left hand. Her fingers have blue nail polish. She is sitting in front of a window covered by sheer white curtains.```</details>|<video src="https://github.com/user-attachments/assets/430ff1ca-63ae-4a67-b6fb-b010c7ceec29" width="600" controls autoplay loop></video> <details><summary>📋 Show prompt</summary>```A light-skinned man with short hair wearing a yellow baseball cap, plaid shirt, and blue overalls stands in a field of sunflowers. He holds a cut sunflower head in his left hand and touches it with his right index finger. Several other sunflowers are visible in the background, some facing away from the camera.```</details>|

### Image-to-Video Generation

|480P Resolution|720P Resolution|
|---|---|
|<img src="assets/videos/more/14_seed_876367.jpg" width="600">|<img src="assets/videos/more/109_seed_677347.jpg" width="600">|
|<video src="https://github.com/user-attachments/assets/956f5f64-1680-45a0-8666-9fda8e253017" width="600" controls autoplay loop></video> <details><summary>📋 Show prompt</summary>```A monochromatic video capturing a cat's gaze into the camera```</details>|<video src="https://github.com/user-attachments/assets/8dfeb1ae-8b9c-45aa-899a-3b48903629f9" width="600" controls autoplay loop></video> <details><summary>📋 Show prompt</summary>```A young boy is jumping in the mud```</details>|
|<img src="assets/videos/more/artlist_video_60ca5873a5c21ff4e4785b1997239d87_seed_561368.jpg" width="600">|<img src="assets/videos/more/real_1246_seed_277973.jpg" width="600">|
|<video src="https://github.com/user-attachments/assets/d6d48d0c-cca5-4bdc-95c7-6a2858679111" width="600" controls autoplay loop></video> <details><summary>📋 Show prompt</summary>```A family of four eats fast food at a table.```</details>| <video src="https://github.com/user-attachments/assets/f39d3382-5430-412d-999e-c4c108835b6c" width="600" controls autoplay loop></video> <details><summary>📋 Show prompt</summary>```Normal speed, Medium shot, Eye level angle, Third person viewpoint, Static camera movement, Frame-within-frame composition, Shallow depth of field, Natural light, Cinematic style, Desaturated palette with slate blue, dusty rose, and dark wood tones color palette, Dramatic atmosphere. The scene is set on a patio or veranda, framed by a stone archway. In the back, there is a large, weathered wooden gate set into a stone wall. Six people are gathered on a stone patio in front of a large wooden gate. On the right, two men are seated at a dark wooden table. An older man in a grey traditional jacket holds a cane and gestures with his right hand while speaking. A younger man in a light grey suit sits beside him, listening. On the left side of the frame, a man in a dark suit stands with his back to the camera. Next to him, a woman in a pink patterned cheongsam and a woman in a grey skirt suit are standing close together, whispering. The women then turn and smile towards the men at the table. The man in the dark suit turns to face the group, revealing a newborn baby cradled in his arms, wrapped in a pink blanket. He takes a few steps forward, holding the baby. The women look at him and the infant. The older man at the table continues to talk, now gesturing towards the man with the baby. The man holding the baby looks down at the infant as he continues to walk slowly. The table is set with white cups, plates, fruit, and a dark wooden box.```</details>|



## 📜 Requirements

### Hardware Requirements

We recommend using GPUs with at least 80GB of memory for better generation quality.

### Software Requirements

* **OS**: Linux
* **CUDA**: 12.4

## 🛠️ Installation

### Step 1: Clone Repository
```bash
git clone https://github.com/Tencent-Hunyuan/HY-Video-PRFL.git
cd HY-Video-PRFL
```

### Step 2: Setup Environment

We recommend CUDA 12.4 for installation. Conda installation instructions are available [here](https://www.anaconda.com/docs/main).
```bash
# Create conda environment
conda create -n HY-Video-PRFL python==3.10

# Activate environment
conda activate HY-Video-PRFL

# Install PyTorch and dependencies (cu121 wheels; compatible with CUDA 12.4 drivers)
pip3 install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121

# Install additional dependencies
pip3 install git+https://github.com/huggingface/transformers qwen-vl-utils[decord]
pip3 install git+https://github.com/huggingface/diffusers
pip3 install xfuser -i https://pypi.org/simple
pip3 install flash-attn==2.5.0 --no-build-isolation
pip3 install -e .
pip3 install nvidia-cublas-cu12==12.4.5.8

export PYTHONPATH=./
```
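A quick, hypothetical sanity check (not part of the repository) that the packages installed above can be located, without the cost of actually importing them:

```python
# Post-install sanity check: verify the key packages from the steps above
# are discoverable on sys.path. find_spec locates a package without
# importing it, so this is cheap and safe to run anywhere.
from importlib.util import find_spec

REQUIRED = ["torch", "torchvision", "torchaudio", "transformers",
            "diffusers", "xfuser", "flash_attn"]

def missing_packages(names):
    """Return the subset of `names` that cannot be found on sys.path."""
    return [n for n in names if find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages found.")
```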

## 🧱 Download Models

Download the pretrained models before training or inference:

| Model | Resolution | Download Links | Notes |
|-------|-----------|----------------|-------|
| **Wan2.1-T2V-14B** | 480P & 720P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) <br> 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-14B) | Text-to-Video model |
| **Wan2.1-I2V-14B-720P** | 720P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P) <br> 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P) | Image-to-Video (High-res) |
| **Wan2.1-I2V-14B-480P** | 480P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) <br> 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-480P) | Image-to-Video (Standard) |

First, make sure you have installed the huggingface CLI or the modelscope CLI.
```bash
pip install -U "huggingface_hub[cli]"
pip install modelscope
```
Then, download the pretrained DiT and VAE checkpoints. For example, the following command downloads the Wan2.1 checkpoint for the 720P I2V task to ```./weights/Wan2.1-I2V-14B-720P```:
```bash
hf download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./weights/Wan2.1-I2V-14B-720P
```

## 🎓 Training

### 1️⃣ Data Preprocessing on a Single GPU
```bash
python3 scripts/preprocess/gen_wanx_latent.py --config configs/pre_480.yaml
```

We provide several videos in ```temp_data/videos``` as template training data, along with a template input JSON file ```temp_data/temp_input_data.json``` for preprocessing. ```configs/pre_480.yaml``` is for 480P latent extraction and ```configs/pre_720.yaml``` is for 720P. The ```json_path``` and ```save_dir``` fields in the config file can be customized for your own training data.

### 2️⃣ Data Annotation and Format Conversion

Reward-model annotations (e.g., ```"physics_quality": 1, "human_quality": 1```) should be added to the data meta files (e.g., ```temp_data/480/meta_v1/0004e625d5bcb80130e1ea3d204e2488_meta_v1.json```). This yields the meta-file lists ```temp_data/temp_data_480.list``` and ```temp_data/temp_data_720.list```, which are used for PAVRM and PRFL training.

### 3️⃣ Parallel PAVRM Training on Multiple GPUs

For example, to train PAVRM with 8 GPUs, you can use the following command.

```bash
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29500 scripts/pavrm/train_pavrm.py --config configs/train_pavrm_i2v_720.yaml
```

The ```meta_file_list``` and ```val_meta_file_list``` fields in the config file can be customized with your own training and validation data. We provide config files for the different settings (T2V or I2V, 480P or 720P). Note that PAVRM is trained with cross-entropy (CE) loss by default; to train PAVRM with Bradley-Terry (BT) loss instead, use the config file ```configs/train_pavrm_bt_i2v_720.yaml```.

### 4️⃣ Parallel PRFL Training on Multiple GPUs

```bash
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29500 scripts/prfl/train_prfl.py --config configs/train_prfl_i2v_720.yaml
```

The ```meta_file_list``` field in the config file can be customized with your own training data; ```lrm_transformer_path```, ```lrm_mlp_path```, and ```lrm_query_attention_path``` point to the reward model obtained in the previous step. We provide config files for the different settings (T2V or I2V, 480P or 720P).


## 🚀 Inference

### 1️⃣ Parallel PAVRM Inference on Multiple GPUs

```bash
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29500 scripts/pavrm/inference_pavrm.py --config configs/infer_pavrm_i2v_720.yaml
```

The ```val_meta_file_list``` field in the config file can be customized with your own inference data; ```resume_transformer_path```, ```resume_mlp_path```, and ```resume_query_attention_path``` point to the reward model to be evaluated.

### 2️⃣ Parallel PRFL Inference on Multiple GPUs

PRFL inference is exactly the same as for its base model (e.g., Wan2.1).

```bash
export negative_prompt="色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走"
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29500 scripts/prfl/inference_prfl.py \
    --dit_fsdp \
    --t5_fsdp \
    --ulysses_size 1 \
    --task "i2v-14B"\
    --ckpt_dir "weights/Wan2.1-I2V-14B-720P" \
    --lora_path "" \
    --lora_alpha 0 \
    --dataset_path "temp_data/temp_prfl_infer_data.json" \
    --negative_prompt "$negative_prompt" \
    --size "1280*720" \
    --frame_num 81 \
    --sample_steps 40 \
    --sample_guide_scale 5.0 \
    --sample_shift 5.0 \
    --teacache_thresh 0 \
    --save_folder outputs/infer/prfl_i2v_720 \
    --transformer_path <YOUR_CKPT_PATH> \
    --offload_model False
```

**Parameters:**
- `--dit_fsdp` `--t5_fsdp`: Enable FSDP for memory efficiency
- `--task`: "t2v-14B" or "i2v-14B"
- `--ckpt_dir`: Path to the pretrained checkpoint directory
- `--lora_path` `--lora_alpha`: Path to and weight ratio for a LoRA checkpoint file
- `--dataset_path`: Path to inference dataset file
- `--size`: Output resolution ("1280\*720" or "832\*480")
- `--frame_num`: Number of frames to generate (default: 81)
- `--sample_steps`: Number of inference steps (default: 40)
- `--sample_guide_scale`: Classifier-free guidance scale (default: 5.0)
- `--sample_shift`: Flow shift (default: 5.0)
- `--save_folder`: Path to save generated videos
- `--teacache_thresh`: TeaCache acceleration threshold
- `--transformer_path`: Path to your PRFL checkpoint file
- `--offload_model`: Offload to CPU to save GPU memory
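For intuition about the sizes behind these flags, here is a back-of-the-envelope latent-shape calculation. It assumes the Wan2.1 VAE's 4× temporal / 8× spatial compression (an assumption to verify against your checkpoint) and also shows why `--frame_num` values take the form 4n+1 (e.g., 81):

```python
# Rough latent-shape arithmetic, assuming 4x temporal / 8x spatial VAE
# compression (verify against the checkpoint you actually use).
def latent_shape(frame_num, width, height, t_stride=4, s_stride=8):
    # VAEs of this family keep the first frame and compress the rest in
    # groups of t_stride, hence frame_num must be t_stride * n + 1.
    assert (frame_num - 1) % t_stride == 0, "frame_num should be t_stride*n + 1"
    t = (frame_num - 1) // t_stride + 1
    return (t, height // s_stride, width // s_stride)

print(latent_shape(81, 1280, 720))  # -> (21, 90, 160) for --size "1280*720"
print(latent_shape(81, 832, 480))   # -> (21, 60, 104) for --size "832*480"
```

Scoring rewards directly on a 21×90×160 latent, as PRFL does, avoids decoding 81 RGB frames at 1280×720, which is where the reported memory and speed savings come from.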

## ๐Ÿ“ Citation

If you find **HY-Video-PRFL** useful for your research, please cite:
```bibtex
@article{mi2025video,
  title={Video Generation Models are Good Latent Reward Models},
  author={Mi, Xiaoyue and Yu, Wenqing and Lian, Jiesong and Jie, Shibo and Zhong, Ruizhe and Liu, Zijun and Zhang, Guozhen and Zhou, Zixiang and Xu, Zhiyong and Zhou, Yuan and Lu, Qinglin and Tang, Fan},
  journal={arXiv preprint arXiv:2511.21541},
  year={2025}
}
```


## 🙏 Acknowledgements

We sincerely thank the contributors to the following projects:
- [FastVideo](https://github.com/hao-ai-lab/FastVideo)
- [HunyuanVideo](https://github.com/Tencent/HunyuanVideo)
- [Wan2.1](https://github.com/Wan-Video/Wan2.1)
- [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium)
- [ImageReward](https://github.com/THUDM/ImageReward)
- [Diffusers](https://github.com/huggingface/diffusers)
- [HuggingFace](https://huggingface.co)
- [DeepSpeed](https://github.com/deepspeedai/DeepSpeed)




---

<div align="center">
  
**Star ⭐ this repo if you find it helpful!**

</div>