# Wan2.2-Fun-Reward-LoRAs
## Introduction
We explore the Reward Backpropagation technique <sup>[1](#ref1) [2](#ref2)</sup> to optimize the videos generated by [Wan2.2-Fun](https://github.com/aigc-apps/VideoX-Fun) for better alignment with human preferences.
We provide the following pre-trained models (i.e. LoRAs) along with [the training script](https://github.com/aigc-apps/VideoX-Fun/blob/main/scripts/wan2.2_fun/train_reward_lora.py). You can use these LoRAs to enhance the corresponding base model as a plug-in or train your own reward LoRA.
For more details, please refer to our [GitHub repo](https://github.com/aigc-apps/VideoX-Fun).
| Name | Base Model | Reward Model | Hugging Face | Description |
|--|--|--|--|--|
| Wan2.2-Fun-A14B-InP-high-noise-HPS2.1.safetensors | [Wan2.2-Fun-A14B-InP (high noise)](https://huggingface.co/alibaba-pai/Wan2.2-Fun-A14B-InP/tree/main/high_noise_model) | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.2-Fun-Reward-LoRAs/resolve/main/Wan2.2-Fun-A14B-InP-high-noise-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for Wan2.2-Fun-A14B-InP (high noise). It is trained with a batch size of 8 for 5,000 steps.|
| Wan2.2-Fun-A14B-InP-low-noise-HPS2.1.safetensors | [Wan2.2-Fun-A14B-InP (low noise)](https://huggingface.co/alibaba-pai/Wan2.2-Fun-A14B-InP/tree/main/low_noise_model) | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.2-Fun-Reward-LoRAs/resolve/main/Wan2.2-Fun-A14B-InP-low-noise-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for Wan2.2-Fun-A14B-InP (low noise). It is trained with a batch size of 8 for 2,700 steps.|
| Wan2.2-Fun-A14B-InP-high-noise-MPS.safetensors | [Wan2.2-Fun-A14B-InP (high noise)](https://huggingface.co/alibaba-pai/Wan2.2-Fun-A14B-InP/tree/main/high_noise_model) | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.2-Fun-Reward-LoRAs/resolve/main/Wan2.2-Fun-A14B-InP-high-noise-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for Wan2.2-Fun-A14B-InP (high noise). It is trained with a batch size of 8 for 5,000 steps.|
| Wan2.2-Fun-A14B-InP-low-noise-MPS.safetensors | [Wan2.2-Fun-A14B-InP (low noise)](https://huggingface.co/alibaba-pai/Wan2.2-Fun-A14B-InP/tree/main/low_noise_model) | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.2-Fun-Reward-LoRAs/resolve/main/Wan2.2-Fun-A14B-InP-low-noise-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for Wan2.2-Fun-A14B-InP (low noise). It is trained with a batch size of 8 for xxx steps.|
> [!NOTE]
> We find that the MPS reward LoRA for the low-noise model converges significantly more slowly than the other LoRAs and may not deliver satisfactory results. Therefore, for the low-noise model, we recommend the HPS v2.1 reward LoRA.
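If you prefer to fetch the weights programmatically rather than through the links above, the snippet below is a minimal sketch using `huggingface_hub`; the repository id and filenames come from the table, and the returned paths point into the local Hugging Face cache.

```python
# Minimal sketch: download reward LoRAs from the Hugging Face Hub.
# Filenames follow the table above; pick the ones matching your base model.
from huggingface_hub import hf_hub_download

lora_path = hf_hub_download(
    repo_id="alibaba-pai/Wan2.2-Fun-Reward-LoRAs",
    filename="Wan2.2-Fun-A14B-InP-low-noise-HPS2.1.safetensors",
)
lora_high_path = hf_hub_download(
    repo_id="alibaba-pai/Wan2.2-Fun-Reward-LoRAs",
    filename="Wan2.2-Fun-A14B-InP-high-noise-HPS2.1.safetensors",
)
print(lora_path, lora_high_path)  # local cache paths to the two LoRA files
```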
## Demo
Please refer to the [demo on Hugging Face](https://huggingface.co/alibaba-pai/Wan2.2-Fun-Reward-LoRAs#demo).
## Quick Start
Set `lora_path` and `lora_weight` for the low-noise reward LoRA, and `lora_high_path` and `lora_high_weight` for the high-noise reward LoRA, in [examples/wan2.2_fun/predict_t2v.py](https://github.com/aigc-apps/VideoX-Fun/blob/main/examples/wan2.2_fun/predict_t2v.py).
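For illustration, here is a hedged sketch of those assignments; the surrounding code in `predict_t2v.py` may differ, and the `0.55` weights are placeholder values to tune for your own prompts rather than recommended settings.

```python
# Sketch of the reward-LoRA settings in examples/wan2.2_fun/predict_t2v.py.
# Paths assume the files were downloaded as shown in the Introduction;
# the 0.55 weights are illustrative only.
lora_path        = "Wan2.2-Fun-A14B-InP-low-noise-HPS2.1.safetensors"   # low-noise reward LoRA
lora_weight      = 0.55
lora_high_path   = "Wan2.2-Fun-A14B-InP-high-noise-HPS2.1.safetensors"  # high-noise reward LoRA
lora_high_weight = 0.55
```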
## Training
The training code is based on [train_lora.py](./train_lora.py). We provide a shell script to train the HPS v2.1 reward LoRA for the low-noise model of Wan2.2-Fun-A14B-InP, which can be trained on a single node with 8 A100 GPUs (80 GB VRAM each). Training a reward LoRA for the high-noise model requires DeepSpeed ZeRO-3 with CPU offload.
Please refer to [Setup](https://github.com/aigc-apps/VideoX-Fun/blob/main/scripts/cogvideox_fun/README_TRAIN_REWARD.md#setup) and [Important Args](https://github.com/aigc-apps/VideoX-Fun/blob/main/scripts/cogvideox_fun/README_TRAIN_REWARD.md#important-args) before training.
## Limitations
1. We observe that after training for a certain number of steps, the reward keeps increasing while the quality of the generated videos stops improving. The model learns to exploit shortcuts (adding artifacts in the background, i.e., adversarial patches) to increase the reward.
2. Currently, there is still a lack of suitable preference models for video generation. Directly using image preference models cannot evaluate preferences along the temporal dimension (such as dynamism and consistency). Furthermore, we find that using image preference models reduces the dynamism of generated videos. Although this can be mitigated by computing the reward on only the first frame of the decoded video, the impact still persists.
## Reference
<ol>
<li id="ref1">Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards." In ICLR 2024.</li>
<li id="ref2">Prabhudesai, Mihir, et al. "Aligning text-to-image diffusion models with reward backpropagation." arXiv preprint arXiv:2310.03739 (2023).</li>
</ol>