Paper: [Aligning Text-to-Image Diffusion Models with Reward Backpropagation](https://huggingface.co/papers/2310.03739)
We explore the Reward Backpropagation technique [1][2] to optimize the videos generated by EasyAnimateV5 for better alignment with human preferences. We provide pre-trained models (i.e., reward LoRAs) along with the training script. You can use these LoRAs as plug-ins to enhance the corresponding base model, or train your own reward LoRA.
For more details, please refer to our GitHub repo.
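For context, Reward Backpropagation trains the LoRA by keeping the denoising chain differentiable and pushing the gradient of a differentiable reward model (e.g., HPS v2.1 or MPS) back through the sampled video into the LoRA parameters. The sketch below only illustrates the idea; `denoiser_lora`, `scheduler`, `reward_model`, and `decode_latents` are placeholder names, not the training script shipped in this repo.

```python
import torch

def reward_backprop_step(denoiser_lora, scheduler, reward_model, decode_latents,
                         prompt_emb, optimizer, latent_shape, device="cuda"):
    """One conceptual optimization step of Reward Backpropagation.

    All arguments are placeholders: a LoRA-augmented denoiser, a diffusers-style
    scheduler, a differentiable reward model (e.g. HPS v2.1 / MPS), and a VAE
    decoder. This is an illustration, not the shipped training script.
    """
    latents = torch.randn(latent_shape, device=device)
    for t in scheduler.timesteps:
        # Keep every denoising step on the autograd graph so the reward
        # gradient can flow all the way back into the LoRA weights.
        noise_pred = denoiser_lora(latents, t, prompt_emb)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    frames = decode_latents(latents)                  # latents -> video frames
    loss = -reward_model(frames, prompt_emb).mean()   # maximize the reward
    optimizer.zero_grad()
    loss.backward()  # only the LoRA parameters are trainable
    optimizer.step()
    return loss.item()
```

Backpropagating through the full sampling chain is memory-heavy, so implementations typically rely on gradient checkpointing and/or truncate the backward pass to the last few denoising steps.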
| Name | Base Model | Reward Model | Hugging Face | Description |
|---|---|---|---|---|
| EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors | EasyAnimateV5-12b-zh-InP | HPS v2.1 | 🤗Link | Official HPS v2.1 reward LoRA (rank=128 and network_alpha=64) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps. |
| EasyAnimateV5-7b-zh-InP-HPS2.1.safetensors | EasyAnimateV5-7b-zh-InP | HPS v2.1 | 🤗Link | Official HPS v2.1 reward LoRA (rank=128 and network_alpha=64) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 3,500 steps. |
| EasyAnimateV5-12b-zh-InP-MPS.safetensors | EasyAnimateV5-12b-zh-InP | MPS | 🤗Link | Official MPS reward LoRA (rank=128 and network_alpha=64) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps. |
| EasyAnimateV5-7b-zh-InP-MPS.safetensors | EasyAnimateV5-7b-zh-InP | MPS | 🤗Link | Official MPS reward LoRA (rank=128 and network_alpha=64) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 2,000 steps. |
| Prompt | EasyAnimateV5-12b-zh-InP | EasyAnimateV5-12b-zh-InP HPSv2.1 Reward LoRA | EasyAnimateV5-12b-zh-InP MPS Reward LoRA |
|---|---|---|---|
| Porcelain rabbit hopping by a golden cactus | | | |
| Yellow rubber duck floating next to a blue bath towel | | | |
| An elephant sprays water with its trunk, a lion sitting nearby | | | |
| A fish swims gracefully in a tank as a horse gallops outside | | | |
| Prompt | EasyAnimateV5-7b-zh-InP | EasyAnimateV5-7b-zh-InP HPSv2.1 Reward LoRA | EasyAnimateV5-7b-zh-InP MPS Reward LoRA |
|---|---|---|---|
| Crystal cake shimmering beside a metal apple | | | |
| Elderly artist with a white beard painting on a white canvas | | | |
| Porcelain rabbit hopping by a golden cactus | | | |
| Green parrot perching on a brown chair | | | |
The test prompts above are from T2V-CompBench. All videos are generated with a LoRA weight of 0.7.
We provide example inference code for running EasyAnimateV5-12b-zh-InP with its HPS v2.1 reward LoRA:
```python
import torch
from diffusers import DDIMScheduler
from omegaconf import OmegaConf
from transformers import BertModel, BertTokenizer, T5EncoderModel, T5Tokenizer

from easyanimate.models import AutoencoderKLMagvit, EasyAnimateTransformer3DModel
from easyanimate.pipeline.pipeline_easyanimate_multi_text_encoder_inpaint import EasyAnimatePipeline_Multi_Text_Encoder_Inpaint
from easyanimate.utils.lora_utils import merge_lora
from easyanimate.utils.utils import get_image_to_video_latent, save_videos_grid
from easyanimate.utils.fp8_optimization import convert_weight_dtype_wrapper

# GPU memory mode; choose from [model_cpu_offload, model_cpu_offload_and_qfloat8, sequential_cpu_offload].
GPU_memory_mode = "model_cpu_offload"
# Download from https://raw.githubusercontent.com/aigc-apps/EasyAnimate/refs/heads/main/config/easyanimate_video_v5_magvit_multi_text_encoder.yaml
config_path = "config/easyanimate_video_v5_magvit_multi_text_encoder.yaml"
model_path = "alibaba-pai/EasyAnimateV5-12b-zh-InP"
lora_path = "alibaba-pai/EasyAnimateV5-Reward-LoRAs/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors"
weight_dtype = torch.bfloat16
lora_weight = 0.7
prompt = "A panda eats bamboo while a monkey swings from branch to branch"
sample_size = [512, 512]
video_length = 49

config = OmegaConf.load(config_path)
transformer_additional_kwargs = OmegaConf.to_container(config['transformer_additional_kwargs'])
if weight_dtype == torch.float16:
    transformer_additional_kwargs["upcast_attention"] = True

# Load the transformer; quantize its weights to float8 when using the qfloat8 offload mode.
transformer = EasyAnimateTransformer3DModel.from_pretrained_2d(
    model_path,
    subfolder="transformer",
    transformer_additional_kwargs=transformer_additional_kwargs,
    torch_dtype=torch.float8_e4m3fn if GPU_memory_mode == "model_cpu_offload_and_qfloat8" else weight_dtype,
    low_cpu_mem_usage=True,
)
vae = AutoencoderKLMagvit.from_pretrained(
    model_path, subfolder="vae", vae_additional_kwargs=OmegaConf.to_container(config['vae_kwargs'])
).to(weight_dtype)
if config['vae_kwargs'].get('vae_type', 'AutoencoderKL') == 'AutoencoderKLMagvit' and weight_dtype == torch.float16:
    vae.upcast_vae = True

pipeline = EasyAnimatePipeline_Multi_Text_Encoder_Inpaint.from_pretrained(
    model_path,
    text_encoder=BertModel.from_pretrained(model_path, subfolder="text_encoder").to(weight_dtype),
    text_encoder_2=T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder_2").to(weight_dtype),
    tokenizer=BertTokenizer.from_pretrained(model_path, subfolder="tokenizer"),
    tokenizer_2=T5Tokenizer.from_pretrained(model_path, subfolder="tokenizer_2"),
    vae=vae,
    transformer=transformer,
    scheduler=DDIMScheduler.from_pretrained(model_path, subfolder="scheduler"),
    torch_dtype=weight_dtype,
)

# Apply the offloading strategy selected by GPU_memory_mode.
if GPU_memory_mode == "sequential_cpu_offload":
    pipeline.enable_sequential_cpu_offload()
elif GPU_memory_mode == "model_cpu_offload_and_qfloat8":
    pipeline.enable_model_cpu_offload()
    convert_weight_dtype_wrapper(pipeline.transformer, weight_dtype)
else:
    pipeline.enable_model_cpu_offload()

# Merge the reward LoRA into the pipeline at the chosen strength.
pipeline = merge_lora(pipeline, lora_path, lora_weight)

generator = torch.Generator(device="cuda").manual_seed(42)
# Pure text-to-video: pass None for the start/end images to build blank video latents and mask.
input_video, input_video_mask, _ = get_image_to_video_latent(None, None, video_length=video_length, sample_size=sample_size)
sample = pipeline(
    prompt,
    video_length=video_length,
    negative_prompt="bad detailed",
    height=sample_size[0],
    width=sample_size[1],
    generator=generator,
    guidance_scale=7.0,
    num_inference_steps=50,
    video=input_video,
    mask_video=input_video_mask,
).videos
save_videos_grid(sample, "samples/output.mp4", fps=8)
```
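Note that `merge_lora` folds the LoRA deltas into the base weights scaled by `lora_weight`, so changing the strength afterwards requires undoing the merge first. Assuming the repo's `lora_utils` also provides an `unmerge_lora` helper with a matching signature (treat this as an assumption and verify against the GitHub repo), re-merging at a different strength might look like:

```python
from easyanimate.utils.lora_utils import merge_lora, unmerge_lora  # unmerge_lora assumed; check the repo

# Undo the previous merge, then re-apply the reward LoRA at a weaker strength.
pipeline = unmerge_lora(pipeline, lora_path, lora_weight)
pipeline = merge_lora(pipeline, lora_path, 0.5)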