---
pipeline_tag: image-to-video
library_name: diffusers
license: apache-2.0
tags:
- text-to-video
- diffusion-models
- video-generation
---
# Pusa VidGen
Code Repository | Project Page | Model Hub | Training Toolkit | Dataset | Pusa Paper | FVDM Paper | Follow on X | Xiaohongshu
## Overview

Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control, departing from the conventional approach of applying a single noise level to the whole clip. This shift was first presented in our FVDM paper. Leveraging this architecture, Pusa seamlessly supports diverse video generation tasks (Text/Image/Video-to-Video) while maintaining strong motion fidelity and prompt adherence through our refined base model adaptations. Pusa-V0.5 is an early preview built on Mochi1-Preview. We are open-sourcing this work to foster community collaboration, improve the methodology, and expand its capabilities.
## ✨ Key Features
**Comprehensive Multi-task Support:**
- Text-to-Video generation
- Image-to-Video transformation
- Frame interpolation
- Video transitions
- Seamless looping
- Extended video generation
- And more...
**Unprecedented Efficiency:**
- Trained with only 0.1k H800 GPU hours
- Total training cost: $0.1k
- Hardware: 16 H800 GPUs
- Configuration: Batch size 32, 500 training iterations, 1e-5 learning rate
- Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!
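As a rough sanity check (the wall-clock figure below is inferred from the numbers above, not an officially reported value), 0.1k H800 GPU hours spread over 16 GPUs works out to just over six hours of training:

```python
# Back-of-envelope check of the efficiency figures above; the wall-clock
# estimate is an inference, not an officially reported number.
gpu_hours = 100   # "0.1k H800 GPU hours"
num_gpus = 16     # "Hardware: 16 H800 GPUs"
wall_clock_hours = gpu_hours / num_gpus
print(wall_clock_hours)  # → 6.25
```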
**Complete Open-Source Release:**
- Full codebase
- Detailed architecture specifications
- Comprehensive training methodology
## 🔍 Unique Architecture
**Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the FVDM paper, enabling unprecedented flexibility and scalability.

**Non-destructive Modification**: Our adaptations preserve the base model's original Text-to-Video generation capabilities; after adaptation, only light fine-tuning is needed.

**Universal Applicability**: The methodology can be readily applied to other leading video diffusion models, including Hunyuan Video, Wan2.1, and others. Collaborations are enthusiastically welcomed!
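To make the frame-level idea concrete, here is a minimal NumPy sketch (an illustration only, not the actual Pusa implementation; the linear noising rule and all names are assumptions): conventional video diffusion applies one scalar timestep to every frame, while a vectorized timestep assigns each frame its own noise level, for example keeping a conditioning frame clean.

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, height, width, channels = 4, 8, 8, 3
video = rng.standard_normal((num_frames, height, width, channels))
noise = rng.standard_normal(video.shape)

def add_noise(frames, noise, t):
    """Toy linear noising: t = 0 keeps a frame clean, t = 1 is pure noise."""
    t = np.asarray(t, dtype=float).reshape(-1, 1, 1, 1)  # one timestep per frame
    return (1.0 - t) * frames + t * noise

# Conventional: a single scalar timestep broadcast to every frame.
noisy_uniform = add_noise(video, noise, np.full(num_frames, 0.7))

# Vectorized timesteps: frame 0 stays clean (e.g. an image condition),
# later frames are progressively noisier.
vector_t = np.array([0.0, 0.3, 0.6, 0.9])
noisy_per_frame = add_noise(video, noise, vector_t)

# The conditioning frame is untouched when its timestep is 0.
assert np.allclose(noisy_per_frame[0], video[0])
```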
## Installation and Usage

### Download Weights

**Option 1**: Use the Hugging Face CLI:
```shell
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```
**Option 2**: Download the weights directly from Hugging Face to your local machine.
### Sample Usage: Image-to-Video Generation

The following example performs image-to-video generation with the Pusa model. First, install the required libraries and the genmo models package:
```shell
pip install torch transformers diffusers pillow imageio numpy torchvision
pip install uv  # for installing the genmo models package

# Clone the genmo models repository
git clone https://github.com/genmoai/models
cd models
uv venv .venv
source .venv/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation
# If you want flash attention:
# uv pip install -e .[flash] --no-build-isolation
```
Now run the Python script below, replacing the placeholder paths with the actual local path where you downloaded the model weights.
```python
import os

import imageio
import torch
import torchvision.transforms as T
from PIL import Image
from safetensors.torch import load_file
from torchvision.transforms.functional import InterpolationMode

from genmo.mochi_preview.pipelines import MochiPipeline

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


# The helpers below provide optional tiled image preprocessing for large or
# non-square inputs; the minimal example further down does not call them.
def build_transform(input_size):
    return T.Compose([
        T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    # Pick the tiling grid whose aspect ratio best matches the input image.
    best_ratio_diff = float("inf")
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    # Split the image into a grid of image_size x image_size tiles.
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    target_ratios = sorted(
        {(i, j)
         for n in range(min_num, max_num + 1)
         for i in range(1, n + 1)
         for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda x: x[0] * x[1])
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size,
        )
        processed_images.append(resized_img.crop(box))
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        processed_images.append(image.resize((image_size, image_size)))
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert("RGB")
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    return torch.stack([transform(img) for img in images])


# Load the pipeline; pass the Hub id or the local path to the downloaded weights.
pipeline = MochiPipeline.from_pretrained(
    "RaphaelLiu/Pusa-V0.5",  # or "/path/to/Pusa-V0.5"
    torch_dtype=torch.float16,
)
pipeline.to("cuda")  # move the pipeline to GPU

# Load the additional DiT weights for Pusa. Note that .safetensors files are
# loaded with safetensors.torch.load_file, not torch.load.
dit_weights_path = "<path_to_downloaded_directory>/pusa_v0_dit.safetensors"
pipeline.transformer.load_state_dict(load_file(dit_weights_path), strict=False)

# Example generation parameters
prompt = "The camera remains still, the man is surfing on a wave with his surfboard."
cond_position = 0       # index of the conditioning frame
num_steps = 30          # number of denoising steps
noise_multiplier = 0.4  # noise level applied to the conditioning frame

# Use your own .jpg image here; a solid-color placeholder is created only if
# the example image is missing.
try:
    image = Image.open("./demos/example.jpg").convert("RGB")  # assumes the Pusa-VidGen repo root
except FileNotFoundError:
    print("Example image not found. Creating a placeholder image for demonstration.")
    image = Image.new("RGB", (512, 512), color="red")
    os.makedirs("./demos", exist_ok=True)
    image.save("./demos/example.jpg")

# Preprocess the image with the pipeline's feature extractor
image_tensor = pipeline.feature_extractor.preprocess(image, return_tensors="pt").pixel_values
image_tensor = image_tensor.to(pipeline.device, pipeline.dtype)

# Generate video
video_frames = pipeline(
    prompt=prompt,
    image=image_tensor,
    cond_position=cond_position,
    num_inference_steps=num_steps,
    noise_multiplier=noise_multiplier,
    generator=torch.Generator(device=pipeline.device).manual_seed(0),
).frames[0]

# Save the frames as a GIF (requires imageio and Pillow)
output_gif_path = "output_video.gif"
imageio.mimsave(output_gif_path, [Image.fromarray(f) for f in video_frames], fps=10)
print(f"Video saved to {output_gif_path}")
```
## Limitations
Pusa currently has several known limitations:
- The base Mochi model generates videos at relatively low resolution (480p)
- We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
- We welcome community contributions to enhance model performance and extend its capabilities
## Related Work
- **FVDM**: Introduces the frame-level noise control with vectorized timesteps that inspired Pusa.
- **Mochi**: Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
## Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}

@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```