---
pipeline_tag: image-to-video
library_name: diffusers
license: apache-2.0
tags:
  - text-to-video
  - diffusion-models
  - video-generation
---

# Pusa VidGen

Code Repository | Project Page | Model Hub | Training Toolkit | Dataset | Pusa Paper | FVDM Paper | Follow on X | Xiaohongshu

## Overview

Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control, departing from the conventional approach in which all frames share a single noise level. This shift was first presented in our FVDM paper. Leveraging this architecture, Pusa seamlessly supports diverse video generation tasks (Text/Image/Video-to-Video) while maintaining exceptional motion fidelity and prompt adherence with our refined base model adaptations. Pusa-V0.5 is an early preview based on Mochi1-Preview. We are open-sourcing this work to foster community collaboration, enhance methodologies, and expand capabilities.
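
To make the core idea concrete, here is a minimal, self-contained sketch of frame-level noise control: instead of one scalar timestep shared by all frames, each frame gets its own entry in a timestep vector. This is an illustration only, not the actual Pusa/Mochi noising code; the tensor shapes and interpolation formula are simplified assumptions.

```python
import torch

# Toy latent video: (batch, frames, channels, height, width)
B, F, C, H, W = 1, 8, 4, 16, 16
latents = torch.randn(B, F, C, H, W)
noise = torch.randn_like(latents)

# Conventional video diffusion: one scalar timestep shared by every frame.
t_shared = torch.full((F,), 0.8)

# Frame-level (vectorized) timesteps: each frame has its own noise level.
# Here frame 0 (e.g. a conditioning image) is kept almost clean.
t_frame = torch.full((F,), 0.8)
t_frame[0] = 0.05

# Simplified flow-matching-style interpolation, applied per frame.
t = t_frame.view(1, F, 1, 1, 1)
noisy_latents = (1.0 - t) * latents + t * noise
print(noisy_latents.shape)  # torch.Size([1, 8, 4, 16, 16])
```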

## ✨ Key Features

- **Comprehensive Multi-task Support:**
  - Text-to-Video generation
  - Image-to-Video transformation
  - Frame interpolation
  - Video transitions
  - Seamless looping
  - Extended video generation
  - And more...
- **Unprecedented Efficiency:**
  - Trained with only 0.1k H800 GPU hours
  - Total training cost: $0.1k
  - Hardware: 16 H800 GPUs
  - Configuration: batch size 32, 500 training iterations, 1e-5 learning rate (summarized in the sketch after this list)
  - Note: efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!
- **Complete Open-Source Release:**
  - Full codebase
  - Detailed architecture specifications
  - Comprehensive training methodology
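
For reference, the reported setup can be summarized as a simple configuration sketch. The field names below are illustrative placeholders, not the actual arguments of the Pusa training toolkit:

```python
# Illustrative summary of the reported training setup; field names are
# hypothetical, not the training toolkit's real flags.
training_config = {
    "base_model": "Mochi1-Preview",
    "hardware": "16x NVIDIA H800",
    "global_batch_size": 32,
    "training_iterations": 500,
    "learning_rate": 1e-5,
    "approx_gpu_hours": 100,  # "0.1k H800 GPU hours" / 16 GPUs ≈ 6.25 hours wall-clock
}
print(training_config)
```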

## 🔍 Unique Architecture

- **Novel Diffusion Paradigm:** Implements frame-level noise control with vectorized timesteps, originally introduced in the FVDM paper, enabling unprecedented flexibility and scalability (see the sketch after this list).
- **Non-destructive Modification:** Our adaptations preserve the base model's original Text-to-Video generation capabilities; only light fine-tuning is needed afterward.
- **Universal Applicability:** The methodology can readily be applied to other leading video diffusion models, including Hunyuan Video, Wan2.1, and others. Collaborations are enthusiastically welcomed!
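
As a rough illustration of why one architecture covers many tasks, different generation modes reduce to different per-frame noise-level patterns. The sketch below is conceptual and hypothetical, not Pusa's actual sampling code:

```python
import torch

NOISY, CLEAN = 0.9, 0.05  # illustrative per-frame noise levels

def timestep_pattern(task: str, num_frames: int = 8) -> torch.Tensor:
    """Return an illustrative per-frame timestep vector for a given task."""
    t = torch.full((num_frames,), NOISY)
    if task == "image-to-video":
        t[0] = CLEAN                  # condition on the first frame
    elif task == "frame-interpolation":
        t[0] = CLEAN                  # condition on the first and last frames
        t[-1] = CLEAN
    elif task == "video-extension":
        t[: num_frames // 2] = CLEAN  # condition on an existing clip, generate the rest
    # "text-to-video": every frame starts fully noisy
    return t

for task in ["text-to-video", "image-to-video", "frame-interpolation", "video-extension"]:
    print(f"{task:20s} {timestep_pattern(task).tolist()}")
```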

## Installation and Usage

### Download Weights

Option 1: Use the Hugging Face CLI:

```bash
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```

Option 2: Download directly from Hugging Face to your local machine.
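
You can also fetch the weights programmatically with `huggingface_hub`; a small sketch (the local directory is just an example path):

```python
from huggingface_hub import snapshot_download

# Download the full Pusa-V0.5 repository to a local directory of your choice.
local_dir = snapshot_download(repo_id="RaphaelLiu/Pusa-V0.5", local_dir="./Pusa-V0.5")
print(f"Weights downloaded to {local_dir}")
```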

### Sample Usage: Image-to-Video Generation

The following Python example shows how to perform image-to-video generation with the Pusa model.

First, ensure you have the necessary libraries installed:

```bash
pip install torch transformers diffusers pillow imageio numpy torchvision
pip install uv  # For installing genmo models
# Then, navigate to a directory where you want to clone the genmo models
git clone https://github.com/genmoai/models
cd models
uv venv .venv
source .venv/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation
# If you want flash attention:
# uv pip install -e .[flash] --no-build-isolation
```

Now you can run the following Python script. Remember to replace the placeholder paths (e.g., "/path/to/Pusa-V0.5") with the actual local path where you downloaded the model weights.

```python
import os

import torch
from PIL import Image
from safetensors.torch import load_file

from genmo.mochi_preview.pipelines import MochiPipeline

# Load the pipeline.
# The identifier below can be the Hugging Face repo id or the local directory
# where you downloaded Pusa-V0.5.
pipeline = MochiPipeline.from_pretrained(
    "RaphaelLiu/Pusa-V0.5",  # or "/path/to/Pusa-V0.5"
    torch_dtype=torch.float16,
)
pipeline.to("cuda")  # Move the pipeline to the GPU

# Load the additional DiT weights for Pusa.
# Point this at the pusa_v0_dit.safetensors file inside your downloaded directory.
dit_weights_path = "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors"  # adjust to your local path
pipeline.transformer.load_state_dict(load_file(dit_weights_path), strict=False)

# Example parameters for generation
prompt = "The camera remains still, the man is surfing on a wave with his surfboard."

# Load the conditioning image; if it is missing, create a dummy image so the
# script still runs. In a real scenario, point this at your own .jpg file.
image_path = "./demos/example.jpg"
try:
    image = Image.open(image_path).convert("RGB")  # assumes running from the Pusa-VidGen root
except FileNotFoundError:
    print("Example image not found. Creating a dummy image for demonstration.")
    os.makedirs("./demos", exist_ok=True)
    image = Image.new("RGB", (512, 512), color="red")
    image.save(image_path)

cond_position = 0        # index of the frame to condition on (0 = first frame)
num_steps = 30           # number of denoising steps
noise_multiplier = 0.4   # noise level applied to the conditioning frame

# Preprocess the image using the pipeline's feature extractor.
image_tensor = pipeline.feature_extractor.preprocess(image, return_tensors="pt").pixel_values
image_tensor = image_tensor.to(pipeline.device, pipeline.dtype)

# Generate the video. `image`, `cond_position`, and `noise_multiplier` are
# Pusa-specific conditioning arguments; the remaining arguments follow the
# standard pipeline interface.
video_frames = pipeline(
    prompt=prompt,
    image=image_tensor,
    cond_position=cond_position,
    num_inference_steps=num_steps,
    noise_multiplier=noise_multiplier,
    generator=torch.Generator(device=pipeline.device).manual_seed(0),
).frames[0]

# Save the video frames as a GIF (requires imageio and Pillow);
# frames are assumed to be numpy arrays here.
import imageio

output_gif_path = "output_video.gif"
imageio.mimsave(output_gif_path, [Image.fromarray(f) for f in video_frames], fps=10)
print(f"Video saved to {output_gif_path}")
```

## Limitations

Pusa currently has several known limitations:

- The base Mochi model generates videos at relatively low resolution (480p)
- We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
- We welcome community contributions to enhance model performance and extend its capabilities

## Related Work

- **FVDM**: Introduces the frame-level noise control with vectorized timesteps that inspired Pusa.
- **Mochi**: Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.

## Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}

@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```