---
pipeline_tag: image-to-video
library_name: diffusers
license: apache-2.0
tags:
  - text-to-video
  - diffusion-models
  - video-generation
---

# Pusa VidGen

Code Repository | Project Page | Model Hub | Training Toolkit | Dataset | Pusa Paper | FVDM Paper | Follow on X | Xiaohongshu

## Overview

Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control, departing from the conventional approach in which all frames share a single noise level. This shift was first presented in our FVDM paper. Leveraging this architecture, Pusa seamlessly supports diverse video generation tasks (Text/Image/Video-to-Video) while maintaining exceptional motion fidelity and prompt adherence with our refined base model adaptations. Pusa-V0.5 is an early preview based on Mochi1-Preview. We are open-sourcing this work to foster community collaboration, enhance methodologies, and expand capabilities.
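
To make the core idea concrete, here is a minimal, self-contained sketch of frame-level noise control: instead of one scalar timestep shared by all frames, each frame gets its own entry in a timestep vector. This is an illustration only, not the actual Pusa/Mochi noising code; the tensor shapes and interpolation formula are simplified assumptions.

```python
import torch

# Toy latent video: (batch, frames, channels, height, width)
B, F, C, H, W = 1, 8, 4, 16, 16
latents = torch.randn(B, F, C, H, W)
noise = torch.randn_like(latents)

# Conventional video diffusion: one scalar timestep shared by every frame.
t_shared = torch.full((F,), 0.8)

# Frame-level (vectorized) timesteps: each frame has its own noise level.
# Here frame 0 (e.g. a conditioning image) is kept almost clean.
t_frame = torch.full((F,), 0.8)
t_frame[0] = 0.05

# Simplified flow-matching-style interpolation, applied per frame.
t = t_frame.view(1, F, 1, 1, 1)
noisy_latents = (1.0 - t) * latents + t * noise
print(noisy_latents.shape)  # torch.Size([1, 8, 4, 16, 16])
```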

## ✨ Key Features

- **Comprehensive Multi-task Support:**
  - Text-to-Video generation
  - Image-to-Video transformation
  - Frame interpolation
  - Video transitions
  - Seamless looping
  - Extended video generation
  - And more...
- **Unprecedented Efficiency:**
  - Trained with only 0.1k H800 GPU hours
  - Total training cost: $0.1k
  - Hardware: 16 H800 GPUs
  - Configuration: batch size 32, 500 training iterations, 1e-5 learning rate (summarized in the sketch after this list)
  - Note: efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!
- **Complete Open-Source Release:**
  - Full codebase
  - Detailed architecture specifications
  - Comprehensive training methodology
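
For reference, the reported setup can be summarized as a simple configuration sketch. The field names below are illustrative placeholders, not the actual arguments of the Pusa training toolkit:

```python
# Illustrative summary of the reported training setup; field names are
# hypothetical, not the training toolkit's real flags.
training_config = {
    "base_model": "Mochi1-Preview",
    "hardware": "16x NVIDIA H800",
    "global_batch_size": 32,
    "training_iterations": 500,
    "learning_rate": 1e-5,
    "approx_gpu_hours": 100,  # "0.1k H800 GPU hours" / 16 GPUs ≈ 6.25 hours wall-clock
}
print(training_config)
```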

## 🔍 Unique Architecture

- **Novel Diffusion Paradigm:** Implements frame-level noise control with vectorized timesteps, originally introduced in the FVDM paper, enabling unprecedented flexibility and scalability (see the sketch after this list).
- **Non-destructive Modification:** Our adaptations preserve the base model's original Text-to-Video generation capabilities; only light fine-tuning is needed afterward.
- **Universal Applicability:** The methodology can readily be applied to other leading video diffusion models, including Hunyuan Video, Wan2.1, and others. Collaborations are enthusiastically welcomed!
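
As a rough illustration of why one architecture covers many tasks, different generation modes reduce to different per-frame noise-level patterns. The sketch below is conceptual and hypothetical, not Pusa's actual sampling code:

```python
import torch

NOISY, CLEAN = 0.9, 0.05  # illustrative per-frame noise levels

def timestep_pattern(task: str, num_frames: int = 8) -> torch.Tensor:
    """Return an illustrative per-frame timestep vector for a given task."""
    t = torch.full((num_frames,), NOISY)
    if task == "image-to-video":
        t[0] = CLEAN                  # condition on the first frame
    elif task == "frame-interpolation":
        t[0] = CLEAN                  # condition on the first and last frames
        t[-1] = CLEAN
    elif task == "video-extension":
        t[: num_frames // 2] = CLEAN  # condition on an existing clip, generate the rest
    # "text-to-video": every frame starts fully noisy
    return t

for task in ["text-to-video", "image-to-video", "frame-interpolation", "video-extension"]:
    print(f"{task:20s} {timestep_pattern(task).tolist()}")
```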

## Installation and Usage

### Download Weights

Option 1: Use the Hugging Face CLI:

```bash
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```

Option 2: Download directly from Hugging Face to your local machine.
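
You can also fetch the weights programmatically with `huggingface_hub`; a small sketch (the local directory is just an example path):

```python
from huggingface_hub import snapshot_download

# Download the full Pusa-V0.5 repository to a local directory of your choice.
local_dir = snapshot_download(repo_id="RaphaelLiu/Pusa-V0.5", local_dir="./Pusa-V0.5")
print(f"Weights downloaded to {local_dir}")
```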

### Sample Usage: Image-to-Video Generation

The following Python example shows how to perform image-to-video generation with the Pusa model.

First, ensure you have the necessary libraries installed:

```bash
pip install torch transformers diffusers pillow imageio numpy torchvision
pip install uv  # For installing genmo models
# Then, navigate to a directory where you want to clone the genmo models
git clone https://github.com/genmoai/models
cd models
uv venv .venv
source .venv/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation
# If you want flash attention:
# uv pip install -e .[flash] --no-build-isolation
```

Now you can run the following Python script. Remember to replace the placeholder paths (e.g., "/path/to/Pusa-V0.5") with the actual local path where you downloaded the model weights.

```python
import os

import torch
from PIL import Image
from safetensors.torch import load_file

from genmo.mochi_preview.pipelines import MochiPipeline

# Load the pipeline.
# The identifier below can be the Hugging Face repo id or the local directory
# where you downloaded Pusa-V0.5.
pipeline = MochiPipeline.from_pretrained(
    "RaphaelLiu/Pusa-V0.5",  # or "/path/to/Pusa-V0.5"
    torch_dtype=torch.float16,
)
pipeline.to("cuda")  # Move the pipeline to the GPU

# Load the additional DiT weights for Pusa.
# Point this at the pusa_v0_dit.safetensors file inside your downloaded directory.
dit_weights_path = "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors"  # adjust to your local path
pipeline.transformer.load_state_dict(load_file(dit_weights_path), strict=False)

# Example parameters for generation
prompt = "The camera remains still, the man is surfing on a wave with his surfboard."

# Load the conditioning image; if it is missing, create a dummy image so the
# script still runs. In a real scenario, point this at your own .jpg file.
image_path = "./demos/example.jpg"
try:
    image = Image.open(image_path).convert("RGB")  # assumes running from the Pusa-VidGen root
except FileNotFoundError:
    print("Example image not found. Creating a dummy image for demonstration.")
    os.makedirs("./demos", exist_ok=True)
    image = Image.new("RGB", (512, 512), color="red")
    image.save(image_path)

cond_position = 0        # index of the frame to condition on (0 = first frame)
num_steps = 30           # number of denoising steps
noise_multiplier = 0.4   # noise level applied to the conditioning frame

# Preprocess the image using the pipeline's feature extractor.
image_tensor = pipeline.feature_extractor.preprocess(image, return_tensors="pt").pixel_values
image_tensor = image_tensor.to(pipeline.device, pipeline.dtype)

# Generate the video. `image`, `cond_position`, and `noise_multiplier` are
# Pusa-specific conditioning arguments; the remaining arguments follow the
# standard pipeline interface.
video_frames = pipeline(
    prompt=prompt,
    image=image_tensor,
    cond_position=cond_position,
    num_inference_steps=num_steps,
    noise_multiplier=noise_multiplier,
    generator=torch.Generator(device=pipeline.device).manual_seed(0),
).frames[0]

# Save the video frames as a GIF (requires imageio and Pillow);
# frames are assumed to be numpy arrays here.
import imageio

output_gif_path = "output_video.gif"
imageio.mimsave(output_gif_path, [Image.fromarray(f) for f in video_frames], fps=10)
print(f"Video saved to {output_gif_path}")
```

## Limitations

Pusa currently has several known limitations:

- The base Mochi model generates videos at relatively low resolution (480p)
- We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
- We welcome community contributions to enhance model performance and extend its capabilities

## Related Work

- **FVDM**: Introduces the frame-level noise control with vectorized timesteps that inspired Pusa.
- **Mochi**: Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.

## Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}

@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```