---
license: mit
tags:
- image-to-video
- pytorch
pipeline_tag: image-to-video
library_name: diffusers
---
# RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
## Abstract
Recent advancements in camera-trajectory-guided image-to-video generation offer higher precision and better support for complex camera control than text-based approaches. However, they also introduce significant usability challenges, as users often struggle to provide precise camera parameters for arbitrary real-world images without knowledge of their depth or scene scale. To address these real-world application issues, we propose RealCam-I2V, a novel diffusion-based video generation framework that integrates monocular metric depth estimation to establish a 3D scene reconstruction in a preprocessing step. During training, the reconstructed 3D scene enables scaling camera parameters from relative to metric scale, ensuring compatibility and scale consistency across diverse real-world images. At inference, RealCam-I2V offers an intuitive interface where users can precisely draw camera trajectories by dragging within the 3D scene. To further enhance camera controllability and scene consistency, we propose scene-constrained noise shaping, which shapes high-level noise while allowing the framework to maintain dynamic and coherent video generation in lower-noise stages. RealCam-I2V achieves significant improvements in controllability and video quality on RealEstate10K and on out-of-domain images, and it further enables applications such as camera-controlled looping video generation and generative frame interpolation.
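As a rough illustration of the preprocessing idea, the sketch below unprojects a monocular metric depth map into a camera-frame point cloud (the 3D scene users drag trajectories in) and rescales a relative-scale camera trajectory into metric units. This is a conceptual sketch only; the helper names are ours, and the actual depth model (Metric3D), intrinsics handling, and scale alignment are defined by the released code.

```python
import numpy as np

def unproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Unproject a metric depth map (H, W) into a camera-frame point cloud
    (H*W, 3) using pinhole intrinsics K (3x3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def rescale_trajectory(w2c: np.ndarray, scale: float) -> np.ndarray:
    """Rescale the translation component of world-to-camera extrinsics
    (N, 4, 4) from a relative scale to metric units."""
    out = w2c.copy()
    out[:, :3, 3] *= scale
    return out
```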
## Gallery
## News
- 25/07/05: Release inference code and checkpoints of RealCam-I2V. We are still actively sanitizing the code; more updates to the code and checkpoints will follow soon, please stay tuned!
- 25/06/26: RealCam-I2V is accepted to ICCV 2025!
- 25/05/18: Release training code of RealCam-I2V on CogVideoX 1.5.
- 25/03/26: Release our dataset RealCam-Vid v1 for metric-scale, camera-controlled video generation!
- 25/02/18: Initial commit of the project; we plan to release our DiT-based real-camera I2V models (e.g., CogVideoX) in this repo.
## Environment
### Quick Start
```bash
apt install libgl1-mesa-glx libgl1-mesa-dri xvfb  # for Ubuntu
yum install -y mesa-libGL mesa-dri-drivers Xvfb   # for CentOS

conda install ffmpeg=7 -c conda-forge
pip install -r requirements.txt
```
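On a headless server, the Xvfb packages above provide a virtual display for off-screen rendering; wrapping the Gradio demo (introduced below) with `xvfb-run` is one common way to use it:

```bash
# Run the demo under a virtual framebuffer on a machine with no display
xvfb-run -a python gradio_app.py
```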
## Inference
### Download Pretrained Models
Download the pretrained weights of CogVideoX1.5-5B-I2V, Metric3D, and Qwen2.5-VL, and put them under the `pretrained` folder.
### Download Model Checkpoints
Download our RealCam-I2V weights and put them under the `checkpoints` folder. Please edit `demo/models.json` if you use a custom model path.
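One possible way to fetch everything with the Hugging Face CLI is sketched below; the backbone repository IDs and target folder names are assumptions, so match them to whatever the code actually expects (Metric3D weights are distributed from its own repository and are omitted here):

```bash
# Backbones (repo IDs are examples; verify against the project README)
huggingface-cli download THUDM/CogVideoX1.5-5B-I2V --local-dir pretrained/CogVideoX1.5-5B-I2V
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct --local-dir pretrained/Qwen2.5-VL-7B-Instruct

# RealCam-I2V checkpoints
huggingface-cli download MuteApo/RealCam-I2V --local-dir checkpoints/RealCam-I2V
```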
### Run Gradio Demo
```bash
python gradio_app.py
```
### Inference Code Example

The snippet below is a simplified, illustrative example; the Gradio demo above is the primary supported entry point, and the detailed camera-trajectory specification is documented in the original repo.
```python
import numpy as np
import torch
from imageio import mimsave
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load RealCam-I2V model and processor
model_path = "MuteApo/RealCam-I2V"  # or your local path to the checkpoint
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # use torch.float32 for full precision
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Move model to GPU
model.to("cuda")

# Prepare inputs
input_image_path = "./path/to/your/image.jpg"  # replace with your image path
input_image = Image.open(input_image_path).convert("RGB")

# Example camera trajectory (adjust as needed for your desired motion).
# This is a simplified example; full camera control involves more parameters.
# See the project page or original repo for detailed camera trajectory specifications.
# Here, a simple stationary camera over 16 frames serves as an illustration.
camera_trajectory = {
    "center": [(0, 0, 0) for _ in range(16)],   # (x, y, z) camera position
    "look_at": [(0, 0, 1) for _ in range(16)],  # (x, y, z) point the camera looks at
    "up": [(0, 1, 0) for _ in range(16)],       # (x, y, z) up vector
    "fovy": [45.0 for _ in range(16)],          # vertical field of view in degrees
}

# Process inputs
inputs = processor(
    images=input_image,
    camera_trajectory=camera_trajectory,
    return_tensors="pt",
)
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate video
with torch.no_grad():
    video_frames = model.generate(**inputs, num_inference_steps=50).cpu().numpy()

# Save video frames as a GIF or MP4.
output_video_path = "./output_video.gif"  # or .mp4

# Assuming video_frames has shape [T, C, H, W] with values in [0, 1]:
# convert to [T, H, W, C] and scale to [0, 255] for saving.
video_frames = (video_frames * 255).astype(np.uint8).transpose(0, 2, 3, 1)

mimsave(output_video_path, list(video_frames), fps=8)  # adjust fps as needed
print(f"Video saved to {output_video_path}")
```
## Training
### Prepare Dataset
Please access RealCam-Vid and download our dataset for training RealCam-I2V-CogVideoX-1.5, then unzip all contents into the `data` folder.
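A possible way to fetch and unpack the dataset is shown below; the dataset repository ID and archive layout are assumptions, so adapt them to how RealCam-Vid is actually distributed:

```bash
# Download the dataset repo (ID is an assumption) and unzip archives into data/
huggingface-cli download MuteApo/RealCam-Vid --repo-type dataset --local-dir data
for f in data/*.zip; do unzip -o "$f" -d data; done
```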
### Launch
Edit the example training script `accelerate_train.sh` if necessary and launch training with:

```bash
bash accelerate_train.sh
```
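The script presumably configures multi-GPU training through `accelerate`; for a quick run on a subset of GPUs, restricting visible devices before launching usually suffices (this relies on the standard CUDA environment variable, not on any script-specific flag):

```bash
# Limit training to GPUs 0-3 on a single node
CUDA_VISIBLE_DEVICES=0,1,2,3 bash accelerate_train.sh
```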
## Related Repo
- Our dataset, the first open-sourced one combining diverse scene dynamics with metric-scale camera trajectories, is available at RealCam-Vid.
- Our previous work, CamI2V.
- We borrowed a lot of code from the original CogVideoX repository.
## Citation
If you find this work useful, please consider citing our papers:
```bibtex
@article{li2025realcam,
  title={RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control},
  author={Li, Teng and Zheng, Guangcong and Jiang, Rui and Zhan, Shuigen and Wu, Tao and Lu, Yehao and Lin, Yining and Li, Xi},
  journal={arXiv preprint arXiv:2502.10059},
  year={2025}
}

@article{zheng2024cami2v,
  title={CamI2V: Camera-Controlled Image-to-Video Diffusion Model},
  author={Zheng, Guangcong and Li, Teng and Jiang, Rui and Lu, Yehao and Wu, Tao and Li, Xi},
  journal={arXiv preprint arXiv:2410.15957},
  year={2024}
}
```
