---
license: mit
tags:
- image-to-video
- pytorch
pipeline_tag: image-to-video
library_name: diffusers
---
# RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
## Abstract
Recent advancements in camera-trajectory-guided image-to-video generation offer higher precision and better support for complex camera control than text-based approaches. However, they also introduce significant usability challenges, as users often struggle to provide precise camera parameters for arbitrary real-world images without knowledge of their depth or scene scale. To address these real-world application issues, we propose RealCam-I2V, a novel diffusion-based video generation framework that integrates monocular metric depth estimation to establish a 3D scene reconstruction in a preprocessing step. During training, the reconstructed 3D scene enables scaling camera parameters from relative to metric scale, ensuring compatibility and scale consistency across diverse real-world images. At inference, RealCam-I2V offers an intuitive interface where users can precisely draw camera trajectories by dragging within the 3D scene. To further enhance camera controllability and scene consistency, we propose scene-constrained noise shaping, which shapes high-level noise while allowing the framework to maintain dynamic and coherent video generation in lower-noise stages. RealCam-I2V achieves significant improvements in controllability and video quality on RealEstate10K and on out-of-domain images, and it further enables applications such as camera-controlled looping video generation and generative frame interpolation.
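As a rough illustration of the preprocessing idea, the sketch below unprojects a monocular metric depth map into a camera-frame point cloud (the 3D scene users drag trajectories in) and rescales a relative-scale camera trajectory into metric units. This is a conceptual sketch only; the helper names are ours, and the actual depth model (Metric3D), intrinsics handling, and scale alignment are defined by the released code.

```python
import numpy as np

def unproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Unproject a metric depth map (H, W) into a camera-frame point cloud
    (H*W, 3) using pinhole intrinsics K (3x3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def rescale_trajectory(w2c: np.ndarray, scale: float) -> np.ndarray:
    """Rescale the translation component of world-to-camera extrinsics
    (N, 4, 4) from a relative scale to metric units."""
    out = w2c.copy()
    out[:, :3, 3] *= scale
    return out
```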
## Gallery
## News
- 25/07/05: Release inference code and checkpoints of RealCam-I2V. We are still actively sanitizing the code; more updates to the code and checkpoints will follow soon, please stay tuned!
- 25/06/26: RealCam-I2V is accepted to ICCV 2025!
- 25/05/18: Release training code of RealCam-I2V on CogVideoX 1.5.
- 25/03/26: Release our dataset RealCam-Vid v1 for metric-scale, camera-controlled video generation!
- 25/02/18: Initial commit of the project; we plan to release our DiT-based real-camera I2V models (e.g., CogVideoX) in this repo.
## Environment
### Quick Start
```bash
apt install libgl1-mesa-glx libgl1-mesa-dri xvfb  # for Ubuntu
yum install -y mesa-libGL mesa-dri-drivers Xvfb   # for CentOS

conda install ffmpeg=7 -c conda-forge
pip install -r requirements.txt
```
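On a headless server, the Xvfb packages above provide a virtual display for off-screen rendering; wrapping the Gradio demo (introduced below) with `xvfb-run` is one common way to use it:

```bash
# Run the demo under a virtual framebuffer on a machine with no display
xvfb-run -a python gradio_app.py
```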
## Inference
### Download Pretrained Models
Download the pretrained weights of CogVideoX1.5-5B-I2V, Metric3D, and Qwen2.5-VL, and put them under the `pretrained` folder.
### Download Model Checkpoints
Download our RealCam-I2V weights and put them under the `checkpoints` folder. Please edit `demo/models.json` if you use a custom model path.
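One possible way to fetch everything with the Hugging Face CLI is sketched below; the backbone repository IDs and target folder names are assumptions, so match them to whatever the code actually expects (Metric3D weights are distributed from its own repository and are omitted here):

```bash
# Backbones (repo IDs are examples; verify against the project README)
huggingface-cli download THUDM/CogVideoX1.5-5B-I2V --local-dir pretrained/CogVideoX1.5-5B-I2V
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct --local-dir pretrained/Qwen2.5-VL-7B-Instruct

# RealCam-I2V checkpoints
huggingface-cli download MuteApo/RealCam-I2V --local-dir checkpoints/RealCam-I2V
```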
### Run Gradio Demo
```bash
python gradio_app.py
```
### Inference Code Example

The snippet below is a simplified, illustrative example; the Gradio demo above is the primary supported entry point, and the detailed camera-trajectory specification is documented in the original repo.
```python
import numpy as np
import torch
from imageio import mimsave
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load RealCam-I2V model and processor
model_path = "MuteApo/RealCam-I2V"  # or your local path to the checkpoint
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # use torch.float32 for full precision
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Move model to GPU
model.to("cuda")

# Prepare inputs
input_image_path = "./path/to/your/image.jpg"  # replace with your image path
input_image = Image.open(input_image_path).convert("RGB")

# Example camera trajectory (adjust as needed for your desired motion).
# This is a simplified example; full camera control involves more parameters.
# See the project page or original repo for detailed camera trajectory specifications.
# Here, a simple stationary camera over 16 frames serves as an illustration.
camera_trajectory = {
    "center": [(0, 0, 0) for _ in range(16)],   # (x, y, z) camera position
    "look_at": [(0, 0, 1) for _ in range(16)],  # (x, y, z) point the camera looks at
    "up": [(0, 1, 0) for _ in range(16)],       # (x, y, z) up vector
    "fovy": [45.0 for _ in range(16)],          # vertical field of view in degrees
}

# Process inputs
inputs = processor(
    images=input_image,
    camera_trajectory=camera_trajectory,
    return_tensors="pt",
)
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate video
with torch.no_grad():
    video_frames = model.generate(**inputs, num_inference_steps=50).cpu().numpy()

# Save video frames as a GIF or MP4.
output_video_path = "./output_video.gif"  # or .mp4

# Assuming video_frames has shape [T, C, H, W] with values in [0, 1]:
# convert to [T, H, W, C] and scale to [0, 255] for saving.
video_frames = (video_frames * 255).astype(np.uint8).transpose(0, 2, 3, 1)

mimsave(output_video_path, list(video_frames), fps=8)  # adjust fps as needed
print(f"Video saved to {output_video_path}")
```
## Training
### Prepare Dataset
Please access RealCam-Vid and download our dataset for training RealCam-I2V-CogVideoX-1.5, then unzip all contents into the `data` folder.
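A possible way to fetch and unpack the dataset is shown below; the dataset repository ID and archive layout are assumptions, so adapt them to how RealCam-Vid is actually distributed:

```bash
# Download the dataset repo (ID is an assumption) and unzip archives into data/
huggingface-cli download MuteApo/RealCam-Vid --repo-type dataset --local-dir data
for f in data/*.zip; do unzip -o "$f" -d data; done
```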
### Launch
Edit the example training script `accelerate_train.sh` if necessary and launch training with:

```bash
bash accelerate_train.sh
```
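The script presumably configures multi-GPU training through `accelerate`; for a quick run on a subset of GPUs, restricting visible devices before launching usually suffices (this relies on the standard CUDA environment variable, not on any script-specific flag):

```bash
# Limit training to GPUs 0-3 on a single node
CUDA_VISIBLE_DEVICES=0,1,2,3 bash accelerate_train.sh
```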
## Related Repo
- Our dataset, the first open-sourced one combining diverse scene dynamics with metric-scale camera trajectories, is available at RealCam-Vid.
- Our previous work, CamI2V.
- We borrowed a lot of code from the original CogVideoX repository.
## Citation
If you find this work useful, please consider citing our papers:
```bibtex
@article{li2025realcam,
  title={RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control},
  author={Li, Teng and Zheng, Guangcong and Jiang, Rui and Zhan, Shuigen and Wu, Tao and Lu, Yehao and Lin, Yining and Li, Xi},
  journal={arXiv preprint arXiv:2502.10059},
  year={2025}
}

@article{zheng2024cami2v,
  title={CamI2V: Camera-Controlled Image-to-Video Diffusion Model},
  author={Zheng, Guangcong and Li, Teng and Jiang, Rui and Lu, Yehao and Wu, Tao and Li, Xi},
  journal={arXiv preprint arXiv:2410.15957},
  year={2024}
}
```
