Update model card for RealCam-I2V

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +112 -121
README.md CHANGED
@@ -3,168 +3,159 @@ license: mit
3
  tags:
4
  - image-to-video
5
  - pytorch
 
 
6
  ---
7
- # CamI2V: Camera-Controlled Image-to-Video Diffusion Model
 
8
 
9
  <div align="center">
10
- <a href="https://arxiv.org/abs/2410.15957">
11
- <img src="https://img.shields.io/static/v1?label=arXiv&message=2410.15957&color=b21d1a" style="display: inline-block; vertical-align: middle;">
12
- </a>
13
- <a href="https://zgctroy.github.io/CamI2V">
14
- <img src="https://img.shields.io/static/v1?label=Project&message=Page&color=green" style="display: inline-block; vertical-align: middle;">
15
- </a>
16
- <a href="https://huggingface.co/MuteApo/CamI2V/tree/main">
17
- <img src="https://img.shields.io/static/v1?label=HuggingFace&message=Checkpoints&color=blue" style="display: inline-block; vertical-align: middle;">
18
- </a>
19
- </div>
20
 
21
  ## ๐ŸŽฅ Gallery
22
 
23
- <table>
24
- <tr>
25
- <td align="center">
26
- rightward rotation and zoom in<br>(CFG=4, FS=6, step=50, ratio=0.6, scale=0.1)
27
- </td>
28
- <td align="center">
29
- leftward rotation and zoom in<br>(CFG=4, FS=6, step=50, ratio=0.6, scale=0.1)
30
- </td>
31
- </tr>
32
- <tr>
33
- <td align="center">
34
- <img src="https://github.com/user-attachments/assets/74a764f4-0631-4fbe-94b9-af51057f99a5" width="75%">
35
- </td>
36
- <td align="center">
37
- <img src="https://github.com/user-attachments/assets/99309759-8355-4ee1-95c4-897f01c46720" width="75%">
38
- </td>
39
- </tr>
40
- <tr>
41
- <td align="center">
42
- zoom in and upward movement<br>(CFG=4, FS=6, step=50, ratio=0.8, scale=0.2)
43
- </td>
44
- <td align="center">
45
- downward movement and zoom-out<br>(CFG=4, FS=6, step=50, ratio=0.8, scale=0.2)
46
- </td>
47
- </tr>
48
- <tr>
49
- <td align="center">
50
- <img src="https://github.com/user-attachments/assets/aef4cc2e-fd7e-46db-82bc-a7e59aab5963" width="75%">
51
- </td>
52
- <td align="center">
53
- <img src="https://github.com/user-attachments/assets/f204992a-d729-492c-a663-85f9b80680f5" width="75%">
54
- </td>
55
- </tr>
56
- </table>
57
-
58
- ## ๐ŸŒŸ News and Todo List
59
-
60
- - ๐Ÿ”ฅ 25/03/17: Upload test metadata used in our paper to make easier evaluation.
61
- - ๐Ÿ”ฅ 25/02/15: Release demo of [RealCam-I2V](https://zgctroy.github.io/RealCam-I2V/) for real-world applications, code will be available at [repo](https://github.com/ZGCTroy/RealCam-I2V).
62
- - ๐Ÿ”ฅ 25/01/12: Release checkpoint of [CamI2V (512x320, 100k)](https://huggingface.co/MuteApo/CamI2V/blob/main/512_cami2v_100k.pt). We plan to release a more advanced model with longer training soon.
63
- - ๐Ÿ”ฅ 25/01/02: Release checkpoint of [CamI2V (512x320, 50k)](https://huggingface.co/MuteApo/CamI2V/blob/main/512_cami2v_50k.pt), which is suitable for research propose and comparison.
64
- - ๐Ÿ”ฅ 24/12/24: Integrate [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL) in gradio demo, you can now caption your own input image by this powerful VLM.
65
- - ๐Ÿ”ฅ 24/12/23: Release checkpoint of [CamI2V (256x256, 50k)](https://huggingface.co/MuteApo/CamI2V/blob/main/256_cami2v.pt).
66
- - ๐Ÿ”ฅ 24/12/16: Release reproduced non-official checkpoints of [MotionCtrl (256x256, 50k)](https://huggingface.co/MuteApo/CamI2V/blob/main/256_motionctrl.pt) and [CameraCtrl (256x256, 50k)](https://huggingface.co/MuteApo/CamI2V/blob/main/256_cameractrl.pt) on [DynamiCrafter](https://github.com/Doubiiu/DynamiCrafter).
67
- - ๐Ÿ”ฅ 24/12/09: Release training configs and scripts.
68
- - ๐Ÿ”ฅ 24/12/06: Release [dataset pre-process code](datasets) for RealEstate10K.
69
- - ๐Ÿ”ฅ 24/12/02: Release [evaluation code](evaluation) for RotErr, TransErr, CamMC and FVD.
70
- - ๐ŸŒฑ 24/11/16: Release model code of CamI2V for training and inference, including implementation for MotionCtrl and CameraCtrl.
71
-
72
- ## ๐Ÿ“ˆ Performance
73
-
74
- Measured under 256x256 resolution, 50k training steps, 25 DDIM steps, text-image CFG 7.5, camera CFG 1.0 (no camera CFG).
75
-
76
- | Method | RotErrโ†“ | TransErrโ†“ | CamMCโ†“ | FVDโ†“<br>(VideoGPT) | FVDโ†“<br>(StyleGAN) |
77
- | :------------ | :--------: | :--------: | :--------: | :----------------: | :----------------: |
78
- | DynamiCrafter | 3.3415 | 9.8024 | 11.625 | 106.02 | 92.196 |
79
- | MotionCtrl | 0.8636 | 2.5068 | 2.9536 | 70.820 | 60.363 |
80
- | CameraCtrl | 0.7064 | 1.9379 | 2.3070 | 66.713 | 57.644 |
81
- | CamI2V | **0.4120** | **1.3409** | **1.5291** | **62.439** | **53.361** |
82
-
83
- ### Inference Speed and GPU Memory
84
-
85
- | Method | # Parameters | GPU Memory | Generation Time<br>(RTX 3090) |
86
- | :------------ | :----------: | :--------: | :---------------------------: |
87
- | DynamiCrafter | 1.4 B | 11.14 GiB | 8.14 s |
88
- | MotionCtrl | + 63.4 M | 11.18 GiB | 8.27 s |
89
- | CameraCtrl | + 211 M | 11.56 GiB | 8.38 s |
90
- | CamI2V | + 261 M | 11.67 GiB | 10.3 s |
91
 
92
  ## โš™๏ธ Environment
93
 
94
  ### Quick Start
95
 
96
  ```shell
97
- conda create -n cami2v python=3.10
98
- conda activate cami2v
99
-
100
- conda install -y pytorch==2.4.1 torchvision==0.19.1 pytorch-cuda=12.1 -c pytorch -c nvidia
101
- conda install -y xformers -c xformers
102
  pip install -r requirements.txt
103
  ```
104
 
105
  ## ๐Ÿ’ซ Inference
106
 
107
- ### Download Model Checkpoints
108
-
109
- | Model | Resolution | Training Steps |
110
- | :--------- | :--------: | :--------------------------------------------------------------------------------------------------------------------------------------------------: |
111
- | CamI2V | 512x320 | [50k](https://huggingface.co/MuteApo/CamI2V/blob/main/512_cami2v_50k.pt), [100k](https://huggingface.co/MuteApo/CamI2V/blob/main/512_cami2v_100k.pt) |
112
- | CamI2V | 256x256 | [50k](https://huggingface.co/MuteApo/CamI2V/blob/main/256_cami2v.pt) |
113
- | CameraCtrl | 256x256 | [50k](https://huggingface.co/MuteApo/CamI2V/blob/main/256_cameractrl.pt) |
114
- | MotionCtrl | 256x256 | [50k](https://huggingface.co/MuteApo/CamI2V/blob/main/256_motionctrl.pt) |
115
-
116
- Currently we release 256x256 checkpoints with 50k training steps of DynamiCrafter-based CamI2V, CameraCtrl and MotionCtrl, which is suitable for research propose and comparison.
117
 
118
- We also release 512x320 checkpoints of our CamI2V with longer training, make possible higher resolution and more advanced camera-controlled video generation.
119
 
120
- Download above checkpoints and put under `ckpts` folder.
121
- Please edit `ckpt_path` in `configs/models.json` if you have a different model path.
122
-
123
- ### Download Qwen2-VL Captioner (Optional)
124
 
125
- Not required but recommend.
126
- It is used to caption your custom image in gradio demo for video generaion.
127
- We prefer the [AWQ](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-AWQ) quantized version of Qwen2-VL due to speed and GPU memory.
128
 
129
- Download the pre-trained model and put under `pretrained_models` folder:
130
 
131
  ```shell
132
- โ”€โ”ฌโ”€ pretrained_models/
133
- โ””โ”€โ”€โ”€ Qwen2-VL-7B-Instruct-AWQ/
134
  ```
135
 
136
- ### Run Gradio Demo
137
 
138
- ```shell
139
- python cami2v_gradio_app.py --use_qwenvl_captioner
140
  ```
141
 
142
- Gradio may struggle to establish network connection, please re-try with `--use_host_ip`.
143
 
144
- ## ๐Ÿค— Related Repo
145
 
146
- [RealCam-I2V: https://github.com/ZGCTroy/RealCam-I2V](https://github.com/ZGCTroy/RealCam-I2V)
147
 
148
- [CameraCtrl: https://github.com/hehao13/CameraCtrl](https://github.com/hehao13/CameraCtrl)
 
 
149
 
150
- [MotionCtrl: https://github.com/TencentARC/MotionCtrl](https://github.com/TencentARC/MotionCtrl)
151
 
152
- [DynamiCrafter: https://github.com/Doubiiu/DynamiCrafter](https://github.com/Doubiiu/DynamiCrafter)
 
 
153
 
154
  ## ๐Ÿ—’๏ธ Citation
155
 
156
- ```
157
- @article{zheng2024cami2v,
158
- title={CamI2V: Camera-Controlled Image-to-Video Diffusion Model},
159
- author={Zheng, Guangcong and Li, Teng and Jiang, Rui and Lu, Yehao and Wu, Tao and Li, Xi},
160
- journal={arXiv preprint arXiv:2410.15957},
161
- year={2024}
162
- }
163
 
 
164
  @article{li2025realcam,
165
  title={RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control},
166
  author={Li, Teng and Zheng, Guangcong and Jiang, Rui and Zhan, Shuigen and Wu, Tao and Lu, Yehao and Lin, Yining and Li, Xi},
167
  journal={arXiv preprint arXiv:2502.10059},
168
  year={2025},
169
  }
170
  ```
 
3
  tags:
4
  - image-to-video
5
  - pytorch
6
+ pipeline_tag: image-to-video
7
+ library_name: diffusers
8
  ---
9
+
10
+ # RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
11
 
12
  <div align="center">
13
+ <a href="https://huggingface.co/papers/2502.10059"><img src="https://img.shields.io/static/v1?label=arXiv&message=2502.10059&color=b21b1b"></a>
14
+ <a href="https://zgctroy.github.io/RealCam-I2V"><img src="https://img.shields.io/static/v1?label=Project&message=Page&color=green"></a>
15
+ <a href="https://github.com/ZGCTroy/RealCam-I2V"><img src="https://img.shields.io/static/v1?label=GitHub&message=Code&color=blue"></a>
16
+ <a href="https://huggingface.co/MuteApo/RealCam-I2V"><img src="https://img.shields.io/static/v1?label=HuggingFace&message=Model&color=orange"></a>
17
+ </div>
18
+
19
+ ## Abstract
20
+ Recent advancements in camera-trajectory-guided image-to-video generation offer higher precision and better support for complex camera control compared to text-based approaches. However, they also introduce significant usability challenges, as users often struggle to provide precise camera parameters when working with arbitrary real-world images without knowledge of their depth or scene scale. To address these real-world application issues, we propose RealCam-I2V, a novel diffusion-based video generation framework that integrates monocular metric depth estimation to establish 3D scene reconstruction in a preprocessing step. During training, the reconstructed 3D scene enables scaling camera parameters from relative to metric scales, ensuring compatibility and scale consistency across diverse real-world images. At inference, RealCam-I2V offers an intuitive interface where users can precisely draw camera trajectories by dragging within the 3D scene. To further enhance precise camera control and scene consistency, we propose scene-constrained noise shaping, which shapes high-level noise while allowing the framework to maintain dynamic and coherent video generation in lower noise stages. RealCam-I2V achieves significant improvements in controllability and video quality on RealEstate10K and out-of-domain images, and further enables applications like camera-controlled looping video generation and generative frame interpolation.
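+
+ To make the relative-to-metric rescaling described above concrete, below is a minimal sketch (hypothetical helper, not the repository's implementation) that aligns a relative-scale camera trajectory to metric scale with a single global factor derived from a monocular metric depth map:
+
+ ```python
+ # Sketch only: assumes a relative-scale reconstruction and a metric depth map
+ # of the reference frame; the actual pipeline is defined in the official repo.
+ import numpy as np
+
+ def rescale_trajectory_to_metric(c2w_poses: np.ndarray,
+                                  relative_depth: np.ndarray,
+                                  metric_depth: np.ndarray) -> np.ndarray:
+     """c2w_poses: (N, 4, 4) camera-to-world matrices with relative-scale translations."""
+     # A single global factor (median ratio, robust to outliers) maps the
+     # relative reconstruction into metric units.
+     scale = np.median(metric_depth / np.clip(relative_depth, 1e-6, None))
+     poses = c2w_poses.copy()
+     poses[:, :3, 3] *= scale  # rescale only the translation component
+     return poses
+ ```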
 
 
21
 
22
  ## ๐ŸŽฅ Gallery
23
+ <div align="center">
24
+ <a href='https://zgctroy.github.io/RealCam-I2V'><img src="https://zgctroy.github.io/RealCam-I2V/assets/demo.gif" alt="RealCam-I2V Demo GIF" style="width: 100%; max-width: 650px;"></a>
25
+ </div>
26
 
27
+ ## ๐ŸŒŸ News
28
+ - **25/07/05**: Release inference code and checkpoints of RealCam-I2V. We are still actively sanitizing the code; more updates to the code and checkpoints will follow soon, so please stay tuned!
29
+ - **25/06/26**: RealCam-I2V is accepted by ICCV 2025! ๐ŸŽ‰๐ŸŽ‰
30
+ - **25/05/18**: Release training code of RealCam-I2V on CogVideoX 1.5.
31
+ - **25/03/26**: Release our dataset [RealCam-Vid](https://huggingface.co/datasets/MuteApo/RealCam-Vid) v1 for metric-scale camera-controlled video generation!
32
+ - **25/02/18**: Initial commit of the project; we plan to release our DiT-based real-camera I2V models (e.g., based on CogVideoX) in this repo.
33
 
34
  ## โš™๏ธ Environment
35
 
36
  ### Quick Start
37
 
38
  ```shell
39
+ apt install libgl1-mesa-glx libgl1-mesa-dri xvfb  # for Ubuntu
40
+ yum install -y mesa-libGL mesa-dri-drivers Xvfb  # for CentOS
41
+ conda install ffmpeg=7 -c conda-forge
 
 
42
  pip install -r requirements.txt
43
  ```
44
 
45
  ## ๐Ÿ’ซ Inference
46
 
47
+ ### Download Pretrained Models
48
 
49
+ Download the pretrained weights of [CogVideoX1.5-5B-I2V](https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V), [Metric3D](https://huggingface.co/JUGGHM/Metric3D) and [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), and put them under the `pretrained` folder.
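+
+ If convenient, these can be fetched with `huggingface_hub` (a hedged sketch; the exact subfolder names expected under `pretrained` are an assumption, so match them to the repo's configs):
+
+ ```python
+ # Sketch: download the base models into ./pretrained (subfolder names assumed).
+ from huggingface_hub import snapshot_download
+
+ for repo_id in ["THUDM/CogVideoX1.5-5B-I2V", "JUGGHM/Metric3D", "Qwen/Qwen2.5-VL-7B-Instruct"]:
+     snapshot_download(repo_id, local_dir=f"pretrained/{repo_id.split('/')[-1]}")
+ ```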
50
 
51
+ ### Download Model Checkpoints
52
 
53
+ Download our [RealCam-I2V](https://huggingface.co/MuteApo/RealCam-I2V) weights and put them under the `checkpoints` folder.
54
+ Please edit `demo/models.json` if you have a custom model path.
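+
+ The same pattern works for our weights (a sketch; the target subdirectory is an assumption, just keep `demo/models.json` pointed at wherever you place them):
+
+ ```python
+ # Sketch: fetch the RealCam-I2V weights into ./checkpoints (directory name assumed).
+ from huggingface_hub import snapshot_download
+
+ snapshot_download("MuteApo/RealCam-I2V", local_dir="checkpoints/RealCam-I2V")
+ ```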
 
55
 
56
+ ### Run Gradio Demo
57
 
58
  ```shell
59
+ python gradio_app.py
 
60
  ```
61
 
62
+ ### Inference Code Example
63
+
64
+ ```python
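+ # NOTE: this snippet is illustrative; the exact loading/generation API is defined
+ # by the custom code shipped with the checkpoint (see gradio_app.py in the official repo).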
65
+ from transformers import AutoModel, AutoProcessor
66
+ from PIL import Image
67
+ import torch
68
+ import cv2
69
+ import numpy as np
70
+
71
+ # Load RealCam-I2V model and processor
72
+ model_path = "MuteApo/RealCam-I2V" # Or your local path to the checkpoint
73
+ model = AutoModel.from_pretrained(
74
+     model_path,
75
+     torch_dtype=torch.float16,  # Use torch.float32 for full precision
76
+     trust_remote_code=True,
77
+ )
78
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
79
+
80
+ # Move model to GPU
81
+ model.to("cuda")
82
+
83
+ # Prepare inputs
84
+ input_image_path = "./path/to/your/image.jpg" # Replace with your image path
85
+ input_image = Image.open(input_image_path).convert("RGB")
86
+
87
+ # Example camera trajectory (adjust as needed for your desired motion)
88
+ # This is a simplified example; full camera control involves more parameters
89
+ # See project page or original repo for detailed camera trajectory specifications.
90
+ # Here, a simple stationary camera for 16 frames as an illustration.
91
+ camera_trajectory = {
92
+     "center": [(0, 0, 0) for _ in range(16)],   # (x, y, z) camera position per frame
93
+     "look_at": [(0, 0, 1) for _ in range(16)],  # (x, y, z) point the camera looks at
94
+     "up": [(0, 1, 0) for _ in range(16)],       # (x, y, z) up vector
95
+     "fovy": [45.0 for _ in range(16)],          # field of view in degrees
96
+ }
97
 
98
+ # Process inputs
99
+ inputs = processor(
100
+     images=input_image,
101
+     camera_trajectory=camera_trajectory,
102
+     return_tensors="pt"
103
+ )
104
+ inputs = {k: v.to("cuda") for k, v in inputs.items()}
105
+
106
+ # Generate video
107
+ with torch.no_grad():
108
+     video_frames = model.generate(**inputs, num_inference_steps=50).cpu().numpy()
109
+
110
+ # Save video frames as a GIF or MP4
111
+ output_video_path = "./output_video.gif" # or .mp4
112
+ # Assuming video_frames has shape [T, C, H, W] with values in [0, 1]
113
+ # Convert to [T, H, W, C] and scale to [0, 255] for saving
114
+ video_frames = (video_frames * 255).astype(np.uint8).transpose(0, 2, 3, 1)
115
+
116
+ # Example to save as GIF using imageio
117
+ from imageio import mimsave
118
+ mimsave(output_video_path, video_frames, fps=8)  # adjust fps as needed; newer imageio versions expect duration (ms per frame) instead of fps
119
+
120
+ print(f"Video saved to {output_video_path}")
121
  ```
122
 
123
+ ## ๐Ÿš€ Training
124
 
125
+ ### Prepare Dataset
126
+
127
+ Please visit [RealCam-Vid](https://github.com/ZGCTroy/RealCam-Vid) and download our dataset for training RealCam-I2V-CogVideoX-1.5, then unzip all of its contents into the `data` folder.
128
+
129
+ ### Launch
130
 
131
+ Edit the example training script `accelerate_train.sh` if necessary and launch training with:
132
 
133
+ ```shell
134
+ bash accelerate_train.sh
135
+ ```
136
 
137
+ ## ๐Ÿค— Related Repo
138
 
139
+ - Our dataset, the first open-sourced dataset to combine diverse scene dynamics with metric-scale camera trajectories, is available at [RealCam-Vid](https://github.com/ZGCTroy/RealCam-Vid).
140
+ - Our previous work, [CamI2V](https://github.com/ZGCTroy/CamI2V).
141
+ - Much of our code is adapted from the original [CogVideoX](https://github.com/THUDM/CogVideo) repository.
142
 
143
  ## ๐Ÿ—’๏ธ Citation
144
 
145
+ If you find this work useful, please consider citing our papers:
146
 
147
+ ```bibtex
148
  @article{li2025realcam,
149
  title={RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control},
150
  author={Li, Teng and Zheng, Guangcong and Jiang, Rui and Zhan, Shuigen and Wu, Tao and Lu, Yehao and Lin, Yining and Li, Xi},
151
  journal={arXiv preprint arXiv:2502.10059},
152
  year={2025},
153
  }
154
+
155
+ @article{zheng2024cami2v,
156
+ title={CamI2V: Camera-Controlled Image-to-Video Diffusion Model},
157
+ author={Zheng, Guangcong and Li, Teng and Jiang, Rui and Lu, Yehao and Wu, Tao and Li, Xi},
158
+ journal={arXiv preprint arXiv:2410.15957},
159
+ year={2024}
160
+ }
161
  ```