EvanEternal committed · verified
Commit 4555486 · Parent(s): e88d449

Upload 8 files
LICENSE ADDED
MIT License

Copyright (c) 2025 Yixuan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README-copy.md ADDED
# ReCamMaster: Camera-Controlled Generative Rendering from A Single Video (ICCV'25 Oral, Best Paper Finalist)

<div align="center">
<div align="center" style="margin-top: 0px; margin-bottom: 0px;">
<img src="https://github.com/user-attachments/assets/81ccf80e-f4b6-4a3d-b47a-e9c2ce14e34f" width="30%"/>
</div>

### [<a href="https://arxiv.org/abs/2503.11647" target="_blank">arXiv</a>] [<a href="https://jianhongbai.github.io/ReCamMaster/" target="_blank">Project Page</a>] [<a href="https://huggingface.co/datasets/KwaiVGI/MultiCamVideo-Dataset" target="_blank">Dataset</a>]
_**[Jianhong Bai<sup>1*</sup>](https://jianhongbai.github.io/), [Menghan Xia<sup>2†</sup>](https://menghanxia.github.io/), [Xiao Fu<sup>3</sup>](https://fuxiao0719.github.io/), [Xintao Wang<sup>2</sup>](https://xinntao.github.io/), [Lianrui Mu<sup>1</sup>](https://scholar.google.com/citations?user=dCik-2YAAAAJ&hl=en), [Jinwen Cao<sup>2</sup>](https://openreview.net/profile?id=~Jinwen_Cao1), <br>[Zuozhu Liu<sup>1</sup>](https://person.zju.edu.cn/en/lzz), [Haoji Hu<sup>1†</sup>](https://person.zju.edu.cn/en/huhaoji), [Xiang Bai<sup>4</sup>](https://scholar.google.com/citations?user=UeltiQ4AAAAJ&hl=en), [Pengfei Wan<sup>2</sup>](https://scholar.google.com/citations?user=P6MraaYAAAAJ&hl=en), [Di Zhang<sup>2</sup>](https://openreview.net/profile?id=~Di_ZHANG3)**_
<br>
(*Work done during an internship at KwaiVGI, Kuaishou Technology; †corresponding authors)

<sup>1</sup>Zhejiang University, <sup>2</sup>Kuaishou Technology, <sup>3</sup>CUHK, <sup>4</sup>HUST.

</div>

**Important Note:** This open-source repository is intended to provide a reference implementation. Due to differences in the underlying T2V model's performance, the open-source version may not match the performance of the model in our paper. If you'd like to use the best version of ReCamMaster, please upload your video to [this link](https://docs.google.com/forms/d/e/1FAIpQLSezOzGPbm8JMXQDq6EINiDf6iXn7rV4ozj6KcbQCSAzE8Vsnw/viewform?usp=dialog). Additionally, we are working on an online trial website. Please stay tuned for updates on the [Kling website](https://app.klingai.com/global/).

## 🔥 Updates
- __[2025.04.15]__: Please feel free to explore our related work, [SynCamMaster](https://github.com/KwaiVGI/SynCamMaster).
- __[2025.04.09]__: Release the [training and inference code](https://github.com/KwaiVGI/ReCamMaster?tab=readme-ov-file#%EF%B8%8F-code-recammaster--wan21-inference--training) and the [model checkpoint](https://huggingface.co/KwaiVGI/ReCamMaster-Wan2.1/blob/main/step20000.ckpt).
- __[2025.03.31]__: Release the [MultiCamVideo Dataset](https://huggingface.co/datasets/KwaiVGI/MultiCamVideo-Dataset).
- __[2025.03.31]__: We have sent the inference results to the first 1000 trial users.
- __[2025.03.17]__: Release the [project page](https://jianhongbai.github.io/ReCamMaster/) and the [try-out link](https://docs.google.com/forms/d/e/1FAIpQLSezOzGPbm8JMXQDq6EINiDf6iXn7rV4ozj6KcbQCSAzE8Vsnw/viewform?usp=dialog).

## 📖 Introduction

**TL;DR:** We propose ReCamMaster to re-capture in-the-wild videos with novel camera trajectories, achieved through our proposed simple-and-effective video conditioning scheme. We also release a multi-camera synchronized video [dataset](https://huggingface.co/datasets/KwaiVGI/MultiCamVideo-Dataset) rendered with Unreal Engine 5.

https://github.com/user-attachments/assets/52455e86-1adb-458d-bc37-4540a65a60d4

## 🚀 Trial: Try ReCamMaster with Your Own Videos

**Update:** We are actively processing the videos uploaded by users. So far, we have sent the inference results to the email addresses of the first **1500** testers. You should receive an email titled "Inference Results of ReCamMaster" from either jianhongbai@zju.edu.cn or cpurgicn@gmail.com. Please also check your spam folder, and let us know if you haven't received the email after a long wait. If you enjoyed the videos we created, please consider giving us a star 🌟.

**You can try out ReCamMaster by uploading your own video to [this link](https://docs.google.com/forms/d/e/1FAIpQLSezOzGPbm8JMXQDq6EINiDf6iXn7rV4ozj6KcbQCSAzE8Vsnw/viewform?usp=dialog), which will generate a video with camera movements along a new trajectory.** We will send the mp4 file generated by ReCamMaster to your inbox as soon as possible. For camera movement, we offer 10 basic camera trajectories:

| Index | Basic Trajectory |
|-------------------|-----------------------------|
| 1 | Pan Right |
| 2 | Pan Left |
| 3 | Tilt Up |
| 4 | Tilt Down |
| 5 | Zoom In |
| 6 | Zoom Out |
| 7 | Translate Up (with rotation) |
| 8 | Translate Down (with rotation) |
| 9 | Arc Left (with rotation) |
| 10 | Arc Right (with rotation) |

If you would like to use ReCamMaster as a baseline and need qualitative or quantitative comparisons, feel free to drop an email to [jianhongbai@zju.edu.cn](mailto:jianhongbai@zju.edu.cn). We can assist you with batch inference of our model.

## ⚙️ Code: ReCamMaster + Wan2.1 (Inference & Training)
The model utilized in our paper is an internally developed T2V model, not [Wan2.1](https://github.com/Wan-Video/Wan2.1). Due to company policy restrictions, we are unable to open-source the model used in the paper. Consequently, we migrated ReCamMaster to Wan2.1 to validate the effectiveness of our method. Due to differences in the underlying T2V model, you may not achieve the same results as demonstrated in the demo.
### Inference
Step 1: Set up the environment

[DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) requires Rust and Cargo to compile extensions. You can install them using the following command:
```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
```

Install [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio):
```shell
git clone https://github.com/KwaiVGI/ReCamMaster.git
cd ReCamMaster
pip install -e .
```

Step 2: Download the pretrained checkpoints
1. Download the pre-trained Wan2.1 models

```shell
cd ReCamMaster
python download_wan2.1.py
```
2. Download the pre-trained ReCamMaster checkpoint

Please download it from [huggingface](https://huggingface.co/KwaiVGI/ReCamMaster-Wan2.1/blob/main/step20000.ckpt) and place it in ```models/ReCamMaster/checkpoints```.

Step 3: Test the example videos
```shell
python inference_recammaster.py --cam_type 1
```

Step 4: Test your own videos

If you want to test your own videos, you need to prepare your test data following the structure of the ```example_test_data``` folder. This includes N mp4 videos, each with at least 81 frames, and a ```metadata.csv``` file that stores their paths and corresponding captions. You can refer to the [Prompt Extension section](https://github.com/Wan-Video/Wan2.1?tab=readme-ov-file#2-using-prompt-extension) in Wan2.1 for guidance on preparing video captions.

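A metadata file in this layout can be assembled with a short script. Note the column names `video_path` and `caption` below are assumptions based on the description above; match them to the ones actually used in ```example_test_data```.

```python
import csv
import pathlib

def build_metadata(video_dir, out_csv="metadata.csv"):
    """Collect all mp4 files under video_dir and write a caption sheet.

    Column names are assumptions -- mirror the ones in example_test_data.
    """
    rows = [{"video_path": str(p), "caption": "TODO: describe this clip"}
            for p in sorted(pathlib.Path(video_dir).glob("*.mp4"))]
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["video_path", "caption"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Fill in the placeholder captions by hand or with a captioning model before running inference.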
```shell
python inference_recammaster.py --cam_type 1 --dataset_path path/to/your/data
```

We provide several preset camera types, as shown in the table below. Additionally, you can generate new trajectories for testing.

| cam_type | Trajectory |
|-------------------|-----------------------------|
| 1 | Pan Right |
| 2 | Pan Left |
| 3 | Tilt Up |
| 4 | Tilt Down |
| 5 | Zoom In |
| 6 | Zoom Out |
| 7 | Translate Up (with rotation) |
| 8 | Translate Down (with rotation) |
| 9 | Arc Left (with rotation) |
| 10 | Arc Right (with rotation) |

### Training

Step 1: Set up the environment

```shell
pip install lightning pandas websockets
```

Step 2: Prepare the training dataset

1. Download the [MultiCamVideo dataset](https://huggingface.co/datasets/KwaiVGI/MultiCamVideo-Dataset).

2. Extract VAE features

```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py --task data_process --dataset_path path/to/the/MultiCamVideo/Dataset --output_path ./models --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" --tiled --num_frames 81 --height 480 --width 832 --dataloader_num_workers 2
```

3. Generate Captions for Each Video

You can use video captioning tools like [LLaVA](https://github.com/haotian-liu/LLaVA) to generate captions for each video and store them in the ```metadata.csv``` file.

Step 3: Training
```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py --task train --dataset_path recam_train_data --output_path ./models/train --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" --steps_per_epoch 8000 --max_epochs 100 --learning_rate 1e-4 --accumulate_grad_batches 1 --use_gradient_checkpointing --dataloader_num_workers 4
```
We did not explore the optimal set of hyper-parameters and train with a batch size of 1 on each GPU. You may achieve better model performance by adjusting hyper-parameters such as the learning rate and by increasing the batch size.

Step 4: Test the model

```shell
python inference_recammaster.py --cam_type 1 --ckpt_path path/to/the/checkpoint
```

## 📷 Dataset: MultiCamVideo Dataset
### 1. Dataset Introduction

**TL;DR:** The MultiCamVideo Dataset is a multi-camera synchronized video dataset rendered using Unreal Engine 5. It includes synchronized multi-camera videos and their corresponding camera trajectories. The MultiCamVideo Dataset can be valuable in fields such as camera-controlled video generation, synchronized video production, and 3D/4D reconstruction. If you are looking for synchronized videos captured with stationary cameras, please explore our [SynCamVideo Dataset](https://github.com/KwaiVGI/SynCamMaster).

https://github.com/user-attachments/assets/6fa25bcf-1136-43be-8110-b527638874d4

The dataset consists of 13.6K different dynamic scenes, each captured by 10 cameras, resulting in a total of 136K videos. Each dynamic scene is composed of four elements: {3D environment, character, animation, camera}. Specifically, we use an animation to drive a character and position the animated character within a 3D environment. Time-synchronized cameras are then set up to move along predefined trajectories to render the multi-camera video data.
<p align="center">
  <img src="https://github.com/user-attachments/assets/107c9607-e99b-4493-b715-3e194fcb3933" alt="Example Image" width="70%">
</p>

**3D Environment:** We collect 37 high-quality 3D environment assets from [Fab](https://www.fab.com). To minimize the domain gap between rendered data and real-world videos, we primarily select visually realistic 3D scenes, supplemented with a few stylized or surreal ones. To ensure data diversity, the selected scenes cover a variety of indoor and outdoor settings, such as city streets, shopping malls, cafes, office rooms, and the countryside.

**Character:** We collect 66 different human 3D models as characters from [Fab](https://www.fab.com) and [Mixamo](https://www.mixamo.com).

**Animation:** We collect 93 different animations from [Fab](https://www.fab.com) and [Mixamo](https://www.mixamo.com), including common actions such as waving, dancing, and cheering. We use these animations to drive the collected characters and create diverse data through various combinations.

**Camera:** To ensure camera movements are diverse and closely resemble real-world distributions, we create a wide range of camera trajectories and parameters. We achieve this by designing rules to batch-generate random camera starting positions and movement trajectories:

1. Camera Starting Position.

We take the character's position as the center of a hemisphere whose radius is chosen from {3m, 5m, 7m, 10m} according to the size of the 3D scene, and randomly sample the camera's starting point within it, ensuring that the distance to the character is greater than 0.5m and the pitch angle is within 45 degrees.

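The sampling rule above can be sketched as follows. This is an illustrative reimplementation (rejection sampling around the character at the origin), not the script used to build the dataset.

```python
import math
import random

def sample_camera_start(radius, min_dist=0.5, max_pitch_deg=45.0):
    """Rejection-sample a camera start point around the character (origin).

    Keeps points inside a hemisphere of the given radius whose distance to
    the character exceeds min_dist and whose pitch angle (elevation above
    the horizontal plane) stays within max_pitch_deg.
    """
    while True:
        x = random.uniform(-radius, radius)
        y = random.uniform(-radius, radius)
        z = random.uniform(0.0, radius)          # upper hemisphere only
        dist = math.sqrt(x * x + y * y + z * z)
        if not (min_dist < dist <= radius):
            continue
        pitch = math.degrees(math.asin(z / dist))
        if pitch <= max_pitch_deg:
            return (x, y, z)
```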
2. Camera Trajectories.

- **Pan & Tilt**:
The camera rotates in place, with pan angles randomly selected from 5 to 45 degrees and tilt angles from 5 to 30 degrees; the direction is chosen randomly (left/right for pan, up/down for tilt).

- **Basic Translation**:
The camera translates along the positive or negative direction of the x, y, or z axis, with the movement distance randomly selected within $[\frac{1}{4}, 1] \times \text{distance2character}$.

- **Basic Arc Trajectory**:
The camera moves along an arc, with the rotation angle randomly selected from 15 to 75 degrees.

- **Random Trajectories**:
We sample 1-3 points in space and move the camera from its initial position through these points, with the total movement distance randomly selected within $[\frac{1}{4}, 1] \times \text{distance2character}$. The resulting polyline is smoothed to make the movement more natural.

- **Static Camera**:
The camera neither translates nor rotates during shooting, maintaining a fixed position.

3. Camera Movement Speed.

To further enhance the diversity of trajectories, 50% of the training data uses constant-speed camera trajectories, while the other 50% uses variable-speed trajectories generated by nonlinear functions. Consider a camera trajectory with a total of $f$ frames, starting at location $L_{start}$ and ending at location $L_{end}$. The location at the $i$-th frame is given by:
```math
L_i = L_{start} + (L_{end} - L_{start}) \cdot \left( \frac{1 - \exp(-a \cdot i/f)}{1 - \exp(-a)} \right),
```
where $a$ is an adjustable parameter controlling the trajectory speed. When $a > 0$, the trajectory starts fast and then slows down; when $a < 0$, it starts slow and then speeds up. The larger the absolute value of $a$, the more drastic the change.

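As a sanity check, the easing formula can be evaluated directly. The helper below is a minimal sketch for a scalar coordinate; note the closed form is undefined at $a = 0$, which corresponds to the separate constant-speed case.

```python
import math

def frame_location(i, f, a, l_start, l_end):
    """Variable-speed interpolation between l_start and l_end at frame i of f.

    a > 0: fast start, slow finish; a < 0: slow start, fast finish.
    The formula is undefined at a == 0, so fall back to constant speed there.
    """
    if a == 0:
        return l_start + (l_end - l_start) * (i / f)   # constant-speed limit
    progress = (1 - math.exp(-a * i / f)) / (1 - math.exp(-a))
    return l_start + (l_end - l_start) * progress
```

At $i = 0$ the camera sits at $L_{start}$ and at $i = f$ it reaches $L_{end}$; with $a = 4$, for example, more than 85% of the distance is already covered by the midpoint frame.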
4. Camera Parameters.

We choose four sets of camera parameters: {focal=18mm, aperture=10}, {focal=24mm, aperture=5}, {focal=35mm, aperture=2.4}, and {focal=50mm, aperture=2.4}.

### 2. Statistics and Configurations

Dataset Statistics:

| Number of Dynamic Scenes | Cameras per Scene | Total Videos |
|:------------------------:|:----------------:|:------------:|
| 13,600 | 10 | 136,000 |

Video Configurations:

| Resolution | Frame Number | FPS |
|:-----------:|:------------:|:------------------------:|
| 1280x1280 | 81 | 15 |

Note: You can use a center crop to adjust the video's aspect ratio to fit your video generation model, such as 16:9, 9:16, 4:3, or 3:4.

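A center crop from the square 1280x1280 frames reduces to a small box computation. The helper below is an illustrative sketch that returns a PIL-style (left, top, right, bottom) box for the largest centered crop of a target aspect ratio.

```python
def center_crop_box(width, height, aspect_w, aspect_h):
    """Largest centered crop of a width x height frame matching aspect_w:aspect_h."""
    if width * aspect_h > height * aspect_w:      # frame too wide: trim the sides
        new_w = height * aspect_w // aspect_h
        new_h = height
    else:                                         # frame too tall: trim top/bottom
        new_w = width
        new_h = width * aspect_h // aspect_w
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return (left, top, left + new_w, top + new_h)
```

For a 1280x1280 frame, a 16:9 target keeps the full width and crops the height to 720, i.e. the box (0, 280, 1280, 1000).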
Camera Configurations:

| Focal Length | Aperture | Sensor Height | Sensor Width |
|:-----------------------:|:------------------:|:-------------:|:------------:|
| 18mm, 24mm, 35mm, 50mm | 10.0, 5.0, 2.4 | 23.76mm | 23.76mm |

### 3. File Structure
```
MultiCamVideo-Dataset
├── train
│   ├── f18_aperture10
│   │   ├── scene1                          # one dynamic scene
│   │   │   ├── videos
│   │   │   │   ├── cam01.mp4               # synchronized 81-frame videos at 1280x1280 resolution
│   │   │   │   ├── cam02.mp4
│   │   │   │   ├── ...
│   │   │   │   └── cam10.mp4
│   │   │   └── cameras
│   │   │       └── camera_extrinsics.json  # 81-frame camera extrinsics of the 10 cameras
│   │   ├── ...
│   │   └── scene3400
│   ├── f24_aperture5
│   │   ├── scene1
│   │   ├── ...
│   │   └── scene3400
│   ├── f35_aperture2.4
│   │   ├── scene1
│   │   ├── ...
│   │   └── scene3400
│   └── f50_aperture2.4
│       ├── scene1
│       ├── ...
│       └── scene3400
└── val
    └── 10basic_trajectories
        ├── videos
        │   ├── cam01.mp4                   # example videos corresponding to the validation cameras
        │   ├── cam02.mp4
        │   ├── ...
        │   └── cam10.mp4
        └── cameras
            └── camera_extrinsics.json      # 10 different trajectories for validation
```

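A loader for ```camera_extrinsics.json``` might look like the sketch below. The schema assumed here (frame keys mapping camera names to flat 16-element row-major 4x4 matrices) is a guess for illustration only; inspect one file from the dataset for the actual key layout and adapt the parsing accordingly.

```python
import json

def load_extrinsics(path):
    """Parse a camera_extrinsics.json file into {frame: {cam: 4x4 matrix}}.

    The assumed schema (frame -> camera -> flat 16-element row-major matrix)
    is an illustrative guess -- check an actual file from the dataset.
    """
    with open(path) as f:
        raw = json.load(f)
    out = {}
    for frame, cams in raw.items():
        out[frame] = {cam: [vals[r * 4:(r + 1) * 4] for r in range(4)]
                      for cam, vals in cams.items()}
    return out
```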
### 4. Useful Scripts
- Data Extraction
```bash
cat MultiCamVideo-Dataset.part* > MultiCamVideo-Dataset.tar.gz
tar -xzvf MultiCamVideo-Dataset.tar.gz
```
- Camera Visualization
```bash
python vis_cam.py
```

The visualization script is modified from [CameraCtrl](https://github.com/hehao13/CameraCtrl/blob/main/tools/visualize_trajectory.py); thanks to the authors for their inspiring work.

<p align="center">
  <img src="https://github.com/user-attachments/assets/f9cf342d-2fb3-40ef-a7be-edafb5775004" alt="Example Image" width="40%">
</p>

## 🤗 Awesome Related Works
Feel free to explore these outstanding related works, including but not limited to:

[GCD](https://gcd.cs.columbia.edu/): GCD synthesizes large-angle novel viewpoints of 4D dynamic scenes from a monocular video.

[ReCapture](https://generative-video-camera-controls.github.io/): a method for generating new videos with novel camera trajectories from a single user-provided video.

[Trajectory Attention](https://xizaoqu.github.io/trajattn/): Trajectory Attention facilitates tasks such as camera motion control on images and videos, and video editing.

[GS-DiT](https://wkbian.github.io/Projects/GS-DiT/): GS-DiT provides 4D video control for a single monocular video.

[Diffusion as Shader](https://igl-hkust.github.io/das/): a versatile video generation control model for various tasks.

[TrajectoryCrafter](https://trajectorycrafter.github.io/): TrajectoryCrafter achieves high-fidelity novel view generation from casually captured monocular videos.

[GEN3C](https://research.nvidia.com/labs/toronto-ai/GEN3C/): a generative video model with precise camera control and temporal 3D consistency.

## 🌟 Citation

Please leave us a star 🌟 and cite our paper if you find our work helpful.
```
@article{bai2025recammaster,
  title={ReCamMaster: Camera-Controlled Generative Rendering from A Single Video},
  author={Bai, Jianhong and Xia, Menghan and Fu, Xiao and Wang, Xintao and Mu, Lianrui and Cao, Jinwen and Liu, Zuozhu and Hu, Haoji and Bai, Xiang and Wan, Pengfei and others},
  journal={arXiv preprint arXiv:2503.11647},
  year={2025}
}
```
README.md CHANGED
# Astra<img src="./assets/images/logo.png" alt="logo" style="height: 1em; vertical-align: baseline; margin: 0 0.1em;">: General Interactive World Model with Autoregressive Denoising

<div align="center">
<div align="center" style="margin-top: 0px; margin-bottom: -30px;">
<img src="./assets/images/logo-text.png" width="30%"/>
</div>

### [<a href="https://arxiv.org/abs/2503.11647" target="_blank">arXiv</a>] [<a href="https://eternalevan.github.io/Astra-project/" target="_blank">Project Page</a>]
**Yixuan Zhu<sup>1</sup>, Jiaqi Feng<sup>1</sup>, Wenzhao Zheng<sup>1†</sup>, Yuan Gao<sup>2</sup>, Xin Tao<sup>2</sup>, Pengfei Wan<sup>2</sup>, Jie Zhou<sup>1</sup>, Jiwen Lu<sup>1</sup>**
<!-- <br> -->
(*Work done during an internship at Kuaishou Technology; †Project leader)

<sup>1</sup>Tsinghua University, <sup>2</sup>Kuaishou Technology.
</div>

## 🔥 Updates
- __[2025.12.09]__: Release the training and inference code and model checkpoint.
- __[2025.11.17]__: Release the [project page](https://eternalevan.github.io/Astra-project/).

## 🎯 TODO List

- [ ] **Release full inference pipelines** for additional scenarios:
  - [ ] 🚗 Autonomous driving
  - [ ] 🤖 Robotic manipulation
  - [ ] 🛸 Drone navigation / exploration

- [ ] **Open-source training scripts**:
  - [ ] ⬆️ Action-conditioned autoregressive denoising training
  - [ ] 🔄 Multi-scenario joint training pipeline

- [ ] **Release dataset preprocessing tools**

- [ ] **Provide unified evaluation toolkit**

## 📖 Introduction

**TL;DR:** Astra is an **interactive world model** that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.

## Gallery

### Astra+Wan2.1

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
<tr>
<td>
<video src="https://github.com/user-attachments/assets/715a5b66-3966-4923-aa00-02315fb07761"
       style="width:100%; height:180px; object-fit:cover;"
       controls autoplay loop muted></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/1451947e-1851-4b57-a666-a44ffea7b10c"
       style="width:100%; height:180px; object-fit:cover;"
       controls autoplay loop muted></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/c7156c4d-d51d-493c-995e-5113c3d49abb"
       style="width:100%; height:180px; object-fit:cover;"
       controls autoplay loop muted></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/f7550916-e224-497a-b0b9-84479607c962"
       style="width:100%; height:180px; object-fit:cover;"
       controls autoplay loop muted></video>
</td>
</tr>

<tr>
<td>
<video src="https://github.com/user-attachments/assets/d899d704-c706-4e64-a24b-eea174d2173d"
       style="width:100%; height:180px; object-fit:cover;"
       controls autoplay loop muted></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/c1d8beb2-3102-468a-8019-624d89fba125"
       style="width:100%; height:180px; object-fit:cover;"
       controls autoplay loop muted></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/2aabc10b-f945-4d9d-b24a-baed17fcfe14"
       style="width:100%; height:180px; object-fit:cover;"
       controls autoplay loop muted></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/5c03e6ae-0fc2-4e09-a5b5-f37d04e7bbf8"
       style="width:100%; height:180px; object-fit:cover;"
       controls autoplay loop muted></video>
</td>
</tr>
</table>

## ⚙️ Code: Astra + Wan2.1 (Inference & Training)
Astra is built upon [Wan2.1-1.3B](https://github.com/Wan-Video/Wan2.1), a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below:

### Inference
Step 1: Set up the environment

[DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) requires Rust and Cargo to compile extensions. You can install them using the following command:
```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
```

Install [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio):
```shell
git clone https://github.com/EternalEvan/Astra.git
cd Astra
pip install -e .
```

Step 2: Download the pretrained checkpoints
1. Download the pre-trained Wan2.1 models

```shell
cd script
python download_wan2.1.py
```
2. Download the pre-trained Astra checkpoint

Please download it from [huggingface](https://huggingface.co/wjque/lyra/blob/main/diffusion_pytorch_model.ckpt) and place it in ```models/Astra/checkpoints```.

Step 3: Test the example videos
```shell
python inference_astra.py --cam_type 1
```

Step 4: Test your own videos

If you want to test your own videos, you need to prepare your test data following the structure of the ```example_test_data``` folder. This includes N mp4 videos, each with at least 81 frames, and a ```metadata.csv``` file that stores their paths and corresponding captions. You can refer to the [Prompt Extension section](https://github.com/Wan-Video/Wan2.1?tab=readme-ov-file#2-using-prompt-extension) in Wan2.1 for guidance on preparing video captions.

```shell
python inference_astra.py --cam_type 1 --dataset_path path/to/your/data
```

We provide several preset camera types, as shown in the table below. Additionally, you can generate new trajectories for testing.

| cam_type | Trajectory |
|-------------------|-----------------------------|
| 1 | Pan Right |
| 2 | Pan Left |
| 3 | Tilt Up |
| 4 | Tilt Down |
| 5 | Zoom In |
| 6 | Zoom Out |
| 7 | Translate Up (with rotation) |
| 8 | Translate Down (with rotation) |
| 9 | Arc Left (with rotation) |
| 10 | Arc Right (with rotation) |

### Training

Step 1: Set up the environment

```shell
pip install lightning pandas websockets
```

Step 2: Prepare the training dataset

1. Download the [MultiCamVideo dataset](https://huggingface.co/datasets/KwaiVGI/MultiCamVideo-Dataset).

2. Extract VAE features

```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py --task data_process --dataset_path path/to/the/MultiCamVideo/Dataset --output_path ./models --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" --tiled --num_frames 81 --height 480 --width 832 --dataloader_num_workers 2
```

3. Generate Captions for Each Video

You can use video captioning tools like [LLaVA](https://github.com/haotian-liu/LLaVA) to generate captions for each video and store them in the ```metadata.csv``` file.

Step 3: Training
```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py --task train --dataset_path recam_train_data --output_path ./models/train --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" --steps_per_epoch 8000 --max_epochs 100 --learning_rate 1e-4 --accumulate_grad_batches 1 --use_gradient_checkpointing --dataloader_num_workers 4
```
We did not explore the optimal set of hyper-parameters and train with a batch size of 1 on each GPU. You may achieve better model performance by adjusting hyper-parameters such as the learning rate and by increasing the batch size.

Step 4: Test the model

```shell
python inference_astra.py --cam_type 1 --ckpt_path path/to/the/checkpoint
```

205
+ <!-- ## 📷 Dataset: MultiCamVideo Dataset
+ ### 1. Dataset Introduction
+
+ **TL;DR:** The MultiCamVideo Dataset is a multi-camera synchronized video dataset rendered with Unreal Engine 5. It includes synchronized multi-camera videos and their corresponding camera trajectories, and can be valuable in fields such as camera-controlled video generation, synchronized video production, and 3D/4D reconstruction.
+
+ https://github.com/user-attachments/assets/6fa25bcf-1136-43be-8110-b527638874d4
+
+ The dataset consists of 13.6K different dynamic scenes, each captured by 10 cameras, resulting in a total of 136K videos. Each dynamic scene is composed of four elements: {3D environment, character, animation, camera}. Specifically, we use an animation to drive a character and position the animated character within a 3D environment. Time-synchronized cameras then move along predefined trajectories to render the multi-camera video data.
+ <p align="center">
+ <img src="https://github.com/user-attachments/assets/107c9607-e99b-4493-b715-3e194fcb3933" alt="Example Image" width="70%">
+ </p>
+
+ **3D Environment:** We collect 37 high-quality 3D environment assets from [Fab](https://www.fab.com). To minimize the domain gap between rendered data and real-world videos, we primarily select visually realistic 3D scenes, with a few stylized or surreal scenes as a supplement. To ensure data diversity, the selected scenes cover a variety of indoor and outdoor settings, such as city streets, shopping malls, cafes, office rooms, and the countryside.
+
+ **Character:** We collect 66 different human 3D models as characters from [Fab](https://www.fab.com) and [Mixamo](https://www.mixamo.com).
+
+ **Animation:** We collect 93 different animations from [Fab](https://www.fab.com) and [Mixamo](https://www.mixamo.com), including common actions such as waving, dancing, and cheering. We use these animations to drive the collected characters and create diverse data through various combinations.
+
+ **Camera:** To ensure that camera movements are diverse and closely resemble real-world distributions, we design rules to batch-generate random camera starting positions and movement trajectories covering a wide range of situations:
+
+ 1. Camera Starting Position.
+
+ We take the character's position as the center of a hemisphere with a radius of {3m, 5m, 7m, 10m}, chosen according to the size of the 3D scene, and randomly sample the camera's starting point within this range, ensuring that the closest distance to the character is greater than 0.5m and the pitch angle is within 45 degrees.
+
+ 2. Camera Trajectories.
+
+ - **Pan & Tilt**:
+ The camera rotation angles are randomly sampled, with pan angles ranging from 5 to 45 degrees and tilt angles ranging from 5 to 30 degrees; the direction is chosen randomly (left/right for pan, up/down for tilt).
+
+ - **Basic Translation**:
+ The camera translates along the positive and negative directions of the xyz axes, with movement distances randomly selected within the range of $[\frac{1}{4}, 1] \times \text{distance2character}$.
+
+ - **Basic Arc Trajectory**:
+ The camera moves along an arc, with rotation angles randomly selected within the range of 15 to 75 degrees.
+
+ - **Random Trajectories**:
+ 1-3 points are sampled in space, and the camera moves from its initial position through these points, with the total movement distance randomly selected within the range of $[\frac{1}{4}, 1] \times \text{distance2character}$. The polyline is smoothed to make the movement more natural.
+
+ - **Static Camera**:
+ The camera neither translates nor rotates during shooting, maintaining a fixed position.
+
+ 3. Camera Movement Speed.
+
+ To further enhance the diversity of trajectories, 50% of the training data uses constant-speed camera trajectories, while the other 50% uses variable-speed trajectories generated by nonlinear functions. Consider a camera trajectory with a total of $f$ frames, starting at position $L_{start}$ and ending at position $L_{end}$. The position at the $i$-th frame is given by:
+ ```math
+ L_i = L_{start} + (L_{end} - L_{start}) \cdot \left( \frac{1 - \exp(-a \cdot i/f)}{1 - \exp(-a)} \right),
+ ```
+ where $a$ is an adjustable parameter to control the trajectory speed. When $a > 0$, the trajectory starts fast and then slows down; when $a < 0$, the trajectory starts slow and then speeds up. The larger the absolute value of $a$, the more drastic the change.
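As a concrete reference, the interpolation above can be sketched in a few lines of Python (scalar positions for brevity; apply per coordinate for a 3D trajectory):

```python
import math

def camera_location(i, f, l_start, l_end, a):
    """Position at frame i (0..f) along a variable-speed trajectory.

    a > 0: fast start then slow down; a < 0: slow start then speed up.
    a == 0 is treated as the constant-speed limit (linear interpolation).
    """
    if abs(a) < 1e-8:  # constant-speed limit of the formula
        ratio = i / f
    else:
        ratio = (1 - math.exp(-a * i / f)) / (1 - math.exp(-a))
    return l_start + (l_end - l_start) * ratio
```

With `a = 5` the camera covers most of the distance in the early frames; with `a = -5` the bulk of the motion happens near the end.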
+
+ 4. Camera Parameters.
+
+ We choose four sets of camera parameters: {focal=18mm, aperture=10}, {focal=24mm, aperture=5}, {focal=35mm, aperture=2.4}, and {focal=50mm, aperture=2.4}.
+
+ ### 2. Statistics and Configurations
+
+ Dataset Statistics:
+
+ | Number of Dynamic Scenes | Cameras per Scene | Total Videos |
+ |:------------------------:|:-----------------:|:------------:|
+ | 13,600 | 10 | 136,000 |
+
+ Video Configurations:
+
+ | Resolution | Frame Number | FPS |
+ |:----------:|:------------:|:---:|
+ | 1280x1280 | 81 | 15 |
+
+ Note: You can use a center crop to adjust the video's aspect ratio to fit your video generation model, e.g. 16:9, 9:16, 4:3, or 3:4.
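For reference, the center-crop arithmetic is a small calculation. This sketch returns pixel coordinates for the largest centered crop of a given aspect ratio, which you can feed to any cropping tool:

```python
def center_crop_box(width, height, target_w, target_h):
    """Return (left, top, right, bottom) of the largest centered crop
    with aspect ratio target_w:target_h inside a width x height frame."""
    if width * target_h > height * target_w:   # frame too wide: trim the sides
        crop_w = height * target_w // target_h
        crop_h = height
    else:                                      # frame too tall: trim top/bottom
        crop_w = width
        crop_h = width * target_h // target_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

# 1280x1280 source cropped to 16:9 keeps the full width:
print(center_crop_box(1280, 1280, 16, 9))  # -> (0, 280, 1280, 1000)
```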
+
+ Camera Configurations:
+
+ | Focal Length | Aperture | Sensor Height | Sensor Width |
+ |:-----------------------:|:---------------:|:-------------:|:------------:|
+ | 18mm, 24mm, 35mm, 50mm | 10.0, 5.0, 2.4 | 23.76mm | 23.76mm |
+
+ ### 3. File Structure
+ ```
+ MultiCamVideo-Dataset
+ ├── train
+ │   ├── f18_aperture10
+ │   │   ├── scene1                          # one dynamic scene
+ │   │   │   ├── videos
+ │   │   │   │   ├── cam01.mp4               # synchronized 81-frame videos at 1280x1280 resolution
+ │   │   │   │   ├── cam02.mp4
+ │   │   │   │   ├── ...
+ │   │   │   │   └── cam10.mp4
+ │   │   │   └── cameras
+ │   │   │       └── camera_extrinsics.json  # 81-frame camera extrinsics of the 10 cameras
+ │   │   ├── ...
+ │   │   └── scene3400
+ │   ├── f24_aperture5
+ │   │   ├── scene1
+ │   │   ├── ...
+ │   │   └── scene3400
+ │   ├── f35_aperture2.4
+ │   │   ├── scene1
+ │   │   ├── ...
+ │   │   └── scene3400
+ │   └── f50_aperture2.4
+ │       ├── scene1
+ │       ├── ...
+ │       └── scene3400
+ └── val
+     └── 10basic_trajectories
+         ├── videos
+         │   ├── cam01.mp4                   # example videos corresponding to the validation cameras
+         │   ├── cam02.mp4
+         │   ├── ...
+         │   └── cam10.mp4
+         └── cameras
+             └── camera_extrinsics.json      # 10 different trajectories for validation
+ ```
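Independent of the exact JSON layout of `camera_extrinsics.json` (check a file for the precise keys), a common need is recovering each camera's world-space position from its extrinsic. Assuming each per-frame entry is a 4x4 world-to-camera matrix $[R|t]$ (an assumption, not a documented guarantee), the camera center is $C = -R^\top t$:

```python
# Sketch: camera world position from a 4x4 world-to-camera extrinsic.
# ASSUMPTION: the extrinsic is [R | t] mapping world -> camera coordinates;
# then the camera center is C = -R^T t. Pure-Python, no dependencies.
def camera_center(extrinsic):
    rot = [row[:3] for row in extrinsic[:3]]   # 3x3 rotation R
    t = [row[3] for row in extrinsic[:3]]      # translation t
    # C = -R^T @ t, computed column by column
    return [-sum(rot[r][c] * t[r] for r in range(3)) for c in range(3)]
```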
+
+ ### 4. Useful Scripts
+ - Data Extraction
+ ```bash
+ cat MultiCamVideo-Dataset.part* > MultiCamVideo-Dataset.tar.gz
+ tar -xzvf MultiCamVideo-Dataset.tar.gz
+ ```
+ - Camera Visualization
+ ```bash
+ python vis_cam.py
+ ```
+
+ The visualization script is modified from [CameraCtrl](https://github.com/hehao13/CameraCtrl/blob/main/tools/visualize_trajectory.py); thanks for their inspiring work.
+
+ <p align="center">
+ <img src="https://github.com/user-attachments/assets/f9cf342d-2fb3-40ef-a7be-edafb5775004" alt="Example Image" width="40%">
+ </p> -->
+
+ ## 🤗 Awesome Related Works
+ Feel free to explore these outstanding related works, including but not limited to:
+
+ [ReCamMaster](https://github.com/KlingTeam/ReCamMaster): ReCamMaster re-captures in-the-wild videos with novel camera trajectories.
+
+ [GCD](https://gcd.cs.columbia.edu/): GCD synthesizes large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
+
+ [ReCapture](https://generative-video-camera-controls.github.io/): a method for generating new videos with novel camera trajectories from a single user-provided video.
+
+ [Trajectory Attention](https://xizaoqu.github.io/trajattn/): Trajectory Attention facilitates tasks such as camera motion control on images and videos, and video editing.
+
+ [GS-DiT](https://wkbian.github.io/Projects/GS-DiT/): GS-DiT provides 4D video control for a single monocular video.
+
+ [Diffusion as Shader](https://igl-hkust.github.io/das/): a versatile video generation control model for various tasks.
+
+ [TrajectoryCrafter](https://trajectorycrafter.github.io/): TrajectoryCrafter achieves high-fidelity novel view generation from casually captured monocular video.
+
+ [GEN3C](https://research.nvidia.com/labs/toronto-ai/GEN3C/): a generative video model with precise camera control and temporal 3D consistency.
+
+ ## 🌟 Citation
+
+ Please leave us a star 🌟 and cite our paper if you find our work helpful.
+ ```
+ @inproceedings{zhu2025astra,
+   title={Astra: General Interactive World Model with Autoregressive Denoising},
+   author={Zhu, Yixuan and Feng, Jiaqi and Zheng, Wenzhao and Gao, Yuan and Tao, Xin and Wan, Pengfei and Zhou, Jie and Lu, Jiwen},
+   booktitle={arXiv},
+   year={2025}
+ }
+ ```
Try ReCamMaster with Your Own Videos Here.txt ADDED
@@ -0,0 +1 @@
+ https://docs.google.com/forms/d/e/1FAIpQLSezOzGPbm8JMXQDq6EINiDf6iXn7rV4ozj6KcbQCSAzE8Vsnw/viewform?usp=dialog
infer.sh ADDED
@@ -0,0 +1,56 @@
+ CUDA_VISIBLE_DEVICES=0 python infer_origin.py \
+ --condition_pth /share_zhuyixuan05/zhuyixuan05/sekai-game-walking/00100100001_0004650_0004950/encoded_video.pth \
+ --output_path /home/zhuyixuan05/ReCamMaster/moe/infer_results/sekai.mp4 \
+ --prompt "A drone flying scene in a game world" \
+ --modality_type sekai
+
+ CUDA_VISIBLE_DEVICES=1 python infer_moe.py \
+ --condition_pth /share_zhuyixuan05/zhuyixuan05/nuscenes_video_generation_dynamic/scenes/scene-0001_CAM_FRONT/encoded_video-480p.pth \
+ --output_path /home/zhuyixuan05/ReCamMaster/moe/infer_results/nuscenes.mp4 \
+ --prompt "A car is driving" \
+ --modality_type nuscenes
+
+ CUDA_VISIBLE_DEVICES=0 python infer_origin.py \
+ --condition_pth /share_zhuyixuan05/zhuyixuan05/spatialvid/a9a6d37f-0a6c-548a-a494-7d902469f3f2_0000000_0000300/encoded_video.pth \
+ --output_path /home/zhuyixuan05/ReCamMaster/moe/infer_results/spatialvid.mp4 \
+ --prompt "A man is entering the room" \
+ --modality_type sekai
+
+ CUDA_VISIBLE_DEVICES=1 python infer_moe.py \
+ --condition_pth /share_zhuyixuan05/zhuyixuan05/openx-fractal-encoded/episode_000001/encoded_video.pth \
+ --output_path /home/zhuyixuan05/ReCamMaster/moe/infer_results/openx.mp4 \
+ --prompt "A robotic arm is moving the object" \
+ --modality_type openx
+
+ CUDA_VISIBLE_DEVICES=1 python infer_origin.py \
+ --condition_pth /share_zhuyixuan05/zhuyixuan05/sekai-game-drone/00500210001_0012150_0012450/encoded_video.pth \
+ --output_path /home/zhuyixuan05/ReCamMaster/moe/infer_results/edit.mp4 \
+ --prompt "A drone flying scene in a game world, and it starts to rain" \
+ --modality_type sekai
+
+ CUDA_VISIBLE_DEVICES=0 python infer_origin.py \
+ --condition_pth /share_zhuyixuan05/zhuyixuan05/spatialvid/0268e6b0-f41e-5c2f-bf6b-936e55dc4a05_0000600_0000900/encoded_video.pth \
+ --output_path /home/zhuyixuan05/ReCamMaster/moe/infer_results/spatialvid.mp4 \
+ --prompt "walking in the city, the weather from day turns to night" \
+ --modality_type sekai \
+ --direction "right" \
+ --initial_condition_frames "1"
+
+ CUDA_VISIBLE_DEVICES=1 python infer_moe.py \
+ --condition_pth /share_zhuyixuan05/zhuyixuan05/nuscenes_video_generation_dynamic/scenes/scene-0001_CAM_FRONT/encoded_video-480p.pth \
+ --output_path /home/zhuyixuan05/ReCamMaster/moe/infer_results/nuscenes.mp4 \
+ --prompt "A car is driving" \
+ --modality_type nuscenes
+
+ CUDA_VISIBLE_DEVICES=1 python infer_moe.py \
+ --condition_pth /share_zhuyixuan05/zhuyixuan05/openx-fractal-encoded/episode_000001/encoded_video.pth \
+ --output_path /home/zhuyixuan05/ReCamMaster/moe/infer_results/openx.mp4 \
+ --prompt "A robotic arm is moving the object" \
+ --modality_type openx
+
+ CUDA_VISIBLE_DEVICES=1 python infer_origin.py \
+ --condition_pth /share_zhuyixuan05/zhuyixuan05/sekai-game-walking/00100100001_0004650_0004950/encoded_video.pth \
+ --output_path /home/zhuyixuan05/ReCamMaster/moe/infer_results/sekai.mp4 \
+ --prompt "A drone flying scene in a game world" \
+ --modality_type sekai
infer_demo.sh ADDED
@@ -0,0 +1,12 @@
+ CUDA_VISIBLE_DEVICES=3 python ./scripts/infer_demo.py \
+ --condition_pth ./examples/condition_pth/garden_1.pth \
+ --start_frame 0 \
+ --initial_condition_frames 1 \
+ --frames_per_generation 8 \
+ --total_frames_to_generate 24 \
+ --dit_path /path/to/your/model.pth \
+ --prompt "A sunlit European street lined with historic buildings and vibrant greenery creates a warm, charming, and inviting atmosphere. The scene shows a picturesque open square paved with red bricks, surrounded by classic narrow townhouses featuring tall windows, gabled roofs, and dark-painted facades. On the right side, a lush arrangement of potted plants and blooming flowers adds rich color and texture to the foreground. A vintage-style streetlamp stands prominently near the center-right, contributing to the timeless character of the street. Mature trees frame the background, their leaves glowing in the warm afternoon sunlight. Bicycles are visible along the edges of the buildings, reinforcing the urban yet leisurely feel. The sky is bright blue with scattered clouds, and soft sun flares enter the frame from the left, enhancing the scene’s inviting, peaceful mood." \
+ --add_icons \
+ --modality_type sekai \
+ --direction forward_left \
+ --output_path /mnt/data/louis_crq/astra2/astra_test/Astra/examples/output_videos/output_moe_framepack_sliding.mp4
pip-list.txt ADDED
@@ -0,0 +1,197 @@
+ (ddrm) zyx@zwl:~$ pip list
+ Package Version Editable project location
+ ------------------------ --------------- ---------------------------------------------------------
+ absl-py 2.0.0
+ accelerate 1.0.1
+ addict 2.4.0
+ aiohttp 3.8.5
+ aiosignal 1.3.1
+ albumentations 1.4.6
+ annotated-types 0.6.0
+ antlr4-python3-runtime 4.9.3
+ appdirs 1.4.4
+ asttokens 2.4.1
+ async-timeout 4.0.3
+ attrs 23.1.0
+ backcall 0.2.0
+ basicsr 1.2.0+1.4.2 /home/zyx/Retinexformer-master
+ beautifulsoup4 4.12.3
+ blessed 1.20.0
+ blobfile 2.0.2
+ cachetools 5.3.1
+ certifi 2023.5.7
+ cffi 1.15.1
+ charset-normalizer 3.2.0
+ click 8.1.7
+ cmake 3.26.4
+ contourpy 1.1.1
+ cycler 0.11.0
+ decorator 5.1.1
+ decord 0.6.0
+ diffusers 0.31.0
+ dlib 19.24.6
+ docker-pycreds 0.4.0
+ einops 0.6.1
+ executing 2.1.0
+ face-alignment 1.4.1
+ facexlib 0.3.0
+ filelock 3.12.2
+ filterpy 1.4.5
+ fire 0.5.0
+ flatbuffers 23.5.26
+ fonttools 4.42.1
+ frozenlist 1.4.0
+ fsspec 2023.9.1
+ ftfy 6.1.1
+ future 0.18.3
+ gdown 5.2.0
+ gfpgan 1.3.8 /home/zyx/anaconda3/envs/ddrm/lib/python3.8/site-packages
+ gitdb 4.0.10
+ GitPython 3.1.37
+ google-auth 2.23.0
+ google-auth-oauthlib 1.0.0
+ gpustat 1.1
+ grpcio 1.58.0
+ guided-diffusion 0.0.0 /home/zyx/GenerativeDiffusionPrior
+ huggingface-hub 0.30.2
+ idna 3.4
+ imageio 2.31.5
+ imgaug 0.4.0
+ importlib-metadata 6.8.0
+ importlib-resources 6.1.0
+ ip-adapter 0.1.0
+ ipython 8.12.3
+ jedi 0.19.2
+ Jinja2 3.1.2
+ joblib 1.3.2
+ kiwisolver 1.4.5
+ lazy_loader 0.3
+ lightning-utilities 0.9.0
+ lit 16.0.6
+ llvmlite 0.41.0
+ lmdb 1.4.1
+ loguru 0.7.2
+ lora-diffusion 0.1.7
+ loralib 0.1.2
+ lpips 0.1.4
+ lxml 4.9.3
+ Markdown 3.4.4
+ markdown-it-py 3.0.0
+ MarkupSafe 2.1.3
+ matplotlib 3.7.3
+ matplotlib-inline 0.1.7
+ mdurl 0.1.2
+ mediapipe 0.10.5
+ mmcv 1.7.0
+ mmengine 0.10.7
+ mpi4py 3.1.4
+ mpmath 1.3.0
+ multidict 6.0.4
+ mypy-extensions 1.0.0
+ natsort 8.4.0
+ networkx 3.1
+ ninja 1.11.1.1
+ numba 0.58.0
+ numpy 1.24.4
+ nvidia-cublas-cu11 11.10.3.66
+ nvidia-cuda-cupti-cu11 11.7.101
+ nvidia-cuda-nvrtc-cu11 11.7.99
+ nvidia-cuda-runtime-cu11 11.7.99
+ nvidia-cudnn-cu11 8.5.0.96
+ nvidia-cufft-cu11 10.9.0.58
+ nvidia-curand-cu11 10.2.10.91
+ nvidia-cusolver-cu11 11.4.0.1
+ nvidia-cusparse-cu11 11.7.4.91
+ nvidia-ml-py 12.535.77
+ nvidia-nccl-cu11 2.14.3
+ nvidia-nvtx-cu11 11.7.91
+ oauthlib 3.2.2
+ omegaconf 2.3.0
+ open-clip-torch 2.20.0
+ openai-clip 1.0.1
+ opencv-contrib-python 4.8.0.76
+ opencv-python 4.8.0.74
+ opencv-python-headless 4.9.0.80
+ packaging 23.1
+ pandas 2.0.3
+ parso 0.8.4
+ pathtools 0.1.2
+ peft 0.13.2
+ pexpect 4.9.0
+ pickleshare 0.7.5
+ Pillow 10.0.0
+ pip 23.1.2
+ platformdirs 3.11.0
+ prompt_toolkit 3.0.48
+ protobuf 3.20.3
+ psutil 5.9.5
+ ptyprocess 0.7.0
+ pure_eval 0.2.3
+ pyasn1 0.5.0
+ pyasn1-modules 0.3.0
+ pycparser 2.21
+ pycryptodomex 3.18.0
+ pydantic 2.7.1
+ pydantic_core 2.18.2
+ pyDeprecate 0.3.1
+ Pygments 2.18.0
+ pyiqa 0.1.8
+ pyparsing 3.1.1
+ pyre-extensions 0.0.23
+ PySocks 1.7.1
+ python-dateutil 2.8.2
+ pytorch-fid 0.3.0
+ pytorch-lightning 1.4.2
+ pytz 2023.3.post1
+ PyWavelets 1.4.1
+ PyYAML 6.0.1
+ realesrgan 0.3.0
+ regex 2023.8.8
+ requests 2.31.0
+ requests-oauthlib 1.3.1
+ rich 14.0.0
+ rsa 4.9
+ safetensors 0.4.5
+ scikit-image 0.21.0
+ scikit-learn 1.3.2
+ scipy 1.10.1
+ sentencepiece 0.1.99
+ sentry-sdk 1.31.0
+ setproctitle 1.3.2
+ setuptools 68.0.0
+ shapely 2.0.1
+ six 1.16.0
+ smmap 5.0.1
+ sounddevice 0.4.6
+ soupsieve 2.5
+ stack-data 0.6.3
+ sympy 1.12
+ tb-nightly 2.14.0a20230808
+ tensorboard 2.14.0
+ tensorboard-data-server 0.7.1
+ termcolor 2.3.0
+ threadpoolctl 3.2.0
+ tifffile 2023.7.10
+ timm 0.9.7
+ tokenizers 0.15.0
+ tomli 2.0.1
+ torch 1.13.1+cu116
+ torchmetrics 0.5.0
+ torchvision 0.14.1+cu116
+ tqdm 4.65.0
+ traitlets 5.14.3
+ transformers 4.36.2
+ triton 2.0.0
+ typing_extensions 4.11.0
+ typing-inspect 0.9.0
+ tzdata 2023.3
+ urllib3 1.26.16
+ vqfr 2.0.0 /home/zyx/VQFR
+ wandb 0.15.11
+ wcwidth 0.2.6
+ Werkzeug 2.3.7
+ wheel 0.40.0
+ xformers 0.0.16
+ yapf 0.40.2
+ yarl 1.9.2
+ zipp 3.17.0
requirements.txt ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ torch>=2.0.0
2
+ torchvision
3
+ cupy-cuda12x
4
+ transformers==4.46.2
5
+ controlnet-aux==0.0.7
6
+ imageio
7
+ imageio[ffmpeg]
8
+ safetensors
9
+ einops
10
+ sentencepiece
11
+ protobuf
12
+ modelscope
13
+ ftfy