# StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

<div align="center" style="margin-top: 0px; margin-bottom: 0px;">
<img src="asset/StereoPilot_logo.png" width="30%"/>
</div>

<div align="center">

### [[Project Page]](https://hit-perfect.github.io/StereoPilot/) [arXiv] [Dataset]

_**[Guibao Shen](https://a-bigbao.github.io)<sup>1,3*†</sup>, [Yihua Du](https://hit-perfect.github.io)<sup>1*</sup>, [Wenhang Ge](https://g3956.github.io/wenhangge.github.io/)<sup>1,3*†</sup>, [Jing He](https://jingheya.github.io)<sup>1</sup>, [Chirui Chang](https://hit-perfect.github.io/StereoPilot/)<sup>3</sup>, [Donghao Zhou](https://correr-zhou.github.io/)<sup>4</sup>, [Zhen Yang](https://zhenyangcs.github.io/)<sup>1</sup>, [Luozhou Wang](https://wileewang.github.io)<sup>1</sup>, [Xin Tao](https://www.xtao.website)<sup>3</sup>, [Ying-Cong Chen](https://www.yingcong.me)<sup>1,2‡</sup>**_

<sup>1</sup>HKUST(GZ), <sup>2</sup>HKUST, <sup>3</sup>Kling Team, Kuaishou Technology, <sup>4</sup>CUHK

(*Equal contribution, †Work done during the authors' internship at Kling, ‡Corresponding author)

</div>

## 📖 Introduction

**TL;DR:** We propose **StereoPilot**, an efficient feed-forward architecture that leverages pretrained video diffusion transformers to synthesize novel views directly, without iterative denoising, overcoming the limitations of *Depth-Warp-Inpaint* methods. With a domain switcher and a cycle-consistency loss, it enables robust multi-format stereo conversion. We also introduce **UniStereo**, the first large-scale unified dataset featuring both parallel and converged stereo formats.

<div align="center">

[![Watch the video](./asset/showcase_preview.jpg)](https://www.youtube.com/watch?v=P14q02ajKT0)

**🎬 Click the image to view our showcase video**

</div>

## 🔥 Updates

- __[2025.12.16]__: Released the inference code and the [Project Page](https://hit-perfect.github.io/StereoPilot/) (hope you like it).

## ⚙️ Requirements

Our inference environment:
- Python 3.12
- CUDA 12.1
- PyTorch 2.4.1
- GPU: NVIDIA A800 (only ~23 GB VRAM required)

## 🛠️ Installation

**Step 1:** Clone the repository

```bash
git clone <repository-url>
cd StereoPilot
```

**Step 2:** Create a conda environment

```bash
conda create -n StereoPilot python=3.12
conda activate StereoPilot
```

**Step 3:** Install dependencies

```bash
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
```

**Step 4:** Download model checkpoints

Place the following files in the `ckpt/` directory:

| File | Description |
|------|-------------|
| [`StereoPilot.safetensors`](https://huggingface.co/KlingTeam/StereoPilot) | StereoPilot model weights |
| [`Wan2.1-T2V-1.3B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) | Base Wan2.1 model directory |

Download the `StereoPilot.safetensors` weights and the Wan2.1-T2V-1.3B base model:

```bash
pip install "huggingface_hub[cli]"
huggingface-cli download KlingTeam/StereoPilot --local-dir ./ckpt
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./ckpt/Wan2.1-T2V-1.3B
```

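After downloading, you can sanity-check that everything landed where the table above expects it. This is a minimal sketch; the `check_checkpoints` helper is an illustration, not part of the released code:

```python
from pathlib import Path

def check_checkpoints(ckpt_dir="ckpt"):
    """Return the required checkpoint paths that are missing (empty list = ready)."""
    ckpt = Path(ckpt_dir)
    required = [
        ckpt / "StereoPilot.safetensors",  # StereoPilot weights (file)
        ckpt / "Wan2.1-T2V-1.3B",          # base Wan2.1 model (directory)
    ]
    return [str(p) for p in required if not p.exists()]
```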
89
+ ## 🚀 Inference
90
+
91
+ ### Input Requirements
92
+
93
+ For each input video, you need:
94
+ 1. **Video file** (`.mp4`): Monocular video, 81 frames, 832×480 resolution, 16fps
95
+ 2. **Prompt file** (`.txt`): Text description of the video content (same name as video)
96
+
97
+ Example (you can try the cases in the `sample/` folder):
98
+ ```
99
+ sample/
100
+ ├── my_video.mp4
101
+ └── my_video.txt
102
+ ```
103
+
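The pairing convention above (one same-named `.txt` per `.mp4`) can be checked with a small script before running inference. This is a sketch, assuming only that layout; `find_pairs` is not part of the released code:

```python
from pathlib import Path

def find_pairs(folder):
    """Pair each .mp4 in `folder` with its same-named .txt prompt file.

    Returns (pairs, missing): videos with a prompt, and videos lacking one.
    """
    folder = Path(folder)
    pairs, missing = [], []
    for video in sorted(folder.glob("*.mp4")):
        prompt = video.with_suffix(".txt")  # my_video.mp4 -> my_video.txt
        if prompt.exists():
            pairs.append((video, prompt))
        else:
            missing.append(video)
    return pairs, missing
```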
### Running Inference

**Basic usage:**

```bash
# Edit toml/infer.toml to customize model paths. If you followed the steps above, no changes are needed.
python sample.py \
    --config toml/infer.toml \
    --input /path/to/input_video.mp4 \
    --output_folder /path/to/output \
    --device cuda:0
```

**Using the example script:**

```bash
bash sample.sh
```

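To convert every clip in a folder in one go, the basic command above can be wrapped in a small batch driver. A sketch only: the `--config`/`--input`/`--output_folder`/`--device` flags come from the usage above, the rest (`build_commands`, the default paths) is an assumption:

```python
import sys
from pathlib import Path

def build_commands(input_dir, output_dir, config="toml/infer.toml", device="cuda:0"):
    """Build one sample.py command line per .mp4 in `input_dir`."""
    commands = []
    for video in sorted(Path(input_dir).glob("*.mp4")):
        commands.append([
            sys.executable, "sample.py",
            "--config", config,
            "--input", str(video),
            "--output_folder", output_dir,
            "--device", device,
        ])
    return commands

# To actually run the conversions (one GPU pass per clip):
#   import subprocess
#   for cmd in build_commands("sample", "output"):
#       subprocess.run(cmd, check=True)
```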
123
+ ### Generate Stereo Visualization
124
+
125
+ After inference, you can generate Side-by-Side (SBS) and Red-Cyan anaglyph stereo videos for visualization:
126
+
127
+ ```bash
128
+ python utils/stereo_video.py \
129
+ --left /path/to/left_eye.mp4 \
130
+ --right /path/to/right_eye.mp4 \
131
+ ```
132
+
133
+ **Output files:**
134
+ | Output | Description | Viewing Device |
135
+ |--------|-------------|----------------|
136
+ | `{name}_sbs.mp4` | Side-by-Side stereo video | VR Headset <img src="asset/VR_Glass.png" width="24" height="24"> |
137
+ | `{name}_anaglyph.mp4` | Red-Cyan anaglyph stereo video | 3D Glasses <img src="asset/Red_Blue_Glass.png" width="24" height="24"> |
138
+
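For reference, a red-cyan anaglyph frame is conventionally composed from the left eye's red channel and the right eye's green and blue channels, so each lens of the glasses passes the matching view. A minimal NumPy sketch of that convention (`utils/stereo_video.py` may implement it differently):

```python
import numpy as np

def anaglyph_frame(left, right):
    """Compose a red-cyan anaglyph from two RGB frames ((H, W, 3) uint8 arrays)."""
    out = np.empty_like(left)
    out[..., 0] = left[..., 0]     # red channel from the left view
    out[..., 1:] = right[..., 1:]  # green and blue from the right view
    return out
```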
139
+ ## 📊 Dataset
140
+
141
+ We introduce **UniStereo**, the first large-scale unified stereo video dataset featuring both parallel and converged stereo formats.
142
+
143
+ <div align="center">
144
+ <img src="asset/parallel_vs_converged.png" width="80%">
145
+ </div>
146
+
147
+ UniStereo consists of two parts:
148
+ - **3DMovie** - Converged stereo format from 3D movies
149
+ - **Stereo4D** - Parallel stereo format *(coming soon)*
150
+
151
+ For detailed data processing instructions, please refer to [StereoPilot_Dataprocess](./StereoPilot_Dataprocess/).
152
+
## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- [Wan2.1](https://github.com/Wan-Video/Wan2.1) - Base video generation model

## 🌟 Citation

If you find our work helpful, please consider citing:

```bibtex
@article{shen2025stereopilot,
  title={StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors},
  author={Shen, Guibao and Du, Yihua and Ge, Wenhang and He, Jing and Chang, Chirui and Zhou, Donghao and Yang, Zhen and Wang, Luozhou and Tao, Xin and Chen, Ying-Cong},
  journal={arXiv preprint},
  year={2025}
}
```