---
license: mit
base_model:
- Wan-AI/Wan2.1-T2V-1.3B
pipeline_tag: video-to-video
---
# StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
<!-- <div align="center" style="margin-top: 0px; margin-bottom: 0px;">
<img src=asset/StereoPilot_logo.png width="30%"/>
</div> -->
<div align="center">
_**[Guibao Shen](https://a-bigbao.github.io)<sup>1,3*†</sup>, [Yihua Du](https://hit-perfect.github.io)<sup>1*</sup>, [Wenhang Ge](https://g3956.github.io/wenhangge.github.io/)<sup>1,3*†</sup>, [Jing He](https://jingheya.github.io)<sup>1</sup>, [Chirui Chang](https://hit-perfect.github.io/StereoPilot/)<sup>3</sup>, [Donghao Zhou](https://correr-zhou.github.io/)<sup>4</sup>, [Zhen Yang](https://zhenyangcs.github.io/)<sup>1</sup>, [Luozhou Wang](https://wileewang.github.io)<sup>1</sup>, [Xin Tao](https://www.xtao.website)<sup>3</sup>, [Ying-Cong Chen](https://www.yingcong.me)<sup>1,2‡</sup>**_
<sup>1</sup>HKUST(GZ), <sup>2</sup>HKUST, <sup>3</sup>Kling Team, Kuaishou Technology, <sup>4</sup>CUHK
(*Equal contribution, †This work was conducted during the author's internship at Kling, ‡Corresponding author)
</div>
## 📖 Introduction
**TL;DR:** We propose **StereoPilot**, an efficient feed-forward architecture that leverages pretrained video diffusion transformers to directly synthesize novel views without iterative denoising, overcoming the limitations of *Depth-Warp-Inpaint* pipelines. With a domain switcher and a cycle consistency loss, it enables robust multi-format stereo conversion. We also introduce **UniStereo**, the first large-scale unified dataset featuring both parallel and converged stereo formats.
<div align="center">
[](https://www.youtube.com/watch?v=P14q02ajKT0)
**🎬 Click the image to view our showcase video**
</div>
## 🔥 Updates
- __[2025.12.16]__: Release inference code and [Project Page](https://hit-perfect.github.io/StereoPilot/).
## ⚙️ Requirements
Our inference environment:
- Python 3.12
- CUDA 12.1
- PyTorch 2.4.1
- GPU: NVIDIA A800 (only ~23GB VRAM required)
## 🛠️ Installation
**Step 1:** Clone the repository
```bash
git clone https://github.com/KlingTeam/StereoPilot.git
cd StereoPilot
```
**Step 2:** Create conda environment
```bash
conda create -n StereoPilot python=3.12
conda activate StereoPilot
```
**Step 3:** Install dependencies
```bash
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
```
**Step 4:** Download model checkpoints
Place the following files in the `ckpt/` directory:
| File | Description |
|------|-------------|
| [`StereoPilot.safetensors`](https://huggingface.co/KlingTeam/StereoPilot) | StereoPilot model weights |
| [`Wan2.1-T2V-1.3B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) | Base Wan2.1 model directory |
Download `StereoPilot.safetensors` and the Wan2.1-T2V-1.3B base model:
```bash
pip install "huggingface_hub[cli]"
huggingface-cli download KlingTeam/StereoPilot StereoPilot.safetensors --local-dir ./ckpt
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./ckpt/Wan2.1-T2V-1.3B
```
## 🚀 Inference
### Input Requirements
For each input video, you need:
1. **Video file** (`.mp4`): Monocular video, 81 frames, 832×480 resolution, 16 fps
2. **Prompt file** (`.txt`): Text description of the video content (same basename as the video file)
Example (you can try the cases in the `sample/` folder):
```
sample/
โโโ my_video.mp4
โโโ my_video.txt
```
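The video/prompt pairing convention can be checked with a small helper before launching inference. This is an illustrative sketch (`find_input_pairs` is not part of the repo):

```python
from pathlib import Path

def find_input_pairs(folder):
    """Return (video, prompt) pairs where each .mp4 has a same-named .txt."""
    pairs = []
    for video in sorted(Path(folder).glob("*.mp4")):
        prompt = video.with_suffix(".txt")
        if prompt.exists():
            pairs.append((video, prompt))
        else:
            print(f"warning: no prompt file for {video.name}")
    return pairs
```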
### Running Inference
**Basic usage:**
```bash
# Edit toml/infer.toml to customize model paths. If you followed the steps above, no changes are needed.
python sample.py \
--config toml/infer.toml \
--input /path/to/input_video.mp4 \
--output_folder /path/to/output \
--device cuda:0
```
**Using the example script:**
```bash
bash sample.sh
```
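To process every video in the `sample/` folder, a small batch driver can wrap the same CLI shown above. This is a sketch, not part of the repo; the `sample.py` flags are taken from the basic-usage command:

```python
import subprocess
from pathlib import Path

def build_command(video, output_folder, config="toml/infer.toml", device="cuda:0"):
    # Mirrors the sample.py invocation shown in "Basic usage" above.
    return [
        "python", "sample.py",
        "--config", config,
        "--input", str(video),
        "--output_folder", str(output_folder),
        "--device", device,
    ]

def run_batch(sample_dir="sample", output_folder="output"):
    # One inference run per monocular input video.
    for video in sorted(Path(sample_dir).glob("*.mp4")):
        subprocess.run(build_command(video, output_folder), check=True)
```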
### Generate Stereo Visualization
After inference, you can generate Side-by-Side (SBS) and Red-Cyan anaglyph stereo videos for visualization:
```bash
python utils/stereo_video.py \
--left /path/to/left_eye.mp4 \
    --right /path/to/right_eye.mp4
```
**Output files:**
| Output | Description | Viewing Device |
|--------|-------------|----------------|
| `{name}_sbs.mp4` | Side-by-Side stereo video | VR Headset <img src="asset/VR_Glass.png" width="24" height="24"> |
| `{name}_anaglyph.mp4` | Red-Cyan anaglyph stereo video | 3D Glasses <img src="asset/Red_Blue_Glass.png" width="24" height="24"> |
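For intuition, red-cyan anaglyph composition takes the red channel from the left-eye frame and the green/blue channels from the right-eye frame. A minimal per-pixel sketch (not the actual implementation in `utils/stereo_video.py`):

```python
def anaglyph_pixel(left_rgb, right_rgb):
    # Red from the left eye; green and blue (cyan) from the right eye.
    return (left_rgb[0], right_rgb[1], right_rgb[2])

def make_anaglyph(left_frame, right_frame):
    # Frames are equal-sized nested lists of (r, g, b) tuples.
    return [
        [anaglyph_pixel(l, r) for l, r in zip(lrow, rrow)]
        for lrow, rrow in zip(left_frame, right_frame)
    ]
```

The SBS output is simpler still: the left and right frames are placed side by side horizontally.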
## 📊 Dataset
We introduce **UniStereo**, the first large-scale unified stereo video dataset featuring both parallel and converged stereo formats.
<div align="center">
<img src="asset/parallel_vs_converged.png" width="80%">
</div>
UniStereo consists of two parts:
- **3DMovie** - Converged stereo format from 3D movies
- **Stereo4D** - Parallel stereo format *(coming soon)*
For detailed data processing instructions, please refer to [StereoPilot_Dataprocess](./StereoPilot_Dataprocess/).
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- [Wan2.1](https://github.com/Wan-Video/Wan2.1) - Base video generation model
- [Diffusion Pipe](https://github.com/tdrussell/diffusion-pipe) - Training code base
## 📝 Citation
If you find our work helpful, please consider citing:
```bibtex
@misc{shen2025stereopilot,
title={StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors},
author={Shen, Guibao and Du, Yihua and Ge, Wenhang and He, Jing and Chang, Chirui and Zhou, Donghao and Yang, Zhen and Wang, Luozhou and Tao, Xin and Chen, Ying-Cong},
year={2025},
eprint={2512.16915},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16915},
}
``` |