---
license: mit
base_model:
- Wan-AI/Wan2.1-T2V-1.3B
pipeline_tag: video-to-video
---
# StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

<!-- <div align="center" style="margin-top: 0px; margin-bottom: 0px;">
<img src=asset/StereoPilot_logo.png width="30%"/>
</div> -->

<div align="center">

_**[Guibao Shen](https://a-bigbao.github.io)<sup>1,3*†</sup>, [Yihua Du](https://hit-perfect.github.io)<sup>1*</sup>, [Wenhang Ge](https://g3956.github.io/wenhangge.github.io/)<sup>1,3*†</sup>, [Jing He](https://jingheya.github.io)<sup>1</sup>, [Chirui Chang](https://hit-perfect.github.io/StereoPilot/)<sup>3</sup>, [Donghao Zhou](https://correr-zhou.github.io/)<sup>4</sup>, [Zhen Yang](https://zhenyangcs.github.io/)<sup>1</sup>, [Luozhou Wang](https://wileewang.github.io)<sup>1</sup>, [Xin Tao](https://www.xtao.website)<sup>3</sup>, [Ying-Cong Chen](https://www.yingcong.me)<sup>1,2‡</sup>**_

<sup>1</sup>HKUST(GZ), <sup>2</sup>HKUST, <sup>3</sup>Kling Team, Kuaishou Technology, <sup>4</sup>CUHK

(*Equal contribution, †This work was conducted during the authors' internships at Kling, ‡Corresponding author)

</div>

## 📖 Introduction

**TL;DR:** We propose **StereoPilot**, an efficient feed-forward architecture that leverages pretrained video diffusion transformers to directly synthesize novel views, overcoming the limitations of *Depth-Warp-Inpaint* methods without iterative denoising. With a domain switcher and cycle consistency loss, it enables robust multi-format stereo conversion. We also introduce **UniStereo**, the first large-scale unified dataset featuring both parallel and converged stereo formats.

<div align="center">

[![Watch the video](./asset/showcase_preview.png)](https://www.youtube.com/watch?v=P14q02ajKT0)

**🎬 Click the image to view our showcase video**

</div>

## 🔥 Updates

- __[2025.12.16]__: Released the inference code and the [Project Page](https://hit-perfect.github.io/StereoPilot/).


## โš™๏ธ Requirements

Our inference environment:
- Python 3.12
- CUDA 12.1
- PyTorch 2.4.1
- GPU: NVIDIA A800 (only ~23GB VRAM required)

## ๐Ÿ› ๏ธ Installation

**Step 1:** Clone the repository

```bash
git clone https://github.com/KlingTeam/StereoPilot.git

cd StereoPilot
```

**Step 2:** Create conda environment

```bash
conda create -n StereoPilot python=3.12

conda activate StereoPilot
```

**Step 3:** Install dependencies

```bash
pip install -r requirements.txt

pip install flash-attn==2.7.4.post1 --no-build-isolation
```

**Step 4:** Download model checkpoints

Place the following files in the `ckpt/` directory:

| File | Description |
|------|-------------|
| [`StereoPilot.safetensors`](https://huggingface.co/KlingTeam/StereoPilot) | StereoPilot model weights |
| [`Wan2.1-T2V-1.3B`](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) | Base Wan2.1 model directory |

Download `StereoPilot.safetensors` and the `Wan2.1-T2V-1.3B` base model:

```bash
pip install "huggingface_hub[cli]"

huggingface-cli download KlingTeam/StereoPilot StereoPilot.safetensors --local-dir ./ckpt

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./ckpt/Wan2.1-T2V-1.3B
```
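A quick, optional sanity check (our own suggestion, not part of the repo) can confirm both checkpoints actually landed in `ckpt/` before you run inference:

```bash
# Optional sanity check: both paths below must exist before inference will work.
for p in ckpt/StereoPilot.safetensors ckpt/Wan2.1-T2V-1.3B; do
  if [ -e "$p" ]; then echo "found: $p"; else echo "MISSING: $p"; fi
done
```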

## 🚀 Inference

### Input Requirements

For each input video, you need:
1. **Video file** (`.mp4`): Monocular video, 81 frames, 832×480 resolution, 16 fps
2. **Prompt file** (`.txt`): Text description of the video content (same basename as the video)

Example (you can try the cases in the `sample/` folder):
```
sample/
├── my_video.mp4
└── my_video.txt
```
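If a source clip does not already match this spec, it can be conformed with ffmpeg. A minimal sketch, assuming `ffmpeg` is installed; the helper name `conform_input` is ours, not part of the repo:

```bash
# Conform an arbitrary clip to the expected input spec:
# 832x480 resolution, 16 fps, exactly 81 frames, audio stripped.
conform_input() {
  ffmpeg -y -i "$1" -vf "scale=832:480,fps=16" -frames:v 81 -an "$2"
}
# conform_input raw_clip.mp4 sample/my_video.mp4
```

Note that `scale=832:480` stretches to the target size; crop or pad first if you need to preserve the original aspect ratio.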

### Running Inference

**Basic usage:**

```bash
# Edit toml/infer.toml to customize model paths; no changes are needed if you followed the steps above.
python sample.py \
  --config toml/infer.toml \
  --input /path/to/input_video.mp4 \
  --output_folder /path/to/output \
  --device cuda:0
```

**Using the example script:**

```bash
bash sample.sh
```
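To process every clip in a folder (assuming the video-plus-same-name-`.txt` layout shown above), a hypothetical batch driver might look like this; the function name `run_all` is ours:

```bash
# Run inference on every .mp4 in a folder; outputs land in per-clip subfolders.
run_all() {
  for vid in "$1"/*.mp4; do
    [ -e "$vid" ] || continue   # skip when the glob matches nothing
    python sample.py \
      --config toml/infer.toml \
      --input "$vid" \
      --output_folder "output/$(basename "$vid" .mp4)" \
      --device cuda:0
  done
}
# run_all sample
```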

### Generate Stereo Visualization

After inference, you can generate Side-by-Side (SBS) and Red-Cyan anaglyph stereo videos for visualization:

```bash
python utils/stereo_video.py \
  --left /path/to/left_eye.mp4 \
  --right /path/to/right_eye.mp4
```

**Output files:**
| Output | Description | Viewing Device |
|--------|-------------|----------------|
| `{name}_sbs.mp4` | Side-by-Side stereo video | VR Headset <img src="asset/VR_Glass.png" width="24" height="24"> |
| `{name}_anaglyph.mp4` | Red-Cyan anaglyph stereo video | 3D Glasses <img src="asset/Red_Blue_Glass.png" width="24" height="24"> |
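If you only need a quick side-by-side preview, ffmpeg's `hstack` filter can also compose one directly from the two eye videos. This is a sketch assuming `ffmpeg` is installed, offered as an alternative to `utils/stereo_video.py`, not a description of it; `make_sbs` is our own helper name:

```bash
# Stack the left and right eye videos horizontally into a single SBS preview.
make_sbs() {
  ffmpeg -y -i "$1" -i "$2" -filter_complex hstack=inputs=2 "$3"
}
# make_sbs left_eye.mp4 right_eye.mp4 preview_sbs.mp4
```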

## 📊 Dataset

We introduce **UniStereo**, the first large-scale unified stereo video dataset featuring both parallel and converged stereo formats.

<div align="center">
<img src="asset/parallel_vs_converged.png" width="80%">
</div>

UniStereo consists of two parts:
- **3DMovie** - Converged stereo format from 3D movies
- **Stereo4D** - Parallel stereo format *(coming soon)*

For detailed data processing instructions, please refer to [StereoPilot_Dataprocess](./StereoPilot_Dataprocess/).

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## ๐Ÿ™ Acknowledgments

- [Wan2.1](https://github.com/Wan-Video/Wan2.1) - Base video generation model
- [Diffusion Pipe](https://github.com/tdrussell/diffusion-pipe) - Training code base

## 🌟 Citation

If you find our work helpful, please consider citing:

```bibtex
@misc{shen2025stereopilot,
  title={StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors},
  author={Shen, Guibao and Du, Yihua and Ge, Wenhang and He, Jing and Chang, Chirui and Zhou, Donghao and Yang, Zhen and Wang, Luozhou and Tao, Xin and Chen, Ying-Cong},
  year={2025},
  eprint={2512.16915},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.16915}, 
}
```