---
license: apache-2.0
pipeline_tag: video-to-video
library_name: diffusers
---

<h1 align="center">SwiftVR: Real-Time One-Step Generative Video Restoration</h1>



<p align="center"><img src="assets/teaser.avif" width="100%" alt="SwiftVR teaser"></p>



> **SwiftVR** is the first generative video restoration model to reach **real-time 1080p streaming on a consumer-grade GPU** (≈26 FPS on a single RTX 5090), sustains **31 FPS at QHD (2560×1440)** and **14 FPS at 4K (3840×2160)** on a single H100, and streams at resolutions where every compared diffusion-based VR baseline runs out of memory.

<p>
  <a href="https://arxiv.org/abs/2606.09516"><img src="https://img.shields.io/badge/arXiv-2606.09516-b31b1b.svg?style=flat-square" alt="arXiv"></a>
  <a href="https://h-oliday.github.io/SwiftVR/"><img src="https://img.shields.io/badge/Project-Page-1f8acb.svg?style=flat-square" alt="Project Page"></a>
  <a href="https://github.com/H-oliday/SwiftVR">
  <img src="https://img.shields.io/badge/GitHub-Code-181717.svg?style=flat-square&logo=github" alt="GitHub">
</a>
  <a href="https://github.com/H-oliday/SwiftVR/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg?style=flat-square" alt="License"></a>
</p>


SwiftVR is a streaming one-step generative video restoration (VR) framework presented in [SwiftVR: Real-Time One-Step Generative Video Restoration](https://arxiv.org/abs/2606.09516).

## Updates

- [2026/06] Release the inference code and pretrained weights 🎉





## ✨ Highlights

- **Mask-free shifted-window self-attention (MFSWA).** Each spatial window is **pre-gathered into a dense tensor**, so every attention call reduces to a single standard scaled-dot-product attention (SDPA) call: *no attention mask, cyclic shift, or padding ever enters the graph*. This gives a **1.62× throughput gain over its full-attention teacher** at essentially identical quality, with **no dedicated sparse kernel**.
- **Restoration-aware Autoencoder (ReAE).** A lightweight encoder–decoder jointly fine-tuned with the DiT in pixel space removes the heavy-3D-VAE / tiled-decoding bottleneck.
- **Causal chunk-wise streaming.** A minimal causal protocol (no rolling KV cache, no overlapped DiT inference) bounds the temporal axis, confining the residual \(\mathcal{O}(N^2)\) cost to the spatial axes.
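
A rough count of query-key pairs illustrates why bounding the temporal axis matters. This is a back-of-the-envelope sketch, not SwiftVR's implementation: `attention_pairs` is a hypothetical helper, the token counts are made-up illustrative values, the count ignores the spatial windowing done by MFSWA, and it assumes tokens attend only within their own temporal chunk.

```python
from typing import Optional


def attention_pairs(tokens_per_frame: int, frames: int,
                    chunk_frames: Optional[int] = None) -> int:
    """Count query-key pairs for dense self-attention over a video.

    All sizes are illustrative; they are not SwiftVR's real latent dimensions.
    With chunk_frames set, tokens attend only within their own temporal chunk
    (an assumption made for this sketch).
    """
    if chunk_frames is None:
        n = tokens_per_frame * frames              # every token sees every token
        return n * n
    full_chunks, remainder = divmod(frames, chunk_frames)
    per_chunk = (tokens_per_frame * chunk_frames) ** 2
    tail = (tokens_per_frame * remainder) ** 2     # shorter final chunk, if any
    return full_chunks * per_chunk + tail


# Doubling the video length quadruples the unchunked cost ...
print(attention_pairs(1000, 120) / attention_pairs(1000, 60))        # 4.0
# ... but only doubles it when the temporal axis is bounded by a chunk.
print(attention_pairs(1000, 120, 6) / attention_pairs(1000, 60, 6))  # 2.0
```

Under these assumptions the total cost grows linearly with video length, while the quadratic term depends only on the per-chunk token count.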


## 📊 Results

### Efficiency at 2560×1440 (single H100, causal streaming, 24 frames)

| Metric | DOVE (tile) | SeedVR2-3B (tile) | FlashVSR-Tiny | **SwiftVR (Ours)** |
|---|:---:|:---:|:---:|:---:|
| Avg. Time (s) ↓ | 27.615 | 17.320 | 2.493 | 0.766 |
| FPS ↑ | 0.85 | 1.39 | 9.61 | 31.32 |
| Peak Mem. (GB) ↓ | 59.24 | 35.35 | 34.35 | 38.01 |

> At **3840×2160**, every compared diffusion-based VR baseline **OOMs** on a single H100; SwiftVR sustains **14 FPS**.
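
The FPS row of the table can be reproduced approximately from the timing row. The sketch below assumes that `Avg. Time` is the wall-clock time for the 24-frame clip named in the table heading; the source does not state how FPS was averaged, so differences of roughly 0.02 FPS from the reported values are expected.

```python
FRAMES = 24  # clip length taken from the table heading

# Avg. Time (s) values copied from the table above.
avg_time_s = {
    "DOVE (tile)": 27.615,
    "SeedVR2-3B (tile)": 17.320,
    "FlashVSR-Tiny": 2.493,
    "SwiftVR": 0.766,
}

# Frames per second under the assumption stated above.
fps = {name: FRAMES / seconds for name, seconds in avg_time_s.items()}

for name, value in fps.items():
    # Ratio of this method's time to SwiftVR's time on the same clip.
    ratio = avg_time_s[name] / avg_time_s["SwiftVR"]
    print(f"{name:18s} {value:6.2f} FPS   time relative to SwiftVR: {ratio:5.2f}x")
```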

### Qualitative comparison

<img src="assets/qualitative.png" width="100%" alt="SwiftVR qualitative comparison">



## 🛠 Installation

```bash
git clone https://github.com/H-oliday/SwiftVR.git
cd SwiftVR

conda create -n swiftvr python=3.10 -y
conda activate swiftvr

# Install PyTorch matching your CUDA toolkit first, e.g. CUDA 12.4:
pip install torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu124

# Install SwiftVR (editable) and its dependencies:
pip install -e .
```

<details>
<summary><b>Hardware notes</b></summary>

- **Server:** single H100-80G reproduces the QHD/4K numbers above.
- **Consumer:** single RTX 5090 reaches ≈26 FPS at 1080p with the *same checkpoint* (default PyTorch SDPA path, bfloat16, causal chunk protocol).
- No hardware-specific retraining or kernel rewrite is required on any platform.
</details>



## 🗂 Model Zoo

| Model Name | Date | Backbone | Link |
|---|---|---|---|
| SwiftVR | 2026.06 | Wan2.2-TI2V-5B | [🤗 HuggingFace](https://huggingface.co/H-oliday/SwiftVR) |

```bash
huggingface-cli download H-oliday/SwiftVR --local-dir checkpoints/
```

Expected checkpoint layout (the directory passed to `from_pretrained`):

```
checkpoints/
├── reae.safetensors             # Restoration-aware Autoencoder weights
├── prompt_embedding.safetensors # precomputed empty-prompt text embedding (key: "prompt_emb")
└── transformer/                 # diffusers-format DiT
    ├── config.json
    └── diffusion_pytorch_model.safetensors
```
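
Before loading, it can help to confirm that the download produced this layout. The following is a small sketch using only the standard library; `missing_checkpoint_files` is a hypothetical helper, not part of the `swiftvr` package, and it checks file presence only, not file contents.

```python
from pathlib import Path

# Relative paths taken from the layout listed above.
EXPECTED_FILES = (
    "reae.safetensors",
    "prompt_embedding.safetensors",
    "transformer/config.json",
    "transformer/diffusion_pytorch_model.safetensors",
)


def missing_checkpoint_files(root):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Check the directory used by the download command above.
missing = missing_checkpoint_files("checkpoints")
print("checkpoint looks complete" if not missing else f"missing: {missing}")
```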


## 🚀 Quick Start

### Python API

```python
from swiftvr import SwiftVRPipeline

pipe = SwiftVRPipeline.from_pretrained("H-oliday/SwiftVR").to("cuda", dtype="bfloat16")

pipe.restore_video("low_quality.mp4", "restored.mp4", upscale=4)
```

`restore_video` also accepts an image folder as input and can write a PNG sequence with `png_save=True`.

Tunable knobs include:

* `clip_len`: middle chunk size, multiple of 4
* `dit_overlap`: overlap for DiT inference
* `fps`: output video frame rate
* `quality`: 0–100, mapped to x265 CRF
* `queue_size`: pipeline queue size
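
The two constraints stated in the list (a `clip_len` that is a multiple of 4 and a `quality` between 0 and 100) can be checked before a long run starts. This is a sketch only: `check_knobs` and `round_clip_len` are hypothetical helpers, not part of the `swiftvr` package, and requiring a positive `clip_len` is an added assumption.

```python
def check_knobs(clip_len: int, quality: int) -> None:
    """Raise ValueError if the values violate the constraints listed above."""
    # Assumption: clip_len must also be positive (not stated in the list).
    if clip_len <= 0 or clip_len % 4 != 0:
        raise ValueError(f"clip_len must be a positive multiple of 4, got {clip_len}")
    if not 0 <= quality <= 100:
        raise ValueError(f"quality must be in 0-100, got {quality}")


def round_clip_len(requested: int) -> int:
    """Round a requested chunk size up to the next multiple of 4 (minimum 4)."""
    return max(4, -(-requested // 4) * 4)  # -(-a // b) is ceiling division


check_knobs(clip_len=round_clip_len(22), quality=80)  # 22 is rounded up to 24
```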


### Streaming (causal, chunk by chunk, no future frames)

A streaming session restores the video causally, one chunk at a time; `step` returns `None` while frames are still being buffered.

```python
# `read_chunks` and `write` stand in for your own video I/O helpers.
session = pipe.stream(clip_len=24, resolution=(1920, 1080))

for lq_chunk in read_chunks("low_quality.mp4", n=24):   # lq_chunk: [T, H, W, 3] uint8
    hq = session.step(lq_chunk)        # [1, T', 3, H', W'] in [0, 1], or None if buffered
    if hq is not None:
        write(hq)

tail = session.flush()                 # flush the final buffered frames
if tail is not None:
    write(tail)
```
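
`read_chunks` and `write` in the loop above are not defined in this README. The sketch below shows one way to write the surrounding plumbing, under stated assumptions: `iter_chunks` and `run_stream` are hypothetical helpers, frames come from any iterable (decoding the video is left to your I/O library of choice), and the session is only assumed to expose the `step` and `flush` methods shown above. Stacking each chunk into the `[T, H, W, 3]` array that `step` expects is left out, and whether `step` accepts a shorter final chunk is not stated in the source.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def iter_chunks(frames: Iterable[Frame], n: int) -> Iterator[List[Frame]]:
    """Yield consecutive lists of up to `n` frames; the last one may be shorter."""
    if n <= 0:
        raise ValueError("n must be positive")
    it = iter(frames)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


def run_stream(session, chunks, write: Callable) -> int:
    """Feed chunks to a session with step()/flush(); return how many outputs were written."""
    written = 0
    for chunk in chunks:
        out = session.step(chunk)   # may be None while the session is buffering
        if out is not None:
            write(out)
            written += 1
    tail = session.flush()          # emit whatever is still buffered
    if tail is not None:
        write(tail)
        written += 1
    return written
```

Keeping the `None` check and the final `flush` in one place makes it harder to drop the last buffered frames by accident.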

### Command line

```bash
python scripts/inference.py \
  --input low_quality.mp4 \
  --output restored.mp4 \
  --checkpoint checkpoints/ \
  --upscale 4 \
  --clip-len 24 \
  --dtype bfloat16
```

Use `--png` to write a PNG sequence.


## 🎬 More Visual Results

> Full-length restored clips (low-quality input → SwiftVR, played back to back).


<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_1.mp4" controls width="100%"></video>

<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_2.mp4" controls width="100%"></video>

<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_3.mp4" controls width="100%"></video>




## 🙏 Acknowledgements

SwiftVR builds on [Wan2.2-TI2V-5B](https://github.com/Wan-Video), the lightweight autoencoder [TAEHV](https://github.com/madebyollin/taehv), and the [RealBasicVSR](https://github.com/ckkelvinchan/RealBasicVSR) degradation pipeline. We thank the authors of [DOVE](https://github.com/zhengchen1999/DOVE), [SeedVR2](https://github.com/ByteDance-Seed/SeedVR), and [FlashVSR](https://github.com/OpenImagingLab/FlashVSR) for releasing strong baselines, and the [UltraVideo](https://github.com/Tele-AI/UltraVideo) team for the training corpus.



## 📜 License

SwiftVR is released under the **Apache License 2.0**.

Copyright 2026 SwiftVR Authors.

Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at:

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, this project is distributed on an **"AS IS" BASIS**, without warranties or conditions of any kind, either express or implied. See the [LICENSE](./LICENSE) file for the full license text.

## 📜 Citation

```bibtex
@article{yan2026swiftvr,
  title={SwiftVR: Real-Time One-Step Generative Video Restoration},
  author={Yan, Jiaqi and Chen, Xiangyu and Zhong, Xinlin and Huang, Haibin and Zhang, Chi and Liu, Jie and Zhou, Jiantao and Li, Xuelong},
  journal={arXiv preprint arXiv:2606.09516},
  year={2026}
}
```


## Contact

If you have any questions, feel free to reach out:

* Email: [kakibluee@gmail.com](mailto:kakibluee@gmail.com)