---
license: apache-2.0
pipeline_tag: video-to-video
library_name: diffusers
---
<h1 align="center">SwiftVR: Real-Time One-Step Generative Video Restoration</h1>
<p align="center"><img src="assets/teaser.avif" width="100%" alt="SwiftVR teaser"></p>
> **SwiftVR** is the first generative video restoration model to reach **real-time 1080p streaming on a consumer-grade GPU** (≈26 FPS on a single RTX 5090), sustains **31 FPS at QHD (2560×1440)** and **14 FPS at 4K (3840×2160)** on a single H100, and streams at resolutions where every compared diffusion-based VR baseline runs out of memory.
<p>
<a href="https://arxiv.org/abs/2606.09516"><img src="https://img.shields.io/badge/arXiv-2606.09516-b31b1b.svg?style=flat-square" alt="arXiv"></a>
<a href="https://h-oliday.github.io/SwiftVR/"><img src="https://img.shields.io/badge/Project-Page-1f8acb.svg?style=flat-square" alt="Project Page"></a>
<a href="https://github.com/H-oliday/SwiftVR">
<img src="https://img.shields.io/badge/GitHub-Code-181717.svg?style=flat-square&logo=github" alt="GitHub">
</a>
<a href="https://github.com/H-oliday/SwiftVR/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg?style=flat-square" alt="License"></a>
</p>
SwiftVR is a streaming one-step generative video restoration (VR) framework presented in [SwiftVR: Real-Time One-Step Generative Video Restoration](https://arxiv.org/abs/2606.09516).
## Updates
- [2026/06] Released the inference code and pretrained weights.
## Highlights
- **Mask-free shifted-window self-attention (MFSWA).** Each spatial window is **pre-gathered into a dense tensor**, so every attention call reduces to a single standard scaled-dot-product (SDPA) call: *no attention mask, cyclic shift, or padding ever enters the graph*. This gives a **1.62× throughput gain over its full-attention teacher** at essentially identical quality, with **no dedicated sparse kernel**.
- **Restoration-aware Autoencoder (ReAE).** A lightweight encoder-decoder jointly fine-tuned with the DiT in pixel space removes the heavy-3D-VAE / tiled-decoding bottleneck.
- **Causal chunk-wise streaming.** A minimal causal protocol (no rolling KV cache, no overlapped DiT inference) bounds the temporal axis, confining the residual \(\mathcal{O}(N^2)\) cost to the spatial axes.
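
To see why bounding the temporal axis matters, the back-of-the-envelope sketch below counts query-key pairs for attention over a whole clip versus attention confined to fixed-length chunks. It is a toy cost model, not SwiftVR's implementation: it ignores the shifted windows, the latent compression of the autoencoder, and any context carried between chunks. The function names and the `tokens_per_frame` value are illustrative assumptions.

```python
def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    n = num_frames * tokens_per_frame
    return n * n


def chunked_attention_pairs(num_frames: int, tokens_per_frame: int, chunk_len: int) -> int:
    """Query-key pairs when attention is confined to chunks of `chunk_len` frames."""
    full_chunks, remainder = divmod(num_frames, chunk_len)
    per_chunk = (chunk_len * tokens_per_frame) ** 2
    tail = (remainder * tokens_per_frame) ** 2  # a shorter final chunk, if any
    return full_chunks * per_chunk + tail


if __name__ == "__main__":
    tokens = 1_000  # hypothetical number of tokens per frame
    for frames in (24, 48, 96):
        full = full_attention_pairs(frames, tokens)
        chunked = chunked_attention_pairs(frames, tokens, chunk_len=24)
        print(f"{frames:3d} frames: full={full:.2e}  chunked={chunked:.2e}")
```

In this model, doubling the clip length quadruples the full-attention count but only doubles the chunked one. The quadratic term that remains depends on `tokens_per_frame`, that is, on spatial resolution.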
## Results
### Efficiency at 2560×1440 (single H100, causal streaming, 24 frames)
| Metric | DOVE (tile) | SeedVR2-3B (tile) | FlashVSR-Tiny | **SwiftVR (Ours)** |
|---|:---:|:---:|:---:|:---:|
| Avg. Time (s) ↓ | 27.615 | 17.320 | 2.493 | 0.766 |
| FPS ↑ | 0.85 | 1.39 | 9.61 | 31.32 |
| Peak Mem. (GB) ↓ | 59.24 | 35.35 | 34.35 | 38.01 |
> At **3840×2160**, every compared diffusion-based VR baseline **OOMs** on a single H100; SwiftVR sustains **14 FPS**.
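
The FPS row in the table above is close to the 24-frame clip length divided by the average time, so relative speedups can be read straight off the timings. The snippet below is only that arithmetic applied to the published numbers; the variable names are illustrative. The implied FPS differs slightly from the table for some methods, presumably because of how individual runs were averaged.

```python
# Average time in seconds for a 24-frame clip at 2560x1440, copied from the table.
NUM_FRAMES = 24
avg_time_s = {
    "DOVE (tile)": 27.615,
    "SeedVR2-3B (tile)": 17.320,
    "FlashVSR-Tiny": 2.493,
    "SwiftVR": 0.766,
}

# Frames per second implied by clip length / average time.
implied_fps = {name: NUM_FRAMES / t for name, t in avg_time_s.items()}

# Ratio of each method's average time to SwiftVR's (1.0 for SwiftVR itself).
time_ratio = {name: t / avg_time_s["SwiftVR"] for name, t in avg_time_s.items()}

for name in avg_time_s:
    print(f"{name:18s} {implied_fps[name]:6.2f} FPS   time vs. SwiftVR: {time_ratio[name]:5.2f}x")
```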
### Qualitative comparison
<img src="assets/qualitative.png" width="100%" alt="SwiftVR qualitative comparison">
## Installation
```bash
git clone https://github.com/H-oliday/SwiftVR.git
cd SwiftVR
conda create -n swiftvr python=3.10 -y
conda activate swiftvr
# Install PyTorch matching your CUDA toolkit first, e.g. CUDA 12.4:
pip install torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu124
# Install SwiftVR (editable) and its dependencies:
pip install -e .
```
<details>
<summary><b>Hardware notes</b></summary>
- **Server:** single H100-80G reproduces the QHD/4K numbers above.
- **Consumer:** single RTX 5090 reaches ≈26 FPS at 1080p with the *same checkpoint* (default PyTorch SDPA path, bfloat16, causal chunk protocol).
- No hardware-specific retraining or kernel rewrite is required on any platform.
</details>
## Model Zoo
| Model Name | Date | Backbone | Link |
|---|---|---|---|
| SwiftVR | 2026.06 | Wan2.2-TI2V-5B | [🤗 HuggingFace](https://huggingface.co/H-oliday/SwiftVR) |
```bash
huggingface-cli download H-oliday/SwiftVR --local-dir checkpoints/
```
Expected checkpoint layout (the directory passed to `from_pretrained`):
```
checkpoints/
├── reae.safetensors              # Restoration-aware Autoencoder weights
├── prompt_embedding.safetensors  # precomputed empty-prompt text embedding (key: "prompt_emb")
└── transformer/                  # diffusers-format DiT
    ├── config.json
    └── diffusion_pytorch_model.safetensors
```
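
Before loading the model it can be useful to confirm that the download produced this layout. The helper below is a minimal sketch that only checks for the four files listed above; `missing_checkpoint_files` is a hypothetical name and is not part of the `swiftvr` package.

```python
from pathlib import Path

# Relative paths taken from the expected checkpoint layout.
EXPECTED_FILES = (
    "reae.safetensors",
    "prompt_embedding.safetensors",
    "transformer/config.json",
    "transformer/diffusion_pytorch_model.safetensors",
)


def missing_checkpoint_files(root: str) -> list[str]:
    """Return the expected files that are not present under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files("checkpoints")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Checkpoint layout looks complete.")
```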
## Quick Start
### Python API
```python
from swiftvr import SwiftVRPipeline
pipe = SwiftVRPipeline.from_pretrained("H-oliday/SwiftVR").to("cuda", dtype="bfloat16")
pipe.restore_video("low_quality.mp4", "restored.mp4", upscale=4)
```
`restore_video` also accepts an image folder as input and can write a PNG sequence with `png_save=True`.
Tunable knobs include:
* `clip_len`: middle chunk size, multiple of 4
* `dit_overlap`: overlap for DiT inference
* `fps`: output video frame rate
* `quality`: 0–100, mapped to x265 CRF
* `queue_size`: pipeline queue size
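
Two of these knobs come with stated constraints: `clip_len` must be a multiple of 4 and `quality` lies in 0–100. The pre-flight check below is a small sketch of validating them before a long run. `check_knobs` and `round_down_to_multiple_of_4` are hypothetical helpers, the requirement that `clip_len` be positive is an assumption, and this README does not say whether the pipeline performs its own validation.

```python
def check_knobs(clip_len: int, quality: int) -> None:
    """Raise ValueError if the documented constraints are violated."""
    if clip_len <= 0 or clip_len % 4 != 0:
        raise ValueError(f"clip_len must be a positive multiple of 4, got {clip_len}")
    if not 0 <= quality <= 100:
        raise ValueError(f"quality must be in [0, 100], got {quality}")


def round_down_to_multiple_of_4(n: int) -> int:
    """Largest multiple of 4 that does not exceed n."""
    return n - n % 4


if __name__ == "__main__":
    clip_len = round_down_to_multiple_of_4(26)  # 26 is not valid, 24 is
    check_knobs(clip_len=clip_len, quality=80)
    print("clip_len =", clip_len)
```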
### Streaming (causal, chunk by chunk, no future frames)
`pipe.stream` returns a session that restores the video one chunk at a time, using only the frames it has received so far.
```python
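# `pipe` is the SwiftVRPipeline created in the Python API example.
# `read_chunks` and `write` stand in for your own video I/O helpers.
session = pipe.stream(clip_len=24, resolution=(1920, 1080))
for lq_chunk in read_chunks("low_quality.mp4", n=24):  # lq_chunk: [T, H, W, 3] uint8
    hq = session.step(lq_chunk)  # [1, T', 3, H', W'] in [0, 1], or None if buffered
    if hq is not None:
        write(hq)
tail = session.flush()  # flush the final buffered frames
if tail is not None:  # assumption: flush() may return None when nothing is buffered
    write(tail)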
```
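
The streaming example relies on a `read_chunks` helper that this README does not define. The sketch below covers only the grouping step: it turns any iterable of frames into consecutive chunks of at most `n` frames. Decoding the video and stacking each chunk into a `[T, H, W, 3]` uint8 array is left to the video library of your choice, and `chunked` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def chunked(frames: Iterable[Frame], n: int) -> Iterator[List[Frame]]:
    """Yield consecutive lists of up to `n` frames; the last one may be shorter."""
    if n <= 0:
        raise ValueError("n must be positive")
    it = iter(frames)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Integers stand in for decoded frames here.
    print([len(c) for c in chunked(range(50), 24)])
```

Because it consumes the input lazily, the generator never holds more than one chunk in memory, which fits the chunk-by-chunk protocol used by `session.step`.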
### Command line
```bash
python scripts/inference.py \
    --input low_quality.mp4 \
    --output restored.mp4 \
    --checkpoint checkpoints/ \
    --upscale 4 \
    --clip-len 24 \
    --dtype bfloat16
```
Use `--png` to write a PNG sequence.
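
To restore a whole folder of clips, the command can be driven from a short script. The sketch below uses only the flags shown in the command above; the helper names and the folder arguments are hypothetical, and it assumes the script is run from the repository root so that `scripts/inference.py` resolves.

```python
import subprocess
import sys
from pathlib import Path


def build_command(src: Path, dst: Path, checkpoint: str = "checkpoints/") -> list[str]:
    """Assemble the inference command from the flags documented above."""
    return [
        sys.executable, "scripts/inference.py",
        "--input", str(src),
        "--output", str(dst),
        "--checkpoint", checkpoint,
        "--upscale", "4",
        "--clip-len", "24",
        "--dtype", "bfloat16",
    ]


def restore_folder(in_dir: str, out_dir: str) -> None:
    """Run the CLI once per .mp4 file found in `in_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.mp4")):
        # check=True stops the batch on the first failing clip.
        subprocess.run(build_command(src, out / src.name), check=True)
```

Calling `restore_folder("inputs", "outputs")` would then process every `.mp4` file in `inputs/` in sorted order.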
## More Visual Results
> Full-length restored clips (low-quality input → SwiftVR, played back to back).
<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_1.mp4" controls width="100%"></video>
<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_2.mp4" controls width="100%"></video>
<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_3.mp4" controls width="100%"></video>
## Acknowledgements
SwiftVR builds on [Wan2.2-TI2V-5B](https://github.com/Wan-Video), the lightweight autoencoder [TAEHV](https://github.com/madebyollin/taehv), and the [RealBasicVSR](https://github.com/ckkelvinchan/RealBasicVSR) degradation pipeline. We thank the authors of [DOVE](https://github.com/zhengchen1999/DOVE), [SeedVR2](https://github.com/ByteDance-Seed/SeedVR), and [FlashVSR](https://github.com/OpenImagingLab/FlashVSR) for releasing strong baselines, and the [UltraVideo](https://github.com/Tele-AI/UltraVideo) team for the training corpus.
## License
SwiftVR is released under the **Apache License 2.0**.
Copyright 2026 SwiftVR Authors.
Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at:
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, this project is distributed on an **"AS IS" BASIS**, without warranties or conditions of any kind, either express or implied. See the [LICENSE](./LICENSE) file for the full license text.
## Citation
```bibtex
@article{yan2026swiftvr,
title={SwiftVR: Real-Time One-Step Generative Video Restoration},
author={Yan, Jiaqi and Chen, Xiangyu and Zhong, Xinlin and Huang, Haibin and Zhang, Chi and Liu, Jie and Zhou, Jiantao and Li, Xuelong},
journal={arXiv preprint arXiv:2606.09516},
year={2026}
}
```
## Contact
If you have any questions, feel free to reach out:
* Email: [kakibluee@gmail.com](mailto:kakibluee@gmail.com)