Video-to-Video
Diffusers
Safetensors

Add pipeline tag, library name and metadata

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +21 -105
README.md CHANGED
@@ -1,44 +1,36 @@
1
  ---
2
  license: apache-2.0
3
-
 
4
  ---
5
 
6
  <h1 align="center">SwiftVR: Real-Time One-Step Generative Video Restoration</h1>
7
 
8
-
9
-
10
- <p align="center"><img src="assets/teaser.avif" width="100%" alt="SwiftVR teaser"></p>
11
-
12
-
13
 
14
  > **SwiftVR** is the first generative video restoration model to reach **real-time 1080p streaming on a consumer-grade GPU** (≈26 FPS on a single RTX 5090), sustains **31 FPS at QHD (2560×1440)** and **14 FPS at 4K (3840×2160)** on a single H100, and streams at resolutions where every compared diffusion-based VR baseline runs out of memory.
15
 
16
  <p>
17
  <a href="https://arxiv.org/abs/2606.09516"><img src="https://img.shields.io/badge/arXiv-2606.09516-b31b1b.svg?style=flat-square" alt="arXiv"></a>
18
- <a href="https://github.com/H-oliday/SwiftVR/"><img src="https://img.shields.io/badge/Project-Page-1f8acb.svg?style=flat-square" alt="Project Page"></a>
19
  <a href="https://github.com/H-oliday/SwiftVR">
20
  <img src="https://img.shields.io/badge/GitHub-Code-181717.svg?style=flat-square&logo=github" alt="GitHub">
21
  </a>
22
  <a href="https://github.com/H-oliday/SwiftVR/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg?style=flat-square" alt="License"></a>
23
  </p>
24
 
25
-
26
 
27
  ## Updates
28
 
29
  - [2026/06] Release the inference code and pretrained weights 🎉
30
 
31
-
32
-
33
-
34
-
35
  ## ✨ Highlights
36
 
37
  - **Mask-free shifted-window self-attention (MFSWA).** Each spatial window is **pre-gathered into a dense tensor**, so every attention call reduces to a single standard scaled-dot-product (SDPA) call; *no attention mask, cyclic shift, or padding ever enters the graph*. This gives a **1.62× throughput gain over its full-attention teacher** at essentially identical quality, with **no dedicated sparse kernel**.
38
  - **Restoration-aware Autoencoder (ReAE).** A lightweight encoder–decoder jointly fine-tuned with the DiT in pixel space removes the heavy-3D-VAE / tiled-decoding bottleneck.
39
  - **Causal chunk-wise streaming.** A minimal causal protocol (no rolling KV cache, no overlapped DiT inference) bounds the temporal axis, confining the residual \(\mathcal{O}(N^2)\) cost to the spatial axes.
40
 
41
-
42
  ## 📊 Results
43
 
44
  ### Efficiency at 2560×1440 (single H100, causal streaming, 24 frames)
@@ -51,16 +43,10 @@ license: apache-2.0
51
 
52
  > At **3840×2160**, every compared diffusion-based VR baseline **OOMs** on a single H100; SwiftVR sustains **14 FPS**.
53
 
54
- ### Qualitative comparison
55
-
56
- <img src="assets/qualitative.png" width="100%" alt="SwiftVR teaser">
57
-
58
-
59
-
60
  ## 🛠 Installation
61
 
62
  ```bash
63
- git clone https://github.com/Holiday/SwiftVR.git
64
  cd SwiftVR
65
 
66
  conda create -n swiftvr python=3.10 -y
@@ -73,38 +59,6 @@ pip install torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytor
73
  pip install -e .
74
  ```
75
 
76
- <details>
77
- <summary><b>Hardware notes</b></summary>
78
-
79
- - **Server:** single H100-80G reproduces the QHD/4K numbers above.
80
- - **Consumer:** single RTX 5090 reaches ≈26 FPS at 1080p with the *same checkpoint* (default PyTorch SDPA path, bfloat16, causal chunk protocol).
81
- - No hardware-specific retraining or kernel rewrite is required on any platform.
82
- </details>
83
-
84
-
85
-
86
- ## 🗂 Model Zoo
87
-
88
- | Model Name | Date | Backbone | Link |
89
- |---|---|---|---|
90
- | SwiftVR | 2026.06 | Wan2.2-TI2V-5B | [🤗 HuggingFace](https://huggingface.co/H-oliday/SwiftVR) |
91
-
92
- ```bash
93
- huggingface-cli download H-oliday/SwiftVR --local-dir checkpoints/
94
- ```
95
-
96
- Expected checkpoint layout (the directory passed to `from_pretrained`):
97
-
98
- ```
99
- checkpoints/
100
- ├── reae.safetensors             # Restoration-aware Autoencoder weights
101
- ├── prompt_embedding.safetensors # precomputed empty-prompt text embedding (key: "prompt_emb")
102
- └── transformer/                 # diffusers-format DiT
103
-     ├── config.json
104
-     └── diffusion_pytorch_model.safetensors
105
- ```
106
-
107
-
108
  ## 🚀 Quick Start
109
 
110
  ### Python API
@@ -112,25 +66,12 @@ checkpoints/
112
  ```python
113
  from swiftvr import SwiftVRPipeline
114
 
115
- pipe = SwiftVRPipeline.from_pretrained("checkpoints/").to("cuda", dtype="bfloat16")
116
 
117
  pipe.restore_video("low_quality.mp4", "restored.mp4", upscale=4)
118
  ```
119
 
120
- `restore_video` also accepts an image folder as input and can write a PNG sequence with `png_save=True`.
121
-
122
- Tunable knobs include:
123
-
124
- * `clip_len`: middle chunk size, multiple of 4
125
- * `dit_overlap`: overlap for DiT inference
126
- * `fps`: output video frame rate
127
- * `quality`: 0–100, mapped to x265 CRF
128
- * `queue_size`: pipeline queue size
129
-
130
-
131
- ### Streaming (causal, chunk by chunk, no future frames)
132
-
133
- Causal, chunk-by-chunk restoration without future frames.
134
 
135
  ```python
136
  session = pipe.stream(clip_len=24, resolution=(1920, 1080))
@@ -149,52 +90,27 @@ tail = session.flush() # flush the final buffered frames
149
  python scripts/inference.py \
150
  --input low_quality.mp4 \
151
  --output restored.mp4 \
152
- --checkpoint checkpoints/ \
153
  --upscale 4 \
154
  --clip-len 24 \
155
- --dtype bfloat16 \
156
  ```
157
 
158
- Use `--png` to write a PNG sequence.
159
-
160
-
161
- ## 🎬 More Visual Results
162
-
163
- > Full-length restored clips (low-quality input → SwiftVR, played back to back).
164
-
165
 
166
  <video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_1.mp4" controls width="100%"></video>
167
 
168
- <video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_2.mp4" controls width="100%"></video>
169
-
170
- <video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_3.mp4" controls width="100%"></video>
171
-
172
-
173
-
174
-
175
  ## πŸ™ Acknowledgements
176
 
177
- SwiftVR builds on [Wan2.2-TI2V-5B](https://github.com/Wan-Video), the lightweight autoencoder [TAEHV](https://github.com/madebyollin/taehv), and the [RealBasicVSR](https://github.com/ckkelvinchan/RealBasicVSR) degradation pipeline. We thank the authors of [DOVE](https://github.com/zhengchen1999/DOVE), [SeedVR2](https://github.com/ByteDance-Seed/SeedVR), and [FlashVSR](https://github.com/OpenImagingLab/FlashVSR) for releasing strong baselines, and the [UltraVideo](https://github.com/Tele-AI/UltraVideo) team for the training corpus.
178
-
179
-
180
-
181
- ## 📜 License
182
-
183
- SwiftVR is released under the **Apache License 2.0**.
184
-
185
- Copyright 2026 SwiftVR Authors.
186
-
187
- Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at:
188
-
189
- https://www.apache.org/licenses/LICENSE-2.0
190
-
191
- Unless required by applicable law or agreed to in writing, this project is distributed on an **"AS IS" BASIS**, without warranties or conditions of any kind, either express or implied. See the [LICENSE](./LICENSE) file for the full license text.
192
-
193
-
194
-
195
- ## Contact
196
-
197
- If you have any questions, feel free to reach out:
198
 
199
- * Email: [kakibluee@gmail.com](mailto:kakibluee@gmail.com)
200
 
 
 
 
 
 
 
 
 
 
1
  ---
2
  license: apache-2.0
3
+ pipeline_tag: image-to-image
4
+ library_name: diffusers
5
  ---
6
 
7
  <h1 align="center">SwiftVR: Real-Time One-Step Generative Video Restoration</h1>
8
 
9
+ <p align="center"><img src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/teaser.avif" width="100%" alt="SwiftVR teaser"></p>
 
 
 
 
10
 
11
  > **SwiftVR** is the first generative video restoration model to reach **real-time 1080p streaming on a consumer-grade GPU** (≈26 FPS on a single RTX 5090), sustains **31 FPS at QHD (2560×1440)** and **14 FPS at 4K (3840×2160)** on a single H100, and streams at resolutions where every compared diffusion-based VR baseline runs out of memory.
12
 
13
  <p>
14
  <a href="https://arxiv.org/abs/2606.09516"><img src="https://img.shields.io/badge/arXiv-2606.09516-b31b1b.svg?style=flat-square" alt="arXiv"></a>
15
+ <a href="https://h-oliday.github.io/SwiftVR"><img src="https://img.shields.io/badge/Project-Page-1f8acb.svg?style=flat-square" alt="Project Page"></a>
16
  <a href="https://github.com/H-oliday/SwiftVR">
17
  <img src="https://img.shields.io/badge/GitHub-Code-181717.svg?style=flat-square&logo=github" alt="GitHub">
18
  </a>
19
  <a href="https://github.com/H-oliday/SwiftVR/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg?style=flat-square" alt="License"></a>
20
  </p>
21
 
22
+ SwiftVR is a streaming one-step generative video restoration (VR) framework presented in [SwiftVR: Real-Time One-Step Generative Video Restoration](https://arxiv.org/abs/2606.09516).
23
 
24
  ## Updates
25
 
26
  - [2026/06] Release the inference code and pretrained weights 🎉
27
 
 
 
 
 
28
  ## ✨ Highlights
29
 
30
  - **Mask-free shifted-window self-attention (MFSWA).** Each spatial window is **pre-gathered into a dense tensor**, so every attention call reduces to a single standard scaled-dot-product (SDPA) call; *no attention mask, cyclic shift, or padding ever enters the graph*. This gives a **1.62× throughput gain over its full-attention teacher** at essentially identical quality, with **no dedicated sparse kernel**.
31
  - **Restoration-aware Autoencoder (ReAE).** A lightweight encoder–decoder jointly fine-tuned with the DiT in pixel space removes the heavy-3D-VAE / tiled-decoding bottleneck.
32
  - **Causal chunk-wise streaming.** A minimal causal protocol (no rolling KV cache, no overlapped DiT inference) bounds the temporal axis, confining the residual \(\mathcal{O}(N^2)\) cost to the spatial axes.
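
To make the "pre-gathered into a dense tensor" idea from the MFSWA bullet concrete, here is a minimal pure-Python sketch of plain (unshifted) window gathering followed by unmasked scaled dot-product attention. It is not the authors' implementation: the names `gather_windows` and `sdpa` are hypothetical, queries, keys, and values are the raw tokens with no learned projections, and how the shifted variant treats border windows is not covered here.

```python
import math

def gather_windows(grid, win):
    """Split an H x W grid of token vectors into dense windows of win*win tokens."""
    h, w = len(grid), len(grid[0])
    assert h % win == 0 and w % win == 0, "sketch assumes the grid divides evenly"
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append([grid[top + dy][left + dx]
                            for dy in range(win) for dx in range(win)])
    return windows

def sdpa(q, k, v):
    """Scaled dot-product attention over one dense window; no mask is needed."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(wt * vj[d] for wt, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out

# Toy 4x4 grid of 2-dim tokens, 2x2 windows: 4 windows of 4 tokens each.
grid = [[[float(y), float(x)] for x in range(4)] for y in range(4)]
windows = gather_windows(grid, win=2)
outputs = [sdpa(tokens, tokens, tokens) for tokens in windows]
```

Because every window is already a dense block of tokens, each attention call sees only the tokens it should attend to, which is why no mask has to be built in this toy version.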
33
 
 
34
  ## 📊 Results
35
 
36
  ### Efficiency at 2560×1440 (single H100, causal streaming, 24 frames)
 
43
 
44
  > At **3840×2160**, every compared diffusion-based VR baseline **OOMs** on a single H100; SwiftVR sustains **14 FPS**.
45
 
 
 
 
 
 
 
46
  ## 🛠 Installation
47
 
48
  ```bash
49
+ git clone https://github.com/H-oliday/SwiftVR.git
50
  cd SwiftVR
51
 
52
  conda create -n swiftvr python=3.10 -y
 
59
  pip install -e .
60
  ```
61
 
62
  ## 🚀 Quick Start
63
 
64
  ### Python API
 
66
  ```python
67
  from swiftvr import SwiftVRPipeline
68
 
69
+ pipe = SwiftVRPipeline.from_pretrained("H-oliday/SwiftVR").to("cuda", dtype="bfloat16")
70
 
71
  pipe.restore_video("low_quality.mp4", "restored.mp4", upscale=4)
72
  ```
73
 
74
+ ### Streaming (causal, chunk by chunk)
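
As a rough mental model of chunk-by-chunk causal processing, the sketch below buffers incoming frames into fixed-size chunks and hands over a chunk only once it is complete, so no future frames are ever consulted. The `ChunkBuffer` class is hypothetical and is not part of the `swiftvr` package; it only mirrors the `clip_len` and `flush()` names that appear in the README snippets.

```python
class ChunkBuffer:
    """Hypothetical helper: groups incoming frames into fixed-size chunks."""

    def __init__(self, clip_len):
        if clip_len <= 0:
            raise ValueError("clip_len must be positive")
        self.clip_len = clip_len
        self._pending = []

    def push(self, frame):
        """Add one frame; return a full chunk once clip_len frames are buffered."""
        self._pending.append(frame)
        if len(self._pending) == self.clip_len:
            chunk, self._pending = self._pending, []
            return chunk
        return None

    def flush(self):
        """Return the leftover frames at end of stream (may be shorter than clip_len)."""
        tail, self._pending = self._pending, []
        return tail

buf = ChunkBuffer(clip_len=24)
chunks = []
for frame_idx in range(60):  # integers stand in for decoded frames
    chunk = buf.push(frame_idx)
    if chunk is not None:
        chunks.append(chunk)  # a real pipeline would restore the chunk here
tail = buf.flush()
print([len(c) for c in chunks], len(tail))  # [24, 24] 12
```

With 60 input frames and `clip_len=24`, two full chunks are emitted during the stream and the remaining 12 frames come out of `flush()`.
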
 
 
 
 
 
 
 
 
 
 
 
 
 
75
 
76
  ```python
77
  session = pipe.stream(clip_len=24, resolution=(1920, 1080))
 
90
  python scripts/inference.py \
91
  --input low_quality.mp4 \
92
  --output restored.mp4 \
93
+ --checkpoint H-oliday/SwiftVR \
94
  --upscale 4 \
95
  --clip-len 24 \
96
+ --dtype bfloat16
97
  ```
98
 
99
+ ## 🎬 Visual Results
 
 
 
 
 
 
100
 
101
  <video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_1.mp4" controls width="100%"></video>
102
 
 
 
 
 
 
 
 
103
  ## πŸ™ Acknowledgements
104
 
105
+ SwiftVR builds on [Wan2.2-TI2V-5B](https://github.com/Wan-Video), the lightweight autoencoder [TAEHV](https://github.com/madebyollin/taehv), and the [RealBasicVSR](https://github.com/ckkelvinchan/RealBasicVSR) degradation pipeline. We thank the authors of [DOVE](https://github.com/zhengchen1999/DOVE), [SeedVR2](https://github.com/ByteDance-Seed/SeedVR), and [FlashVSR](https://github.com/OpenImagingLab/FlashVSR) for releasing strong baselines.
106
 
107
+ ## 📜 Citation
108
 
109
+ ```bibtex
110
+ @article{yan2026swiftvr,
111
+ title={SwiftVR: Real-Time One-Step Generative Video Restoration},
112
+ author={Yan, Jiaqi and Chen, Xiangyu and Zhong, Xinlin and Huang, Haibin and Zhang, Chi and Liu, Jie and Zhou, Jiantao and Li, Xuelong},
113
+ journal={arXiv preprint arXiv:2606.09516},
114
+ year={2026}
115
+ }
116
+ ```