Diffusers
Safetensors
SwiftVR / README.md
H-oliday's picture
Update README.md
ff5ca16 verified
---
license: apache-2.0
---
<h1 align="center">SwiftVR: Real-Time One-Step Generative Video Restoration</h1>
<p align="center"><img src="assets/teaser.avif" width="100%" alt="SwiftVR teaser"></p>
> **SwiftVR** is the first generative video restoration model to reach **real-time 1080p streaming on a consumer-grade GPU** (β‰ˆ26 FPS on a single RTX 5090), sustains **31 FPS at QHD (2560Γ—1440)** and **14 FPS at 4K (3840Γ—2160)** on a single H100, and streams at resolutions where every compared diffusion-based VR baseline runs out of memory.
<p>
<a href="https://arxiv.org/abs/2606.09516"><img src="https://img.shields.io/badge/arXiv-2606.09516-b31b1b.svg?style=flat-square" alt="arXiv"></a>
<a href="https://github.com/H-oliday/SwiftVR/"><img src="https://img.shields.io/badge/Project-Page-1f8acb.svg?style=flat-square" alt="Project Page"></a>
<a href="https://github.com/H-oliday/SwiftVR">
<img src="https://img.shields.io/badge/GitHub-Code-181717.svg?style=flat-square&logo=github" alt="GitHub">
</a>
<a href="https://github.com/H-oliday/SwiftVR/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg?style=flat-square" alt="License"></a>
</p>
## Updates
- [2026/06] Release the inference code and pretrained weights πŸŽ‰
## ✨ Highlights
- **Mask-free shifted-window self-attention (MFSWA).** Each spatial window is **pre-gathered into a dense tensor**, so every attention call reduces to a single standard scaled-dot-product (SDPA) call β€” *no attention mask, cyclic shift, or padding ever enters the graph*. This gives a **1.62Γ— throughput gain over its full-attention teacher** at essentially identical quality, with **no dedicated sparse kernel**.
- **Restoration-aware Autoencoder (ReAE).** A lightweight encoder–decoder jointly fine-tuned with the DiT in pixel space removes the heavy-3D-VAE / tiled-decoding bottleneck.
- **Causal chunk-wise streaming.** A minimal causal protocol (no rolling KV cache, no overlapped DiT inference) bounds the temporal axis, confining the residual \(\mathcal{O}(N^2)\) cost to the spatial axes.
## πŸ“Š Results
### Efficiency at 2560Γ—1440 (single H100, causal streaming, 24 frames)
| Metric | DOVE (tile) | SeedVR2-3B (tile)| FlashVSR-Tiny | **SwiftVR (Ours)** |
|---|:---:|:---:|:---:|:---:|
| Avg. Time (s) ↓ | 27.615 | 17.320 | 2.493 | 0.766 |
| FPS ↑ | 0.85 | 1.39 | 9.61 | 31.32 |
| Peak Mem. (GB) ↓ | 59.24 | 35.35 | 34.35 | 38.01 |
> At **3840Γ—2160**, every compared diffusion-based VR baseline **OOMs** on a single H100; SwiftVR sustains **14 FPS**.
### Qualitative comparison
<img src="assets/qualitative.png" width="100%" alt="SwiftVR teaser">
## πŸ›  Installation
```bash
git clone https://github.com/Holiday/SwiftVR.git
cd SwiftVR
conda create -n swiftvr python=3.10 -y
conda activate swiftvr
# Install PyTorch matching your CUDA toolkit first, e.g. CUDA 12.4:
pip install torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu124
# Install SwiftVR (editable) and its dependencies:
pip install -e .
```
<details>
<summary><b>Hardware notes</b></summary>
- **Server:** single H100-80G reproduces the QHD/4K numbers above.
- **Consumer:** single RTX 5090 reaches β‰ˆ26 FPS at 1080p with the *same checkpoint* (default PyTorch SDPA path, bfloat16, causal chunk protocol).
- No hardware-specific retraining or kernel rewrite is required on any platform.
</details>
## πŸ—‚ Model Zoo
| Model Name | Date | Backbone | Link |
|---|---|---|---|
| SwiftVR | 2026.06 | Wan2.2-TI2V-5B | [πŸ€— HuggingFace](https://huggingface.co/H-oliday/SwiftVR) |
```bash
huggingface-cli download H-oliday/SwiftVR --local-dir checkpoints/
```
Expected checkpoint layout (the directory passed to `from_pretrained`):
```
checkpoints/
β”œβ”€β”€ reae.safetensors # Restoration-aware Autoencoder weights
β”œβ”€β”€ prompt_embedding.safetensors# precomputed empty-prompt text embedding (key: "prompt_emb")
└── transformer/ # diffusers-format DiT
β”œβ”€β”€ config.json
└── diffusion_pytorch_model.safetensors
```
## πŸš€ Quick Start
### Python API
```python
from swiftvr import SwiftVRPipeline
pipe = SwiftVRPipeline.from_pretrained("checkpoints/").to("cuda", dtype="bfloat16")
pipe.restore_video("low_quality.mp4", "restored.mp4", upscale=4)
```
`restore_video` also accepts an image folder as input and can write a PNG sequence with `png_save=True`.
Tunable knobs include:
* `clip_len`: middle chunk size, multiple of 4
* `dit_overlap`: overlap for DiT inference
* `fps`: output video frame rate
* `quality`: 0–100, mapped to x265 CRF
* `queue_size`: pipeline queue size
### Streaming (causal, chunk by chunk, no future frames)
Causal, chunk-by-chunk restoration without future frames.
```python
session = pipe.stream(clip_len=24, resolution=(1920, 1080))
for lq_chunk in read_chunks("low_quality.mp4", n=24): # lq_chunk: [T, H, W, 3] uint8
hq = session.step(lq_chunk) # [1, T', 3, H', W'] in [0, 1], or None if buffered
if hq is not None:
write(hq)
tail = session.flush() # flush the final buffered frames
```
### Command line
```bash
python scripts/inference.py \
--input low_quality.mp4 \
--output restored.mp4 \
--checkpoint checkpoints/ \
--upscale 4 \
--clip-len 24 \
--dtype bfloat16 \
```
Use `--png` to write a PNG sequence.
## 🎬 More Visual Results
> Full-length restored clips (low-quality input β†’ SwiftVR, played back to back).
<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_1.mp4" controls width="100%"></video>
<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_2.mp4" controls width="100%"></video>
<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_3.mp4" controls width="100%"></video>
## πŸ™ Acknowledgements
SwiftVR builds on [Wan2.2-TI2V-5B](https://github.com/Wan-Video), the lightweight autoencoder [TAEHV](https://github.com/madebyollin/taehv), and the [RealBasicVSR](https://github.com/ckkelvinchan/RealBasicVSR) degradation pipeline. We thank the authors of [DOVE](https://github.com/zhengchen1999/DOVE), [SeedVR2](https://github.com/ByteDance-Seed/SeedVR), and [FlashVSR](https://github.com/OpenImagingLab/FlashVSR) for releasing strong baselines, and the [UltraVideo](https://github.com/Tele-AI/UltraVideo) team for the training corpus.
## πŸ“œ License
SwiftVR is released under the **Apache License 2.0**.
Copyright 2026 SwiftVR Authors.
Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at:
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, this project is distributed on an **"AS IS" BASIS**, without warranties or conditions of any kind, either express or implied. See the [LICENSE](./LICENSE) file for the full license text.
## Contact
If you have any questions, feel free to reach out:
* Email: [kakibluee@gmail.com](mailto:kakibluee@gmail.com)