H-oliday
/

SwiftVR

@@ -1,44 +1,36 @@
 ---
 license: apache-2.0
 ---
 <h1 align="center">SwiftVR: Real-Time One-Step Generative Video Restoration</h1>
-<p align="center"><img src="assets/teaser.avif" width="100%" alt="SwiftVR teaser"></p>
 > **SwiftVR** is the first generative video restoration model to reach **real-time 1080p streaming on a consumer-grade GPU** (≈26 FPS on a single RTX 5090), sustains **31 FPS at QHD (2560×1440)** and **14 FPS at 4K (3840×2160)** on a single H100, and streams at resolutions where every compared diffusion-based VR baseline runs out of memory.
 <p>
   <a href="https://arxiv.org/abs/2606.09516"><img src="https://img.shields.io/badge/arXiv-2606.09516-b31b1b.svg?style=flat-square" alt="arXiv"></a>
-  <a href="https://github.com/H-oliday/SwiftVR/"><img src="https://img.shields.io/badge/Project-Page-1f8acb.svg?style=flat-square" alt="Project Page"></a>
   <a href="https://github.com/H-oliday/SwiftVR">
   <img src="https://img.shields.io/badge/GitHub-Code-181717.svg?style=flat-square&logo=github" alt="GitHub">
 </a>
   <a href="https://github.com/H-oliday/SwiftVR/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg?style=flat-square" alt="License"></a>
 </p>
 ## Updates
 - [2026/06] Release the inference code and pretrained weights 🎉
 ## ✨ Highlights
 - **Mask-free shifted-window self-attention (MFSWA).** Each spatial window is **pre-gathered into a dense tensor**, so every attention call reduces to a single standard scaled-dot-product (SDPA) call — *no attention mask, cyclic shift, or padding ever enters the graph*. This gives a **1.62× throughput gain over its full-attention teacher** at essentially identical quality, with **no dedicated sparse kernel**.
 - **Restoration-aware Autoencoder (ReAE).** A lightweight encoder–decoder jointly fine-tuned with the DiT in pixel space removes the heavy-3D-VAE / tiled-decoding bottleneck.
 - **Causal chunk-wise streaming.** A minimal causal protocol (no rolling KV cache, no overlapped DiT inference) bounds the temporal axis, confining the residual \(\mathcal{O}(N^2)\) cost to the spatial axes.
 ## 📊 Results
 ### Efficiency at 2560×1440 (single H100, causal streaming, 24 frames)
@@ -51,16 +43,10 @@ license: apache-2.0
 > At **3840×2160**, every compared diffusion-based VR baseline **OOMs** on a single H100; SwiftVR sustains **14 FPS**.
-### Qualitative comparison
-<img src="assets/qualitative.png" width="100%" alt="SwiftVR teaser">
 ## 🛠 Installation
 ```bash
-git clone https://github.com/Holiday/SwiftVR.git
 cd SwiftVR
 conda create -n swiftvr python=3.10 -y
@@ -73,38 +59,6 @@ pip install torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytor
 pip install -e .
 ```
-<details>
-<summary><b>Hardware notes</b></summary>
-- **Server:** single H100-80G reproduces the QHD/4K numbers above.
-- **Consumer:** single RTX 5090 reaches ≈26 FPS at 1080p with the *same checkpoint* (default PyTorch SDPA path, bfloat16, causal chunk protocol).
-- No hardware-specific retraining or kernel rewrite is required on any platform.
-</details>
-## 🗂 Model Zoo
-| Model Name | Date | Backbone | Link |
-|---|---|---|---|
-| SwiftVR | 2026.06 | Wan2.2-TI2V-5B | [🤗 HuggingFace](https://huggingface.co/H-oliday/SwiftVR) |
-```bash
-huggingface-cli download H-oliday/SwiftVR --local-dir checkpoints/
-```
-Expected checkpoint layout (the directory passed to `from_pretrained`):
-```
-checkpoints/
-├── reae.safetensors            # Restoration-aware Autoencoder weights
-├── prompt_embedding.safetensors# precomputed empty-prompt text embedding (key: "prompt_emb")
-└── transformer/                # diffusers-format DiT
-    ├── config.json
-    └── diffusion_pytorch_model.safetensors
-```
 ## 🚀 Quick Start
 ### Python API
@@ -112,25 +66,12 @@ checkpoints/
 ```python
 from swiftvr import SwiftVRPipeline
-pipe = SwiftVRPipeline.from_pretrained("checkpoints/").to("cuda", dtype="bfloat16")
 pipe.restore_video("low_quality.mp4", "restored.mp4", upscale=4)
 ```
-`restore_video` also accepts an image folder as input and can write a PNG sequence with `png_save=True`.
-Tunable knobs include:
-* `clip_len`: middle chunk size, multiple of 4
-* `dit_overlap`: overlap for DiT inference
-* `fps`: output video frame rate
-* `quality`: 0–100, mapped to x265 CRF
-* `queue_size`: pipeline queue size
-### Streaming (causal, chunk by chunk, no future frames)
-Causal, chunk-by-chunk restoration without future frames.
 ```python
 session = pipe.stream(clip_len=24, resolution=(1920, 1080))
@@ -149,52 +90,27 @@ tail = session.flush()                 # flush the final buffered frames
 python scripts/inference.py \
   --input low_quality.mp4 \
   --output restored.mp4 \
-  --checkpoint checkpoints/ \
   --upscale 4 \
   --clip-len 24 \
-  --dtype bfloat16 \
 ```
-Use `--png` to write a PNG sequence.
-## 🎬 More Visual Results
-> Full-length restored clips (low-quality input → SwiftVR, played back to back).
 <video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_1.mp4" controls width="100%"></video>
-<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_2.mp4" controls width="100%"></video>
-<video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_3.mp4" controls width="100%"></video>
 ## 🙏 Acknowledgements
-SwiftVR builds on [Wan2.2-TI2V-5B](https://github.com/Wan-Video), the lightweight autoencoder [TAEHV](https://github.com/madebyollin/taehv), and the [RealBasicVSR](https://github.com/ckkelvinchan/RealBasicVSR) degradation pipeline. We thank the authors of [DOVE](https://github.com/zhengchen1999/DOVE), [SeedVR2](https://github.com/ByteDance-Seed/SeedVR), and [FlashVSR](https://github.com/OpenImagingLab/FlashVSR) for releasing strong baselines, and the [UltraVideo](https://github.com/Tele-AI/UltraVideo) team for the training corpus.
-## 📜 License
-SwiftVR is released under the **Apache License 2.0**.
-Copyright 2026 SwiftVR Authors.
-Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at:
-https://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, this project is distributed on an **"AS IS" BASIS**, without warranties or conditions of any kind, either express or implied. See the [LICENSE](./LICENSE) file for the full license text.
-## Contact
-If you have any questions, feel free to reach out:
-* Email: [kakibluee@gmail.com](mailto:kakibluee@gmail.com)

 ---
 license: apache-2.0
+pipeline_tag: image-to-image
+library_name: diffusers
 ---
 <h1 align="center">SwiftVR: Real-Time One-Step Generative Video Restoration</h1>
+<p align="center"><img src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/teaser.avif" width="100%" alt="SwiftVR teaser"></p>
 > **SwiftVR** is the first generative video restoration model to reach **real-time 1080p streaming on a consumer-grade GPU** (≈26 FPS on a single RTX 5090), sustains **31 FPS at QHD (2560×1440)** and **14 FPS at 4K (3840×2160)** on a single H100, and streams at resolutions where every compared diffusion-based VR baseline runs out of memory.
 <p>
   <a href="https://arxiv.org/abs/2606.09516"><img src="https://img.shields.io/badge/arXiv-2606.09516-b31b1b.svg?style=flat-square" alt="arXiv"></a>
+  <a href="https://h-oliday.github.io/SwiftVR"><img src="https://img.shields.io/badge/Project-Page-1f8acb.svg?style=flat-square" alt="Project Page"></a>
   <a href="https://github.com/H-oliday/SwiftVR">
   <img src="https://img.shields.io/badge/GitHub-Code-181717.svg?style=flat-square&logo=github" alt="GitHub">
 </a>
   <a href="https://github.com/H-oliday/SwiftVR/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg?style=flat-square" alt="License"></a>
 </p>
+SwiftVR is a streaming one-step generative video restoration (VR) framework presented in [SwiftVR: Real-Time One-Step Generative Video Restoration](https://arxiv.org/abs/2606.09516).
 ## Updates
 - [2026/06] Release the inference code and pretrained weights 🎉
 ## ✨ Highlights
 - **Mask-free shifted-window self-attention (MFSWA).** Each spatial window is **pre-gathered into a dense tensor**, so every attention call reduces to a single standard scaled-dot-product (SDPA) call — *no attention mask, cyclic shift, or padding ever enters the graph*. This gives a **1.62× throughput gain over its full-attention teacher** at essentially identical quality, with **no dedicated sparse kernel**.
 - **Restoration-aware Autoencoder (ReAE).** A lightweight encoder–decoder jointly fine-tuned with the DiT in pixel space removes the heavy-3D-VAE / tiled-decoding bottleneck.
 - **Causal chunk-wise streaming.** A minimal causal protocol (no rolling KV cache, no overlapped DiT inference) bounds the temporal axis, confining the residual \(\mathcal{O}(N^2)\) cost to the spatial axes.
 ## 📊 Results
 ### Efficiency at 2560×1440 (single H100, causal streaming, 24 frames)
 > At **3840×2160**, every compared diffusion-based VR baseline **OOMs** on a single H100; SwiftVR sustains **14 FPS**.
 ## 🛠 Installation
 ```bash
+git clone https://github.com/H-oliday/SwiftVR.git
 cd SwiftVR
 conda create -n swiftvr python=3.10 -y
 pip install -e .
 ```
 ## 🚀 Quick Start
 ### Python API
 ```python
 from swiftvr import SwiftVRPipeline
+pipe = SwiftVRPipeline.from_pretrained("H-oliday/SwiftVR").to("cuda", dtype="bfloat16")
 pipe.restore_video("low_quality.mp4", "restored.mp4", upscale=4)
 ```
+### Streaming (causal, chunk by chunk)
 ```python
 session = pipe.stream(clip_len=24, resolution=(1920, 1080))
 python scripts/inference.py \
   --input low_quality.mp4 \
   --output restored.mp4 \
+  --checkpoint H-oliday/SwiftVR \
   --upscale 4 \
   --clip-len 24 \
+  --dtype bfloat16
 ```
+## 🎬 Visual Results
 <video src="https://huggingface.co/H-oliday/SwiftVR/resolve/main/assets/demo_1.mp4" controls width="100%"></video>
 ## 🙏 Acknowledgements
+SwiftVR builds on [Wan2.2-TI2V-5B](https://github.com/Wan-Video), the lightweight autoencoder [TAEHV](https://github.com/madebyollin/taehv), and the [RealBasicVSR](https://github.com/ckkelvinchan/RealBasicVSR) degradation pipeline. We thank the authors of [DOVE](https://github.com/zhengchen1999/DOVE), [SeedVR2](https://github.com/ByteDance-Seed/SeedVR), and [FlashVSR](https://github.com/OpenImagingLab/FlashVSR) for releasing strong baselines.
+## 📜 Citation
+```bibtex
+@article{yan2026swiftvr,
+  title={SwiftVR: Real-Time One-Step Generative Video Restoration},
+  author={Yan, Jiaqi and Chen, Xiangyu and Zhong, Xinlin and Huang, Haibin and Zhang, Chi and Liu, Jie and Zhou, Jiantao and Li, Xuelong},
+  journal={arXiv preprint arXiv:2606.09516},
+  year={2026}
+}
+```