Video-to-Video
Diffusers
Safetensors
H-oliday committed on
Commit 3921dda · verified · 1 Parent(s): 68cbc45

Update README.md

Files changed (1): README.md +227 -1
@@ -1,3 +1,229 @@
- license: apache-2.0
+ <div align="center">
+
+ # RVR: One-step Generative Streaming Real-time Video Restoration
+
+
+
+ <img src="assets/teaser.avif" width="100%" alt="RVR teaser">
+
+
+ </div>
+
+ > **RVR** is the first generative video restoration model to reach **real-time 1080p streaming on a consumer-grade GPU** (≈26 FPS on a single RTX 5090). On a single H100 it sustains **31 FPS at QHD (2560×1440)** and **14 FPS at 4K (3840×2160)**, and it streams at resolutions where every compared diffusion-based video restoration (VR) baseline runs out of memory.
+
+ <p>
+ <a href="https://arxiv.org/abs/XXXX.XXXXX"><img src="https://img.shields.io/badge/arXiv-XXXX.XXXXX-b31b1b.svg?style=flat-square" alt="arXiv"></a>
+ <a href="https://github.com/H-oliday/RVR/"><img src="https://img.shields.io/badge/Project-Page-1f8acb.svg?style=flat-square" alt="Project Page"></a>
+ <a href="https://huggingface.co/H-oliday/RVR"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-ffce00.svg?style=flat-square" alt="HuggingFace"></a>
+ <a href="https://github.com/H-oliday/RVR/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg?style=flat-square" alt="License"></a>
+
+ </p>
+
+
+
+ ## Updates
  ---
+ - [2026/06] Released the inference code and pretrained weights 🎉
+
+
+
+
+
+ ## ✨ Highlights
  ---
+
+ - **Mask-free shifted-window self-attention (MFSWA).** Each spatial window is **pre-gathered into a dense tensor**, so every attention call reduces to a single standard scaled-dot-product attention (SDPA) call: *no attention mask, cyclic shift, or padding ever enters the graph*. This gives a **1.62× throughput gain over its full-attention teacher** at essentially identical quality, with **no dedicated sparse kernel**.
+ - **Restoration-aware Autoencoder (ReAE).** A lightweight encoder–decoder, jointly fine-tuned with the DiT in pixel space, removes the heavy-3D-VAE / tiled-decoding bottleneck.
+ - **Causal chunk-wise streaming.** A minimal causal protocol (no rolling KV cache, no overlapped DiT inference) bounds the temporal axis, confining the residual \(\mathcal{O}(N^2)\) attention cost to the spatial axes.
+ - **Kernel-agnostic & portable.** The same checkpoint runs **bit-identically** across PyTorch SDPA, FlashAttention-2/3, SageAttention, and xFormers, with no retraining, weight conversion, or kernel rewrite.
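
The highlights do not spell out how windows are gathered, so the following is only one possible reading, sketched in plain Python and not taken from `transformer.py`. It assumes the window grid is offset by the shift, that border windows are simply smaller (nothing wraps around and nothing is padded), and that windows are bucketed by shape so each bucket can be stacked into one dense batch for a single unmasked SDPA call. The names `axis_segments` and `gather_windows` are hypothetical.

```python
from collections import defaultdict


def axis_segments(length, win, shift):
    """Split [0, length) into contiguous segments of at most `win` tokens.

    The window grid is offset by `shift`. Border segments are just shorter:
    no cyclic wrap-around and no padding (an assumption of this sketch).
    """
    shift %= win
    bounds = [0]
    pos = shift if shift > 0 else win
    while pos < length:
        bounds.append(pos)
        pos += win
    bounds.append(length)
    return list(zip(bounds, bounds[1:]))


def gather_windows(height, width, win, shift):
    """Map each window shape (h, w) to its windows.

    Every window is a list of flat, row-major token indices. Windows that
    share a shape can be stacked into one dense batch, so each bucket needs
    one ordinary attention call and no mask.
    """
    groups = defaultdict(list)
    for r0, r1 in axis_segments(height, win, shift):
        for c0, c1 in axis_segments(width, win, shift):
            idx = [r * width + c for r in range(r0, r1) for c in range(c0, c1)]
            groups[(r1 - r0, c1 - c0)].append(idx)
    return dict(groups)
```

With an 8×8 token grid, `gather_windows(8, 8, win=4, shift=2)` produces four shape buckets, while `shift=0` produces a single bucket of four 4×4 windows. In both cases every token lands in exactly one window.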
+
+
+ ## 📊 Results
+ ---
+
+ ### Efficiency at 2560×1440 (single H100, causal streaming, 24 frames)
+
+ | Metric | SeedVR2-3B (tile) | DOVE (tile) | FlashVSR-Tiny | **RVR (Ours)** |
+ |---|:---:|:---:|:---:|:---:|
+ | Avg. Time (s) ↓ | 17.320 | 27.615 | 2.493 | **0.766** |
+ | **FPS ↑** | 1.39 | 0.85 | 9.61 | **31.32** |
+ | Peak Mem. (GB) ↓ | 35.35 | 59.24 | 34.35 | 38.01 |
+
+ > At **3840×2160**, every compared diffusion-based VR baseline **runs out of memory (OOM)** on a single H100; RVR sustains **14 FPS**.
+
+ ### Qualitative comparison
+
+ <img src="assets/qualitative.png" width="100%" alt="RVR qualitative comparison">
+
+
+
+ ## 🛠 Installation
+ ---
+
+ ```bash
+ git clone https://github.com/H-oliday/RVR.git
+ cd RVR
+
+ conda create -n rvr python=3.10 -y
+ conda activate rvr
+
+ # Install PyTorch matching your CUDA toolkit first, e.g. CUDA 12.8:
+ pip install torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu128
+
+ # Install RVR (editable) and its dependencies:
+ pip install -e .
+ ```
+
+ <details>
+ <summary><b>Hardware notes</b></summary>
+
+ - **Server:** a single H100 (80 GB) reproduces the QHD/4K numbers above.
+ - **Consumer:** a single RTX 5090 reaches ≈26 FPS at 1080p with the *same checkpoint* (default PyTorch SDPA path, bfloat16, causal chunk protocol).
+ - No hardware-specific retraining or kernel rewrite is required on any platform.
+ </details>
+
+
+
+ ## 🗂 Model Zoo
+ ---
+
+ | Model Name | Date | Backbone | Link |
+ |---|---|---|---|
+ | RVR | 2026.06 | Wan2.2-TI2V-5B | [🤗 HuggingFace](https://huggingface.co/H-oliday/RVR) |
+
+ ```bash
+ huggingface-cli download H-oliday/RVR --local-dir checkpoints/
+ ```
+
+ Expected checkpoint layout (the directory passed to `from_pretrained`):
+
+ ```
+ checkpoints/
+ ├── reae.safetensors              # Restoration-aware Autoencoder weights
+ ├── prompt_embedding.safetensors  # precomputed empty-prompt text embedding (key: "prompt_emb")
+ └── transformer/                  # diffusers-format DiT
+     ├── config.json
+     └── diffusion_pytorch_model.safetensors
+ ```
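
Before loading, it can help to confirm that a download matches the layout above. The check below is a small standard-library sketch, not part of the RVR package: the four paths are copied from the listing, and `EXPECTED_FILES` and `missing_checkpoint_files` are hypothetical names.

```python
from pathlib import Path

# Relative paths taken from the expected checkpoint layout.
EXPECTED_FILES = (
    "reae.safetensors",
    "prompt_embedding.safetensors",
    "transformer/config.json",
    "transformer/diffusion_pytorch_model.safetensors",
)


def missing_checkpoint_files(root):
    """Return the expected relative paths that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files("checkpoints/")
    print("layout OK" if not missing else f"missing: {missing}")
```

An empty list means all four files are present. The check only looks at file names and does not validate the tensors inside them.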
+
+
+ ## 🚀 Quick Start
+ ---
+
+ ### Python API
+
+ ```python
+ from rvr import RVRPipeline
+
+ pipe = RVRPipeline.from_pretrained("checkpoints/").to("cuda", dtype="bfloat16")
+
+ pipe.restore_video("low_quality.mp4", "restored.mp4", upscale=4)
+ ```
+
+ `restore_video` also accepts an image folder as input and can write a PNG sequence
+ (`png_save=True`). Tunable knobs: `clip_len` (size of each middle chunk, a multiple of 4),
+ `dit_overlap`, `fps`, `quality` (0–100, mapped to x265 CRF), and `queue_size`.
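
The README says `quality` is mapped to an x265 CRF value but does not give the mapping. Purely as an illustration, and not RVR's actual formula, a linear map onto x265's 0 to 51 CRF range could look like the sketch below, where higher quality means lower CRF. The function name `quality_to_crf` and its default bounds are assumptions.

```python
def quality_to_crf(quality, crf_best=0, crf_worst=51):
    """Linearly map quality in [0, 100] to a CRF value (hypothetical mapping).

    quality=100 gives `crf_best` (least compression loss),
    quality=0 gives `crf_worst` (most compression loss).
    """
    if not 0 <= quality <= 100:
        raise ValueError("quality must be in [0, 100]")
    return round(crf_worst - (crf_worst - crf_best) * quality / 100)
```

Whatever the real mapping is, it should share the two properties this sketch has: it is monotonic, and it stays inside the encoder's valid CRF range.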
+
+ ### Streaming (causal, chunk by chunk, no future frames)
+
+ ```python
+ # `read_chunks` and `write` are not defined in this snippet; they stand for your own frame I/O.
+ session = pipe.stream(clip_len=24, resolution=(1920, 1080))
+
+ for lq_chunk in read_chunks("low_quality.mp4", n=24):  # lq_chunk: [T, H, W, 3] uint8
+     hq = session.step(lq_chunk)  # [1, T', 3, H', W'] in [0, 1], or None if buffered
+     if hq is not None:
+         write(hq)
+
+ tail = session.flush()  # flush the final buffered frames
+ ```
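
The streaming loop consumes frames in fixed-size groups, but the README does not define `read_chunks`. A minimal stand-in for the grouping step is sketched below, assuming frames arrive from any iterator (video decoding is left to whatever reader you already use). `chunked` is a hypothetical name. A real helper would presumably also stack each group into a `[T, H, W, 3]` array, which is omitted here.

```python
from itertools import islice


def chunked(frames, n):
    """Yield consecutive lists of up to `n` frames; the last list may be shorter."""
    if n <= 0:
        raise ValueError("n must be positive")
    it = iter(frames)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk
```

Because it pulls frames lazily, the generator never holds more than one chunk in memory, which matches the causal, no-future-frames setting.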
+
+ ### Command line
+
+ ```bash
+ python scripts/inference.py \
+     --input low_quality.mp4 \
+     --output restored.mp4 \
+     --checkpoint checkpoints/ \
+     --upscale 4 \
+     --clip-len 24 \
+     --dtype bfloat16
+ ```
+
+ `--upscale` and `--resolution` are alternatives (the command above uses `--upscale 4`); add `--png` to write a PNG sequence.
+
+
+
+
+ ## 📁 Repository Structure
+ ---
+
+ ```
+ RVR/
+ ├── README.md
+ ├── requirements.txt
+ ├── setup.py
+ ├── scripts/
+ │   └── inference.py        # CLI entry point (thin wrapper over RVRPipeline)
+ └── rvr/
+     ├── __init__.py         # exports RVRPipeline
+     ├── pipeline.py         # RVRPipeline: from_pretrained / to / restore_video / stream
+     ├── runner.py           # four-stage pipelined runner (reader → H2D → GPU → writer)
+     ├── io.py               # frame reading, GPU preprocessing, mp4 / PNG writing
+     ├── models/
+     │   ├── reae.py         # ★ Restoration-aware Autoencoder
+     │   └── transformer.py  # ★ DiT + mask-free shifted-window self-attention
+     └── streaming/
+         ├── chunk.py        # fixed-size causal chunk protocol
+         ├── tae.py          # streaming autoencoder (causal boundary state)
+         └── dit.py          # one-step streaming DiT (fixed timestep, RoPE offsets)
+ ```
+
+ > ★ marks the two contribution-critical files: the MFSWA processor in `transformer.py` and the ReAE in `reae.py`.
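
The tree describes `runner.py` as a four-stage pipelined runner (reader → H2D → GPU → writer). The sketch below shows the general pattern with standard-library threads and bounded queues. It is not the code in `runner.py`: `run_pipeline`, `_stage`, and `_DONE` are hypothetical names, and the `queue_size` argument here is only assumed to play a role similar to the knob of the same name. Error handling is omitted.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def _stage(fn, inbox, outbox):
    """Apply `fn` to every item from `inbox` and forward the result."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # pass the sentinel on and stop
            return
        outbox.put(fn(item))


def run_pipeline(items, stages, queue_size=4):
    """Run `items` through `stages`, one thread per stage, in order.

    Bounded queues give back-pressure: a fast stage blocks once it is
    `queue_size` items ahead of the next one.
    """
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=_stage, args=(fn, queues[i], queues[i + 1]), daemon=True)
        for i, fn in enumerate(stages)
    ]

    def feed():  # plays the "reader" role
        for item in items:
            queues[0].put(item)
        queues[0].put(_DONE)

    workers.append(threading.Thread(target=feed, daemon=True))
    for worker in workers:
        worker.start()

    results = []
    while True:  # drain the last queue on the calling thread
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for worker in workers:
        worker.join()
    return results
```

Each stage is a single FIFO worker, so output order matches input order. For example, `run_pipeline(range(3), [lambda x: x + 1, lambda x: x * 2])` pushes every item through both functions while the stages overlap in time.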
+
+
+
+
+ ## 🎬 More Visual Results
+ ---
+
+ > Full-length restored clips (low-quality input → RVR, played back to back).
+
+
+ widget:
+ - text: "Demo 1"
+   output:
+     url: assets/demo_1.mp4
+ - text: "Demo 2"
+   output:
+     url: assets/demo_2.mp4
+ - text: "Demo 3"
+   output:
+     url: assets/demo_3.mp4
+
+
+ ## 📖 Citation
+ ---
+ ```bibtex
+ @article{yan2026rvr,
+   title   = {RVR: One-step Generative Streaming Real-time Video Restoration},
+   author  = {Yan, Jiaqi and Chen, Xiangyu and Zhong, Xinlin and Liu, Jie and Zhou, Jiantao and Li, Xuelong},
+   journal = {arXiv preprint arXiv:XXXX.XXXXX},
+   year    = {2026}
+ }
+ ```
+
+
+
+ ## 🙏 Acknowledgements
+ ---
+ RVR builds on [Wan2.2-TI2V-5B](https://github.com/Wan-Video), the lightweight autoencoder [TAEHV](https://github.com/madebyollin/taehv), and the [RealBasicVSR](https://github.com/ckkelvinchan/RealBasicVSR) degradation pipeline. We thank the authors of [SeedVR2](https://github.com/ByteDance-Seed/SeedVR), [DOVE](https://github.com/zhengchen1999/DOVE), and [FlashVSR](https://github.com/OpenImagingLab/FlashVSR) for releasing strong baselines, and the [UltraVideo](https://github.com/Tele-AI/UltraVideo) team for the training corpus.
+
+
+
+ ## 📜 License
+ ---
+ Released under the [Apache 2.0 License](LICENSE). The Wan2.2 backbone and any third-party weights remain subject to their original licenses.
+
+ <div align="center">
+ <sub>If RVR is useful to your research or product, please consider giving it a ⭐.</sub>
+ </div>