SwiftVR: Real-Time One-Step Generative Video Restoration
Abstract
SwiftVR enables real-time generative video restoration through mask-free windowed attention and a lightweight autoencoder, reaching 26 FPS at 1080p on a consumer RTX 5090 and 14 FPS at 4K on a single H100.
Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs because of two main bottlenecks: quadratic spatial attention at high resolutions and the latency and memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention (SDPA) path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560×1440 and 14 FPS at 3840×2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920×1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality at lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.
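To make the attention idea concrete, the following is a minimal NumPy sketch of window attention driven purely by precomputed gather indices. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how shifted windows are laid out at the borders, so the clamping rule in the hypothetical `window_starts` helper (edge windows are pulled back inside the feature map and overlaps are averaged) is one possible choice. The `dense_attention` function stands in for a framework SDPA call, and queries, keys, and values all reuse the input tokens instead of learned projections.

```python
import numpy as np

def window_starts(size, window, shift):
    """Window start offsets along one axis (assumed rule: clamp edge windows inside)."""
    assert size >= window
    starts = list(range(shift, size - window + 1, window))
    if not starts or starts[0] != 0:
        starts.insert(0, 0)              # extra window covering the leading edge
    if starts[-1] != size - window:
        starts.append(size - window)     # extra window covering the trailing edge
    return starts

def window_index(height, width, window, shift):
    """Flat token indices per window, shape (num_windows, window * window)."""
    offs = np.arange(window)
    rows = np.array(window_starts(height, window, shift))[:, None] + offs
    cols = np.array(window_starts(width, window, shift))[:, None] + offs
    idx = rows[:, None, :, None] * width + cols[None, :, None, :]
    return idx.reshape(-1, window * window)

def dense_attention(q, k, v):
    """Plain softmax attention with no mask; stand-in for a dense SDPA kernel."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def window_self_attention(tokens, height, width, window, shift):
    """tokens: float array of shape (height * width, channels)."""
    idx = window_index(height, width, window, shift)
    gathered = tokens[idx]                              # dense (windows, w*w, C)
    attended = dense_attention(gathered, gathered, gathered)
    out = np.zeros_like(tokens)
    count = np.zeros((tokens.shape[0], 1))
    # Scatter back; tokens covered by overlapping windows are averaged.
    np.add.at(out, idx.reshape(-1), attended.reshape(-1, tokens.shape[1]))
    np.add.at(count, idx.reshape(-1), 1.0)
    return out / count

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8 * 8, 16))
restored = window_self_attention(tokens, height=8, width=8, window=4, shift=2)
print(restored.shape)
```

Because every window has the same number of tokens, the gathered tensor is rectangular and a single batched dense attention call handles all windows, which is the property the abstract attributes to staying on the dense SDPA path.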
Community
SwiftVR is a one-step generative video restoration framework designed for real-time streaming. While recent one-step diffusion methods reduce the number of denoising steps, they still struggle with high-resolution deployment due to expensive spatial attention and heavy video autoencoders. The main contribution is a deployment-oriented design: SwiftVR uses mask-free shifted-window self-attention to keep attention on the standard dense SDPA path, avoiding attention masks, cyclic shifts, padding, custom sparse kernels, and hardware-specific retraining. It also introduces a lightweight restoration-aware autoencoder and a causal chunk-wise streaming protocol. The reported runtime is impressive: 54 FPS at 1080p, 31 FPS at 1440p, and 14 FPS at 4K on a single H100. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1080p, making real-time generative video restoration on consumer hardware much more practical. The method also achieves strong perceptual quality, especially on MUSIQ, CLIP-IQA, and DISTS, while producing sharper and more natural details in real-world videos. A nice aspect of this work is that the speedup comes from architecture and implementation choices that remain compatible with standard dense attention backends, rather than relying on custom sparse kernels. This makes SwiftVR not only fast, but also easier to deploy.
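The frame rates quoted above can be turned into per-frame latency budgets and pixel throughput with simple arithmetic. The short script below is only a convenience calculation over the reported numbers; the helper names `frame_budget_ms` and `megapixels_per_second` are made up for this illustration.

```python
# Reported (GPU, width, height, FPS) figures quoted in the text above.
REPORTED = [
    ("H100", 1920, 1080, 54),
    ("H100", 2560, 1440, 31),
    ("H100", 3840, 2160, 14),
    ("RTX 5090", 1920, 1080, 26),
]

def frame_budget_ms(fps):
    """Average time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

def megapixels_per_second(width, height, fps):
    """Output pixel throughput in millions of pixels per second."""
    return width * height * fps / 1e6

for gpu, width, height, fps in REPORTED:
    budget = frame_budget_ms(fps)
    throughput = megapixels_per_second(width, height, fps)
    print(f"{gpu:8s} {width}x{height}: {budget:5.1f} ms/frame, {throughput:6.1f} MP/s")
```

On the H100, the three resolutions work out to roughly 112, 114, and 116 megapixels per second, so throughput stays nearly flat as resolution quadruples. That pattern is what one would expect if cost grows about linearly with pixel count, as windowed attention suggests, though the source does not state this explicitly.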