Add model card and metadata
#1
by nielsr HF Staff - opened
README.md
CHANGED
@@ -1,3 +1,71 @@
---
license: apache-2.0
pipeline_tag: text-to-video
---

# Ultra Flash ⚡

**Ultra Flash** is a cascaded streaming framework capable of real-time high-resolution video generation. It achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU.

[**Paper**](https://huggingface.co/papers/2606.09150) | [**Project Page**](https://xin1u.github.io/UltraFlash/) | [**Code (GitHub)**](https://github.com/xin1u/UltraFlash)

## Overview

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions. Ultra Flash bridges this gap by cascading three key components after a low-resolution streaming generator:

1. **Architecture-Preserving T2V-to-TV2V SR Training** with AIGC-oriented degradation.
2. **Causal Streaming Latent Upsampler** (~2M params, <5% overhead) for spatiotemporal coherence.
3. **Cascaded Streaming Optimization** (sparse distillation, DPO, and dynamic cache management).
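
To put the throughput and overhead figures in perspective, the arithmetic below converts a frame rate into a per-frame time budget. It is a minimal sketch, not part of the Ultra Flash code: the helper names are hypothetical, and reading "<5% overhead" as a share of the per-frame time budget is an assumption, since the card does not say what the overhead is measured against.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per output frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def overhead_cap_ms(fps: float, overhead_fraction: float = 0.05) -> float:
    """Upper bound on a component's per-frame cost, ASSUMING its overhead is
    expressed as a fraction of the per-frame time budget."""
    if not 0.0 <= overhead_fraction <= 1.0:
        raise ValueError("overhead_fraction must be in [0, 1]")
    return frame_budget_ms(fps) * overhead_fraction


if __name__ == "__main__":
    # Frame rates quoted in the model card: ~30 FPS at 1K, ~18 FPS at 2K.
    for label, fps in (("1K", 30.0), ("2K", 18.0)):
        print(
            f"{label}: {frame_budget_ms(fps):.1f} ms/frame, "
            f"5% cap {overhead_cap_ms(fps):.2f} ms"
        )
```

Under that assumption, the whole cascade has roughly 33.3 ms per frame at 30 FPS and 55.6 ms per frame at 18 FPS, and a component limited to 5% of the budget would have about 1.67 ms and 2.78 ms respectively.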

## Architecture

The framework follows a cascaded pipeline:
- **Self-Forcing Generator**: Based on Wan2.1-1.3B, producing 480P streaming latents.
- **Causal Latent Upsampler**: Performs 2x or 3x spatial upsampling in the latent space.
- **Sparse SR DiT**: Refines high-resolution latents using single-step denoising and block-sparse attention.
- **Tiny Decoder**: A causal memory network for efficient latent-to-pixel decoding at 1K/2K.
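
The spatial sizes flowing through the stages listed above can be traced with a few lines of arithmetic. This is an illustrative sketch only: the 832x480 base resolution and the 8x spatial VAE stride are assumptions taken from common Wan2.1 480P settings, not values stated in this card, and `Resolution` and `cascade_shapes` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass

VAE_SPATIAL_STRIDE = 8  # ASSUMPTION: pixels per latent cell along each spatial axis


@dataclass(frozen=True)
class Resolution:
    width: int
    height: int

    def scaled(self, factor: int) -> Resolution:
        """Multiply both spatial dimensions by an integer factor."""
        return Resolution(self.width * factor, self.height * factor)

    def to_latent(self, stride: int = VAE_SPATIAL_STRIDE) -> Resolution:
        """Pixel size -> latent size; both sides must be divisible by the stride."""
        if self.width % stride or self.height % stride:
            raise ValueError("resolution must be divisible by the VAE stride")
        return Resolution(self.width // stride, self.height // stride)


BASE_480P = Resolution(832, 480)  # ASSUMPTION: a typical Wan2.1 480P setting


def cascade_shapes(base: Resolution, upsample_factor: int) -> dict[str, Resolution]:
    """Trace spatial sizes: generator latent -> upsampled latent -> decoded pixels."""
    if upsample_factor not in (2, 3):
        raise ValueError("the card describes 2x or 3x latent upsampling")
    low_latent = base.to_latent()
    high_latent = low_latent.scaled(upsample_factor)  # Causal Latent Upsampler
    output = high_latent.scaled(VAE_SPATIAL_STRIDE)   # decoded by the Tiny Decoder
    return {"low_latent": low_latent, "high_latent": high_latent, "output": output}


if __name__ == "__main__":
    for factor in (2, 3):
        print(factor, cascade_shapes(BASE_480P, factor))
```

Under these assumptions, 2x upsampling yields a 1664x960 output and 3x yields 2496x1440. Whether those match the exact "1K" and "2K" resolutions reported for Ultra Flash is not confirmed by this card.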

## Quick Start

### Installation

```bash
conda create -n ultraflash python=3.10 -y
conda activate ultraflash
cd inference
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# Block Sparse Attention (CUDA kernel, required for SR DiT)
git clone https://github.com/mit-han-lab/Block-Sparse-Attention.git
cd Block-Sparse-Attention
pip install -e .
```
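
After the steps above, a quick import check can confirm that the compiled extensions are visible to the environment. This is a minimal sketch rather than a script shipped with the repository: `check_modules` is a hypothetical helper, and the module names `flash_attn` and `block_sparse_attn` (and the presence of `torch` in `requirements.txt`) are assumptions about what the packages install.

```python
import importlib.util


def check_modules(names):
    """Return {module_name: True/False} without importing the modules.

    Only top-level module names are supported; find_spec returns None
    when a top-level module cannot be located.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # ASSUMPTION: these are the import names of the installed packages.
    status = check_modules(["torch", "flash_attn", "block_sparse_attn"])
    for name, found in status.items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

A `MISSING` entry usually means the corresponding `pip install` step failed or was run in a different conda environment.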

### Inference

For custom inference at high resolution:

```bash
# Run from the repository root (the directory that contains inference/)
cd inference
python inference.py \
  --config_path configs/self_forcing_dmd_4step.yaml \
  --checkpoint_path checkpoints/self_forcing_dmd.pt \
  --data_path prompts/examples.txt \
  --output_folder outputs/ \
  --use_ema \
  --tiny_decoder \
  --torch_compile \
  --compile_sr_dit
```
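
When running many prompt files, the same command can be assembled programmatically. The sketch below only builds the argument list; it uses exactly the flags shown above and does not launch anything. `build_command` is a hypothetical helper, and treating the four boolean flags as independently optional is an assumption about `inference.py`.

```python
import shlex


def build_command(
    data_path,
    output_folder="outputs/",
    config_path="configs/self_forcing_dmd_4step.yaml",
    checkpoint_path="checkpoints/self_forcing_dmd.pt",
    use_ema=True,
    tiny_decoder=True,
    torch_compile=True,
    compile_sr_dit=True,
):
    """Build the inference.py argument list from the flags in the model card."""
    cmd = [
        "python", "inference.py",
        "--config_path", config_path,
        "--checkpoint_path", checkpoint_path,
        "--data_path", data_path,
        "--output_folder", output_folder,
    ]
    # ASSUMPTION: each switch can be dropped independently of the others.
    switches = {
        "--use_ema": use_ema,
        "--tiny_decoder": tiny_decoder,
        "--torch_compile": torch_compile,
        "--compile_sr_dit": compile_sr_dit,
    }
    cmd.extend(flag for flag, enabled in switches.items() if enabled)
    return cmd


if __name__ == "__main__":
    command = build_command("prompts/examples.txt")
    # Print a shell-safe version; pass `command` to subprocess.run(..., cwd="inference")
    # to actually execute it.
    print(shlex.join(command))
```

Keeping the command as a list avoids shell-quoting problems if a path contains spaces.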

## Citation

```bibtex
@inproceedings{luxury2026ultraflash,
  title={Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions},
  author={Luxury and Huang, Jie and Fan, Zihao and Ma, Xiaoxiao and Li, Yuming and Zhuang, Jun-hao and Xue, Zeyue and Fu, Siming and Li, Haoran and Zhong, Mingchen and Zhang, Guohui and Ma, Shichen and Liu, Yijun and Shi, Jiaqi and Ma, Yanwen and Su, Yaofeng and Wang, Haoyu and Li, Yaowei and Zhang, Songchun and Jin, Weiyang and Bian, Yuxuan and Zhang, Shiyi and Xu, Haojun and Lu, Shuai and Han, Xin and Tang, Wei and Huang, Haoyang and Duan, Nan},
  booktitle={arXiv preprint},
  year={2026}
}
```