Add model card and metadata

#1, opened by nielsr (HF Staff)

Files changed (1): README.md (+68 -0)
@@ -1,3 +1,71 @@
  ---
  license: apache-2.0
+ pipeline_tag: text-to-video
  ---
+
+ # Ultra Flash ⚡
+
+ **Ultra Flash** is a cascaded streaming framework for real-time, high-resolution video generation. It reaches ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU.
+
+ [**Paper**](https://huggingface.co/papers/2606.09150) | [**Project Page**](https://xin1u.github.io/UltraFlash/) | [**Code (GitHub)**](https://github.com/xin1u/UltraFlash)
+
+ ## Overview
+
+ Recent autoregressive video diffusion models achieve remarkable streaming quality but remain confined to low resolutions. Ultra Flash closes this gap by cascading three key components after a low-resolution streaming generator:
+
+ 1. **Architecture-Preserving T2V-to-TV2V SR Training** with AIGC-oriented degradation.
+ 2. **Causal Streaming Latent Upsampler** (~2M parameters, <5% overhead) for spatiotemporal coherence.
+ 3. **Cascaded Streaming Optimization** (sparse distillation, DPO, and dynamic cache management).
+
+ ## Architecture
+
+ The framework is a cascaded pipeline with four stages:
+ - **Self-Forcing Generator**: Based on Wan2.1-1.3B, producing 480P streaming latents.
+ - **Causal Latent Upsampler**: Performs 2x or 3x spatial upsampling in the latent space.
+ - **Sparse SR DiT**: Refines high-resolution latents using single-step denoising and block-sparse attention.
+ - **Tiny Decoder**: A causal memory network for efficient latent-to-pixel decoding at 1K/2K.
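
The sizes flowing through the cascade can be worked out with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that this README does not state: that "480P" means an 832x480 frame, and that the VAE compresses each spatial side by 8x. The names `StageShape` and `cascade_shape` are hypothetical and are not part of the Ultra Flash code.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumptions (not stated in the README): 480P means 832x480 pixels, and the
# VAE shrinks each spatial side by a factor of 8.
BASE_WIDTH, BASE_HEIGHT = 832, 480
VAE_SPATIAL_STRIDE = 8


@dataclass(frozen=True)
class StageShape:
    scale: int
    latent_hw: tuple[int, int]  # (height, width) of the latent grid
    pixel_hw: tuple[int, int]   # (height, width) after decoding


def cascade_shape(scale: int) -> StageShape:
    """Latent and pixel sizes after upsampling the base latent by `scale`."""
    if scale not in (1, 2, 3):
        raise ValueError("scale must be 1 (no upsampling), 2, or 3")
    base_h = BASE_HEIGHT // VAE_SPATIAL_STRIDE
    base_w = BASE_WIDTH // VAE_SPATIAL_STRIDE
    latent = (base_h * scale, base_w * scale)
    pixels = (latent[0] * VAE_SPATIAL_STRIDE, latent[1] * VAE_SPATIAL_STRIDE)
    return StageShape(scale, latent, pixels)


if __name__ == "__main__":
    for s in (1, 2, 3):
        print(cascade_shape(s))
```

Under these assumptions, 2x upsampling yields a 1664x960 frame and 3x yields 2496x1440. Whether those are exactly the sizes the authors label 1K and 2K should be checked against the paper.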
+
+ ## Quick Start
+
+ ### Installation
+
+ ```bash
+ # Run from the root of a clone of https://github.com/xin1u/UltraFlash
+ conda create -n ultraflash python=3.10 -y
+ conda activate ultraflash
+ cd inference
+ pip install -r requirements.txt
+ pip install flash-attn --no-build-isolation
+
+ # Block Sparse Attention (CUDA kernel, required for SR DiT)
+ git clone https://github.com/mit-han-lab/Block-Sparse-Attention.git
+ cd Block-Sparse-Attention
+ pip install -e .
+ cd ../..  # back to the repository root
+ ```
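
After installation, a quick check that the compiled packages are importable can save a failed first run. The following is a minimal sketch, not part of the repository. The import names are assumptions: `flash_attn` for the flash-attn package, `block_sparse_attn` for Block-Sparse-Attention, and `torch` on the guess that `requirements.txt` installs PyTorch.

```python
from __future__ import annotations

import importlib.util

# Assumed import names; adjust if the installed packages expose different ones.
REQUIRED_MODULES = ("torch", "flash_attn", "block_sparse_attn")


def check_modules(names=REQUIRED_MODULES) -> dict[str, bool]:
    """Map each top-level module name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

`find_spec` only locates a module without importing it, so the check is fast but does not prove that the CUDA kernels load correctly.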
+
+ ### Inference
+
+ To run high-resolution inference on a prompt file:
+
+ ```bash
+ # Run from the repository root
+ cd inference
+ python inference.py \
+     --config_path configs/self_forcing_dmd_4step.yaml \
+     --checkpoint_path checkpoints/self_forcing_dmd.pt \
+     --data_path prompts/examples.txt \
+     --output_folder outputs/ \
+     --use_ema \
+     --tiny_decoder \
+     --torch_compile \
+     --compile_sr_dit
+ ```
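
When launching this command from a script, it can help to assemble the arguments in one place and confirm the input files exist first. The sketch below reuses only the flags shown in the command above. The helpers `build_command` and `missing_inputs` are hypothetical, and treating the last four flags as independent switches is an assumption about `inference.py`.

```python
from __future__ import annotations

import sys
from pathlib import Path

# Switches copied from the README command; assumed to be independent flags.
README_FLAGS = ("--use_ema", "--tiny_decoder", "--torch_compile", "--compile_sr_dit")


def build_command(config, checkpoint, prompts, output_dir, flags=README_FLAGS):
    """Return the argv list for the README's inference command."""
    return [
        sys.executable, "inference.py",
        "--config_path", str(config),
        "--checkpoint_path", str(checkpoint),
        "--data_path", str(prompts),
        "--output_folder", str(output_dir),
        *flags,
    ]


def missing_inputs(*paths) -> list[str]:
    """Return the given paths that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]
```

A caller would check `missing_inputs(config, checkpoint, prompts)` and, if it returns an empty list, pass the result of `build_command(...)` to `subprocess.run(cmd, check=True, cwd="inference")`.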
61
+
62
+ ## Citation
63
+
64
+ ```bibtex
65
+ @inproceedings{luxury2026ultraflash,
66
+ title={Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions},
67
+ author={Luxury and Huang, Jie and Fan, Zihao and Ma, Xiaoxiao and Li, Yuming and Zhuang, Jun-hao and Xue, Zeyue and Fu, Siming and Li, Haoran and Zhong, Mingchen and Zhang, Guohui and Ma, Shichen and Liu, Yijun and Shi, Jiaqi and Ma, Yanwen and Su, Yaofeng and Wang, Haoyu and Li, Yaowei and Zhang, Songchun and Jin, Weiyang and Bian, Yuxuan and Zhang, Shiyi and Xu, Haojun and Lu, Shuai and Han, Xin and Tang, Wei and Huang, Haoyang and Duan, Nan},
68
+ booktitle={arXiv preprint},
69
+ year={2026}
70
+ }
71
+ ```