---
license: apache-2.0
pipeline_tag: text-to-video
library_name: diffusers
arxiv: 2603.00040
---

# FastWan-QAD-1.3B
<p align="center">
  <img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logos/logo.svg" width="200"/>
</p>
<div>
  <div align="center">
    <a href="https://github.com/hao-ai-lab/FastVideo">Github</a> |
    <a href="https://haoailab.com/blogs/fastwan-qad/">Blog</a> |
    <a href="https://hao-ai-lab.github.io/FastVideo">Documentation</a>
  </div>
</div>

## Introduction

FastWan-QAD-1.3B is the fastest variant of the FastWan-QAD series and targets the RTX 5090. It pairs **NVFP4-quantized linear layers** with the **SageAttention3 FP4 attention backend** to generate a 5-second 480p video end to end in **1.78 seconds**, over 3.4× faster than prior distilled models on the same hardware.

The model is built on [Wan-AI/Wan2.1-T2V-1.3B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers) and trained with **quantization-aware distillation (QAD)**, jointly optimizing for low-bit precision and 3-step inference quality.

> **Hardware requirement:** RTX 5090 (sm100+). NVFP4 is a Blackwell-native format and is not supported on older GPUs. See [FastWan-QAD-1.3B-SA2](https://huggingface.co/FastVideo/FastWan-QAD-1.3B-SA2) for an alternative that uses SageAttention2++, or [FastWan-QAD-FP8-1.3B](https://huggingface.co/FastVideo/FastWan-QAD-FP8-1.3B) for RTX 4090 support.
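
A quick way to check the hardware requirement before pulling the container is to query the GPU's compute capability. The snippet below is a minimal sketch, not part of FastVideo: `supports_nvfp4` and `detect_capability` are hypothetical helper names, and the threshold assumes that "sm100+" corresponds to a major compute capability of 10 or higher.

```python
def supports_nvfp4(capability: tuple, min_major: int = 10) -> bool:
    """Return True if a (major, minor) compute capability meets the assumed floor.

    Assumption: "sm100+" from the model card is read as major >= 10.
    """
    major, _minor = capability
    return major >= min_major


def detect_capability():
    """Return the current GPU's (major, minor), or None if it cannot be queried."""
    try:
        import torch  # optional dependency; only needed for live detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    elif supports_nvfp4(cap):
        print(f"sm{cap[0]}{cap[1]}: meets the assumed sm100+ requirement.")
    else:
        print(f"sm{cap[0]}{cap[1]}: below sm100; consider the FP8 variant instead.")
```

Keeping the decision in a pure function makes it easy to test without a GPU; only `detect_capability` touches PyTorch.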

---

## Model Overview

- **3-step inference** via quantization-aware distillation
- **NVFP4 linear layers** for maximum throughput on Blackwell GPUs
- **SageAttention3 FP4 backend** for attention computation
- Trained at **480p (832×480)** resolution, 81 frames (5 seconds at 16 fps)
- No classifier-free guidance at inference time
- Fast decoding via [TAEHV](https://github.com/madebyollin/taehv) tiny autoencoder
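
The generation settings listed above can be gathered in one small structure, which also makes the clip-length arithmetic explicit. This is an illustrative sketch only: `ClipSpec` and its field names are hypothetical and do not mirror FastVideo's actual configuration objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Hypothetical container for the settings stated in the model overview."""

    width: int = 832              # training resolution (480p)
    height: int = 480
    num_frames: int = 81
    fps: int = 16
    num_inference_steps: int = 3  # distilled to 3 denoising steps

    @property
    def duration_seconds(self) -> float:
        """Clip length implied by the frame count and frame rate."""
        return self.num_frames / self.fps


spec = ClipSpec()
print(f"{spec.width}x{spec.height}, {spec.num_frames} frames "
      f"-> {spec.duration_seconds:.4f} s")  # 81 / 16 = 5.0625 s
```

The result, 5.0625 s, is what the card rounds to "5 seconds".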

## Performance

| Model | Hardware | Generation Time (5s 480p) |
|---|---|---|
| FastWan-QAD-1.3B | RTX 5090 | **1.78s** |
| [FastWan-QAD-1.3B-SA2](https://huggingface.co/FastVideo/FastWan-QAD-1.3B-SA2) | RTX 5090 | ~2.0s |
| [FastWan-QAD-FP8-1.3B](https://huggingface.co/FastVideo/FastWan-QAD-FP8-1.3B) | RTX 4090 | ~3.4s |
| TurboDiffusion | RTX 5090 | 6.10s |
| LightX2V | RTX 5090 | 6.91s |
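
The speedup figures follow directly from the table. The short calculation below uses only the exact RTX 5090 timings listed above; the approximate (`~`) entries and the RTX 4090 row are left out because they are not like-for-like comparisons. The `speedup` helper is just a name chosen for this example.

```python
# Generation times in seconds for a 5 s 480p clip on an RTX 5090 (from the table).
times_s = {
    "FastWan-QAD-1.3B": 1.78,
    "TurboDiffusion": 6.10,
    "LightX2V": 6.91,
}


def speedup(other_s: float, ours_s: float) -> float:
    """How many times faster `ours_s` is than `other_s`."""
    return other_s / ours_s


ours = times_s["FastWan-QAD-1.3B"]
speedups = {
    name: speedup(t, ours)
    for name, t in times_s.items()
    if name != "FastWan-QAD-1.3B"
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
# TurboDiffusion: 3.43x
# LightX2V: 3.88x
```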

## Inference

```bash
docker run --gpus all --ipc=host --rm -it ghcr.io/hao-ai-lab/fastvideo/fastvideo-dev:py3.12-sha-f889e6b bash

# should drop you in /FastVideo with venv already activated
git fetch && git checkout main
# build fastvideo-kernel
cd fastvideo-kernels/ && ./build.sh && cd ..
git clone https://github.com/madebyollin/taehv
uv pip install ./taehv

# run generation:
FASTVIDEO_DISABLE_ATTENTION_COMPILE=0 \
FASTVIDEO_ATTENTION_BACKEND=ATTN_QAT_INFER \
python examples/inference/optimizations/FastWan_QAD_TAEHV.py \
  --model FastVideo/FastWan-QAD-1.3B \
  --distilled_model "" \
  --taehv_checkpoint taehv/taew2_1.pth
```
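
If you prefer to launch the generation step from Python (for example, inside a batch script), the same invocation can be assembled programmatically. This is a sketch under the assumption that you are in the `/FastVideo` directory inside the container; `build_invocation` is a hypothetical helper, and the environment variables, script path, and flags are copied from the shell command above.

```python
import os
import shlex
import subprocess


def build_invocation(model="FastVideo/FastWan-QAD-1.3B",
                     taehv_checkpoint="taehv/taew2_1.pth"):
    """Return (extra_env, argv) mirroring the shell command from the model card."""
    extra_env = {
        "FASTVIDEO_DISABLE_ATTENTION_COMPILE": "0",
        "FASTVIDEO_ATTENTION_BACKEND": "ATTN_QAT_INFER",
    }
    argv = [
        "python",
        "examples/inference/optimizations/FastWan_QAD_TAEHV.py",
        "--model", model,
        "--distilled_model", "",  # passed as an empty string, as in the card
        "--taehv_checkpoint", taehv_checkpoint,
    ]
    return extra_env, argv


def run(extra_env, argv):
    """Execute the command with the extra variables layered over the current env."""
    return subprocess.run(argv, env={**os.environ, **extra_env}, check=True)


if __name__ == "__main__":
    extra_env, argv = build_invocation()
    # Dry run: print the equivalent shell line instead of executing it.
    prefix = " ".join(f"{k}={v}" for k, v in extra_env.items())
    print(prefix, shlex.join(argv))
    # Call run(extra_env, argv) to actually start generation.
```

Passing the arguments as a list avoids shell quoting issues, which matters here because `--distilled_model` takes an empty string.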

## Training

More details coming soon.

---

If you find this model useful, please cite our paper:
```
@article{Zhang2026AttnQAT,
  title={Attn-QAT: 4-Bit Attention With Quantization-Aware Training},
  author={Zhang, Peiyuan and Noto, Matthew and Tan, Wenxuan and Jiang, Chengquan and Lin, Will and Zhou, Wei and Zhang, Hao},
  journal={arXiv preprint arXiv:2603.00040},
  year={2026}
}
```