	---
	license: apache-2.0
	pipeline_tag: text-to-video
	library_name: diffusers
	arxiv: 2603.00040
	---

	# FastWan-QAD-1.3B
	<p align="center">
	<img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logos/logo.svg" width="200"/>
	</p>
	<div>
	<div align="center">
	<a href="https://github.com/hao-ai-lab/FastVideo">GitHub</a> |
	<a href="https://haoailab.com/blogs/fastwan-qad/">Blog</a> |
	<a href="https://hao-ai-lab.github.io/FastVideo">Documentation</a>
	</div>
	</div>

	## Introduction

	FastWan-QAD-1.3B is the fastest variant of the FastWan-QAD series and targets RTX 5090 users. It pairs NVFP4-quantized linear layers with the SageAttention3 FP4 attention backend and generates a 5-second 480p video end to end in 1.78 seconds, over 3.4× faster than prior distilled models on the same hardware (see the Performance section).

	The model is built on [Wan-AI/Wan2.1-T2V-1.3B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers) and trained with quantization-aware distillation (QAD), jointly optimizing for low-bit precision and 3-step inference quality.

	> Hardware requirement: RTX 5090 (sm100+). NVFP4 is a Blackwell-native format and is not supported on older GPUs. See [FastWan-QAD-1.3B-SA2](https://huggingface.co/FastVideo/FastWan-QAD-1.3B-SA2) for an alternative using SageAttention2++ or [FastWan-QAD-FP8-1.3B](https://huggingface.co/FastVideo/FastWan-QAD-FP8-1.3B) for RTX 4090 support.
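
The hardware note above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of FastVideo: `pick_variant` and `detect_capability` are hypothetical helper names, the `major >= 10` threshold simply mirrors the "sm100+" wording, and the `(8, 9)` threshold for the FP8 variant is an assumption based on the RTX 4090's compute capability.

```python
from typing import Optional, Tuple

# Repo IDs are taken from this model card; the thresholds below are assumptions.
NVFP4_REPO = "FastVideo/FastWan-QAD-1.3B"
FP8_REPO = "FastVideo/FastWan-QAD-FP8-1.3B"


def pick_variant(capability: Tuple[int, int]) -> Optional[str]:
    """Map a CUDA compute capability (major, minor) to a model repo, or None."""
    major, minor = capability
    if major >= 10:  # mirrors the "sm100+" requirement stated in the note
        return NVFP4_REPO
    if (major, minor) >= (8, 9):  # assumption: RTX 4090-class GPUs and newer
        return FP8_REPO
    return None  # no variant on this card is listed for older GPUs


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the capability of GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

Calling `pick_variant(detect_capability())` on a machine with a CUDA GPU would return a repo ID to pass as `--model`. The SageAttention2++ variant is left out of the mapping because it is an alternative on the same hardware rather than a fallback.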

	---

	## Model Overview

	- 3-step inference via quantization-aware distillation
	- NVFP4 linear layers for maximum throughput on Blackwell GPUs
	- SageAttention3 FP4 backend for attention computation
	- Trained at 480p (832×480) resolution, 81 frames (5 seconds at 16 fps)
	- No classifier-free guidance at inference time
	- Fast decoding via [TAEHV](https://github.com/madebyollin/taehv) tiny autoencoder
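
As a small worked example of the clip specification above, the sketch below computes the exact duration and the total pixel count of one generated clip. `ClipSpec` is a hypothetical container used only for illustration; the numbers come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Training/inference clip shape as listed in the model overview."""
    width: int = 832
    height: int = 480
    num_frames: int = 81
    fps: int = 16

    @property
    def duration_s(self) -> float:
        # Playback duration in seconds.
        return self.num_frames / self.fps

    @property
    def pixels_per_clip(self) -> int:
        # Total number of pixels the decoder has to produce for one clip.
        return self.width * self.height * self.num_frames


spec = ClipSpec()
print(f"{spec.duration_s:.4f} s")  # 5.0625 s
print(spec.pixels_per_clip)        # 32348160
```

The "5 seconds" figure is therefore a rounding of 81 / 16 = 5.0625 seconds.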

	## Performance

	| Model | Hardware | Generation time (5 s, 480p) |
	|---|---|---|
	| FastWan-QAD-1.3B | RTX 5090 | 1.78 s |
	| [FastWan-QAD-1.3B-SA2](https://huggingface.co/FastVideo/FastWan-QAD-1.3B-SA2) | RTX 5090 | ~2.0 s |
	| [FastWan-QAD-FP8-1.3B](https://huggingface.co/FastVideo/FastWan-QAD-FP8-1.3B) | RTX 4090 | ~3.4 s |
	| TurboDiffusion | RTX 5090 | 6.10 s |
	| LightX2V | RTX 5090 | 6.91 s |
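
The speedup figures implied by the table can be checked with a few lines of arithmetic. This is a sketch that only reuses the timings listed above; `speedup_over` is a hypothetical helper, and the entries marked "~" in the table are approximate, so their ratios are approximate too.

```python
# Generation times in seconds, copied from the table above.
TIMES_S = {
    "FastWan-QAD-1.3B (RTX 5090)": 1.78,
    "FastWan-QAD-1.3B-SA2 (RTX 5090)": 2.0,   # approximate in the table
    "FastWan-QAD-FP8-1.3B (RTX 4090)": 3.4,   # approximate, different GPU
    "TurboDiffusion (RTX 5090)": 6.10,
    "LightX2V (RTX 5090)": 6.91,
}
BASELINE = "FastWan-QAD-1.3B (RTX 5090)"


def speedup_over(name: str) -> float:
    """How many times faster the baseline model is than `name`."""
    return TIMES_S[name] / TIMES_S[BASELINE]


for name, seconds in TIMES_S.items():
    print(f"{name:34s} {seconds:5.2f} s  {speedup_over(name):.2f}x")
```

The two RTX 5090 baselines come out at roughly 3.43× (TurboDiffusion) and 3.88× (LightX2V). The RTX 4090 row was measured on different hardware, so its ratio is not a like-for-like comparison.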

	## Inference

	```bash
	docker run --gpus all --ipc=host --rm -it ghcr.io/hao-ai-lab/fastvideo/fastvideo-dev:py3.12-sha-f889e6b bash

	# The container should drop you in /FastVideo with the venv already activated.
	git fetch && git checkout main

	# Build the FastVideo kernels.
	cd fastvideo-kernels/ && ./build.sh && cd ..

	# Install the TAEHV tiny autoencoder used for fast decoding.
	git clone https://github.com/madebyollin/taehv
	uv pip install ./taehv

	# Run generation.
	FASTVIDEO_DISABLE_ATTENTION_COMPILE=0 \
	FASTVIDEO_ATTENTION_BACKEND=ATTN_QAT_INFER \
	python examples/inference/optimizations/FastWan_QAD_TAEHV.py \
	    --model FastVideo/FastWan-QAD-1.3B \
	    --distilled_model "" \
	    --taehv_checkpoint taehv/taew2_1.pth
	```
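
For scripted runs, the generation step can also be launched from Python. The sketch below only reassembles the command shown above: `build_generation_command` and `run_generation` are hypothetical helpers, not part of FastVideo, and it assumes the current directory is the FastVideo repository root inside the container.

```python
import os
import shlex
import subprocess
from typing import Dict, List, Tuple


def build_generation_command(
    model: str = "FastVideo/FastWan-QAD-1.3B",
    taehv_checkpoint: str = "taehv/taew2_1.pth",
) -> Tuple[Dict[str, str], List[str]]:
    """Return (environment overrides, argv) matching the shell command above."""
    env_overrides = {
        "FASTVIDEO_DISABLE_ATTENTION_COMPILE": "0",
        "FASTVIDEO_ATTENTION_BACKEND": "ATTN_QAT_INFER",
    }
    argv = [
        "python",
        "examples/inference/optimizations/FastWan_QAD_TAEHV.py",
        "--model", model,
        "--distilled_model", "",  # passed as an empty string, as in the command above
        "--taehv_checkpoint", taehv_checkpoint,
    ]
    return env_overrides, argv


def run_generation(dry_run: bool = True) -> str:
    """Print-friendly command line; only executes when dry_run is False."""
    env_overrides, argv = build_generation_command()
    printable = " ".join(
        [f"{key}={value}" for key, value in env_overrides.items()]
        + [shlex.quote(arg) for arg in argv]
    )
    if not dry_run:
        # Merge the overrides into the current environment and fail on errors.
        subprocess.run(argv, env={**os.environ, **env_overrides}, check=True)
    return printable


print(run_generation(dry_run=True))
```

Keeping the environment overrides separate from the argument list avoids shell quoting issues, in particular for the empty `--distilled_model` value.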

	## Training

	More details coming soon.

	---

	If you find this model useful, please cite our paper:
	```bibtex
	@article{Zhang2026AttnQAT,
	title={Attn-QAT: 4-Bit Attention With Quantization-Aware Training},
	author={Zhang, Peiyuan and Noto, Matthew and Tan, Wenxuan and Jiang, Chengquan and Lin, Will and Zhou, Wei and Zhang, Hao},
	journal={arXiv preprint arXiv:2603.00040},
	year={2026}
	}
	```