license: apache-2.0
In this work, we present AMD Hummingbird-XT, an efficient 5B-parameter DiT-based video generation model designed for high-quality video generation on client-grade GPUs.
Hummingbird-XT is built on Wan2.2-5B-TI2V and trained with DMD step distillation on carefully curated data, enabling 3-step generation while preserving high visual fidelity and motion quality. To reduce the computational overhead of high-resolution video decoding in 3D convolution-based VAE decoders, we introduce a lightweight VAE decoder that replaces part of the 3D convolutions with depthwise separable convolutions. Additionally, to extend the length of generated videos, we introduce Hummingbird-XTX, an efficient autoregressive model for long-video generation based on Wan-2.1-1.3B.
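To give a rough sense of why swapping 3D convolutions for depthwise separable ones lightens a decoder, the sketch below compares parameter counts for a single layer. It is illustrative arithmetic only: the channel width (256), the kernel size (3), and the helper names are assumptions chosen for the example, not values or code taken from the released Hummingbird-XT decoder.

```python
def conv3d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameter count of a standard 3D convolution with a cubic k x k x k kernel."""
    return c_in * c_out * k ** 3 + (c_out if bias else 0)


def depthwise_separable_conv3d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameter count of a depthwise k x k x k conv followed by a 1 x 1 x 1 pointwise conv."""
    # Depthwise stage: one k x k x k filter per input channel.
    depthwise = c_in * k ** 3 + (c_in if bias else 0)
    # Pointwise stage: a 1 x 1 x 1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    channels, kernel = 256, 3
    standard = conv3d_params(channels, channels, kernel)
    separable = depthwise_separable_conv3d_params(channels, channels, kernel)
    print(f"standard 3D conv parameters:     {standard}")
    print(f"depthwise separable parameters:  {separable}")
    print(f"reduction factor:                {standard / separable:.1f}x")
```

The multiply-accumulate count per output position shrinks by about the same factor, which is why the replacement matters most at high output resolutions. The actual savings depend on which layers are replaced and on their real channel widths.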
As a result, Hummingbird-XT achieves a 33× speedup on Strix Halo iGPU and a 40× speedup on AMD Instinct™ MI325, and supports generating 121-frame videos at 720×1280 resolution across both server-grade (AMD Instinct™ MI300 and AMD Instinct™ MI325) and client-grade (Strix Halo and Navi48) devices. Quantitative results on the VBench-T2V and VBench-I2V benchmarks show that Hummingbird-XT achieves competitive performance compared to the original Wan2.2-5B-TI2V model.
The training and inference code is fully released at Hummingbird-XT, and the technical details are available in Bridging the Last Mile: Deploying Hummingbird-XT for Efficient Video Generation on AMD Consumer-Grade Platforms.
Text-to-Video Generation
Image-to-Video Generation
Long Video Generation