---
language:
- en
pipeline_tag: image-to-video
tags:
- image-to-video
- audio-conditioned
- diffusion
- talking-avatar
- pytorch
---

<div align="center">

<h1>AvatarForcing</h1>

<h3>One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising</h3>

<p>
<a href="https://huggingface.co/papers/2603.14331"><img src="https://img.shields.io/badge/HuggingFace-Paper-ffbd45?style=for-the-badge&logo=huggingface&logoColor=white" alt="Hugging Face Paper"></a>
<a href="https://arxiv.org/abs/2603.14331"><img src="https://img.shields.io/badge/arXiv-2603.14331-b31b1b?style=for-the-badge" alt="arXiv"></a>
<a href="https://cuiliyuan121.github.io/AvatarForcing/"><img src="https://img.shields.io/badge/Project-Page-blue?style=for-the-badge&logo=googlechrome&logoColor=white" alt="Project Page"></a>
</p>

</div>

AvatarForcing is a **one-step streaming diffusion** framework for talking avatars. It generates video from **one reference image + speech audio + (optional) text prompt**, using **local-future sliding-window denoising** with **heterogeneous noise levels** and **dual-anchor temporal forcing** for long-form stability. For method details, see: https://arxiv.org/abs/2603.14331

This Hugging Face repo (`lycui/AvatarForcing`) provides two training-stage checkpoints:

- `ode_audio_init.pt`: stage-1 **ODE** initialization weights
- `model.pt`: stage-2 **DMD** weights

## Model Download

| Models | Download Link | Notes |
|---|---|---|
| Wan2.1-T2V-1.3B | 🤗 [Hugging Face](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) | Base model (student) |
| AvatarForcing | 🤗 [Hugging Face](https://huggingface.co/lycui/AvatarForcing) | `ode_audio_init.pt` (ODE) + `model.pt` (DMD) |
| Wav2Vec | 🤗 [Hugging Face](https://huggingface.co/facebook/wav2vec2-base-960h) | Audio encoder |

Download models using `huggingface-cli`:


```sh
pip install "huggingface_hub[cli]"
mkdir -p pretrained_models

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./pretrained_models/Wan2.1-T2V-1.3B
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
huggingface-cli download lycui/AvatarForcing --local-dir ./pretrained_models/AvatarForcing
```

<details>
<summary><strong>Citation</strong></summary>

```bibtex
@misc{cui2026avatarforcingonestepstreamingtalking,
  title={AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising},
  author={Liyuan Cui and Wentao Hu and Wenyuan Zhang and Zesong Yang and Fan Shi and Xiaoqiang Liu},
  year={2026},
  eprint={2603.14331},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.14331},
}
```

</details>