SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation

Le Shen*, Qian Qiao*, Tan Yu*, Ke Zhou, Tianhang Yu, Yu Zhan, Zhenjie Wang, Dingcheng Zhen, Ming Tao, Shunshun Yin, Siyuan Liu βœ‰

*Equal Contribution βœ‰Corresponding Author


πŸ”₯ News

🀫 Coming soon

A 4-GPU version of SoulX-FlashTalk, plus a new open-source real-time streaming digital-human model designed specifically for consumer-grade GPUs such as the RTX 4090.

πŸ“‘ Todo List

  • Technical report
  • Project Page
  • Inference code
  • Checkpoint release
  • Online demo

πŸ“– Quickstart

πŸ”§ Installation

1. Create a Conda environment

conda create -n flashtalk python=3.10
conda activate flashtalk

2. Install PyTorch with CUDA support

pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128

3. Install other dependencies

pip install -r requirements.txt

4. Install flash-attention

pip install ninja
pip install flash_attn==2.8.0.post2 --no-build-isolation

5. Install FFmpeg

# Ubuntu / Debian
apt-get install ffmpeg
# CentOS / RHEL
yum install ffmpeg ffmpeg-devel

or

# Conda (no root required) 
conda install -c conda-forge ffmpeg==7
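In pipelines like this, FFmpeg is typically used to mux the generated video with the driving audio track. As a hedged illustration, the sketch below only builds a standard FFmpeg command line; the file names are placeholders and the flags are generic FFmpeg options, not taken from this repo's scripts:

```python
# Sketch: building an ffmpeg command to mux generated video with the
# driving audio. File names are placeholders; flags are standard
# ffmpeg options, not copied from this repo's inference scripts.

def mux_command(video_path: str, audio_path: str, out_path: str) -> list[str]:
    return [
        "ffmpeg", "-y",
        "-i", video_path,   # generated (silent) video
        "-i", audio_path,   # driving audio track
        "-c:v", "copy",     # keep the video stream as-is, no re-encode
        "-c:a", "aac",      # encode audio to AAC for MP4 containers
        "-shortest",        # stop at the shorter of the two inputs
        out_path,
    ]

cmd = mux_command("result.mp4", "speech.wav", "result_with_audio.mp4")
print(" ".join(cmd))
```

Running the list through `subprocess.run` would perform the actual mux once FFmpeg is on the `PATH`.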

πŸ€— Model download

| Model Component | Description | Link |
| --- | --- | --- |
| SoulX-FlashTalk-14B | Our 14B model | πŸ€— Huggingface |
| chinese-wav2vec2-base | Chinese wav2vec2 audio encoder | πŸ€— Huggingface |
# If you are in mainland China, run this first: export HF_ENDPOINT=https://hf-mirror.com
pip install "huggingface_hub[cli]"
huggingface-cli download Soul-AILab/SoulX-FlashTalk-14B --local-dir ./models/SoulX-FlashTalk-14B
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./models/chinese-wav2vec2-base
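As a rough illustration of why a wav2vec2 encoder pairs naturally with frame-by-frame video generation: wav2vec2 downsamples 16 kHz audio by a factor of 320 in its convolutional feature extractor, yielding about 50 feature vectors per second. The 25 fps output rate below is our assumption for illustration, not something confirmed by this repo:

```python
# Sketch: aligning wav2vec2 audio features with video frames.
# Assumptions (illustrative, not from this repo): 16 kHz input audio,
# a total conv stride of 320 samples, and 25 fps generated video.

AUDIO_SR = 16_000      # wav2vec2 expects 16 kHz mono audio
WAV2VEC_STRIDE = 320   # total downsampling of the conv feature extractor
VIDEO_FPS = 25         # assumed output frame rate

features_per_second = AUDIO_SR // WAV2VEC_STRIDE      # 50 vectors per second
features_per_frame = features_per_second / VIDEO_FPS  # 2.0 under these numbers

def audio_features_for_clip(num_frames: int) -> int:
    """Number of wav2vec2 feature vectors covering `num_frames` of video."""
    return int(num_frames * features_per_frame)

print(features_per_second)          # 50
print(audio_features_for_clip(81))  # 162
```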

πŸš€ Inference

# Inference on a single GPU
# Requires more than 64 GB of VRAM
bash inference_script_single_gpu.sh

# Inference on multiple GPUs
# Real-time inference speed requires 8x H800 GPUs or better
bash inference_script_multi_gpu.sh
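"Real-time" streaming means each chunk of frames must be generated faster than it plays back; otherwise the stream stalls. The sketch below works out that latency budget; the frame rate and chunk size are illustrative assumptions, not values taken from this repo's scripts:

```python
# Sketch: the real-time budget for streaming generation.
# Assumptions (illustrative, not from this repo): 25 fps output,
# chunks of 16 frames generated autoregressively.

VIDEO_FPS = 25
CHUNK_FRAMES = 16

# Playback duration of one chunk: generation must finish within this
# window (plus any buffered head start) to sustain an infinite stream.
chunk_budget_s = CHUNK_FRAMES / VIDEO_FPS  # 0.64 s per chunk

def is_realtime(gen_seconds_per_chunk: float) -> bool:
    """True if a chunk is produced faster than it plays back."""
    return gen_seconds_per_chunk <= chunk_budget_s

print(chunk_budget_s)    # 0.64
print(is_realtime(0.5))  # True
print(is_realtime(1.2))  # False
```

This is why real-time speed needs 8x H800 GPUs: the per-chunk generation time must be driven below the chunk's playback duration.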

πŸ‘‹ Online Demo

Coming Soon!

πŸ“§ Contact Us

If you are interested in our work or would like to leave a message, feel free to email le.shen@mail.dhu.edu.cn, qiaoqian@soulapp.cn, yutan@soulapp.cn, zhouke@soulapp.cn, or liusiyuan@soulapp.cn.

You’re welcome to join our WeChat group for technical discussions and updates.


WeChat Group QR Code

πŸ“š Citation

If you find our work useful in your research, please consider citing:

@misc{shen2025soulxflashtalktechnicalreport,
      title={SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation}, 
      author={Le Shen and Qian Qiao and Tan Yu and Ke Zhou and Tianhang Yu and Yu Zhan and Zhenjie Wang and Ming Tao and Shunshun Yin and Siyuan Liu},
      year={2025},
      eprint={2512.23379},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.23379}, 
}

πŸ™‡ Acknowledgement

  • InfiniteTalk and Wan: the base models we build upon.
  • Self-Forcing: the codebase we build upon.
  • DMD and Self-Forcing++: the key distillation techniques used in our method.

    If you find our work useful, please also consider starring the original repositories of these foundational methods.

πŸ’‘ Star History

Star History Chart
