---
license: apache-2.0
language:
- zh
- en
tags:
- talking-head
- text-to-video
- audio-video-generation
- autoregressive
- diffusion
library_name: transformers
pipeline_tag: text-to-speech
---

# Talker-T2AV

**Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling**

[Paper (arXiv 2604.23586)](https://arxiv.org/abs/2604.23586) ·
[Code (GitHub)](https://github.com/zhenye234/Talker-T2AV) ·
[Samples](https://talker-t2av.github.io/)

This repository hosts the pretrained weights for the paper
"Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive
Diffusion Modeling".

## Contents

```
talker-t2av/
  model.safetensors       ← AR backbone (Qwen3-0.6B) + dual diffusion heads
                            + Patch Transformer Encoder + Stop Predictor
  config.json
  chat_template.jinja
  tokenizer.json
  tokenizer_config.json
whisperx-vae/
  model.ckpt              ← WhisperX-VAE audio autoencoder
                            (32-d, 25 Hz; Whisper-Large-v3 encoder + DAC backbone)
```

Two further components are not hosted here. For the LIA-X video motion
autoencoder (40-d motion, 25 Hz), the model **code** is vendored under
`lia_x/` in the GitHub repo, so only the `lia-x.pt` weight file needs to be
fetched separately from
[wyhsirius/LIA-X](https://github.com/wyhsirius/LIA-X). Likewise, the code for
the fine-tuned WavLM-Large speaker encoder ships under
`speaker_verification/`; only its weights (`wavlm_large_finetune.pth`) need to
be obtained from
[Microsoft UniSpeech](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification).
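
Since the weights come from three different places, a quick preflight check can catch a missing file before inference starts. The following is a minimal sketch, not part of the official repo: `HF_FILES` and `missing_weights` are hypothetical names, and the relative paths are simply copied from the listing above, assuming the Hub snapshot was downloaded into a single root directory.

```python
from pathlib import Path

# Relative paths copied from the Contents listing (assumed to sit under one root).
HF_FILES = [
    "talker-t2av/model.safetensors",
    "talker-t2av/config.json",
    "talker-t2av/chat_template.jinja",
    "talker-t2av/tokenizer.json",
    "talker-t2av/tokenizer_config.json",
    "whisperx-vae/model.ckpt",
]


def missing_weights(hf_root, liax_ckpt, wavlm_ckpt):
    """Return the expected files that do not exist, as a list of path strings.

    hf_root    -- directory holding the downloaded Hub snapshot
    liax_ckpt  -- path to the separately fetched lia-x.pt
    wavlm_ckpt -- path to the separately fetched wavlm_large_finetune.pth
    """
    root = Path(hf_root)
    expected = [root / rel for rel in HF_FILES]
    expected += [Path(liax_ckpt), Path(wavlm_ckpt)]
    return [str(p) for p in expected if not p.is_file()]
```

An empty return value means every expected file is in place; otherwise the list names exactly which downloads are still outstanding.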

## Quickstart

```bash
git clone https://github.com/zhenye234/Talker-T2AV.git
cd Talker-T2AV

# Download the HF-hosted weights and point the checkpoint variables at them
huggingface-cli download HKUSTAudio/Talker-T2AV --local-dir ./hf_weights
export CHECKPOINT_DIR="$(pwd)/hf_weights/talker-t2av"
export WHISPERVAE_CKPT="$(pwd)/hf_weights/whisperx-vae/model.ckpt"

# The two extra weight files (their code is already vendored, so there is no need to clone those repos)
export LIAX_CKPT=/path/to/lia-x.pt
export WAVLM_CKPT=/path/to/wavlm_large_finetune.pth

python infer.py
```

See the [GitHub README](https://github.com/zhenye234/Talker-T2AV) for full
installation and reproduction instructions.
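
When inference is launched from a Python script or notebook rather than a shell, the same four variables can be set programmatically. The sketch below mirrors the `export` lines of the Quickstart. It assumes, as those exports suggest, that `infer.py` reads its checkpoint locations from the environment; `build_env` and `run_inference` are hypothetical helper names, not functions from the repo.

```python
import os
import subprocess
import sys
from pathlib import Path


def build_env(hf_root, liax_ckpt, wavlm_ckpt, base_env=None):
    """Return a copy of the environment with the four Quickstart variables set."""
    root = Path(hf_root).resolve()  # absolute path, like "$(pwd)/hf_weights"
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "CHECKPOINT_DIR": str(root / "talker-t2av"),
        "WHISPERVAE_CKPT": str(root / "whisperx-vae" / "model.ckpt"),
        "LIAX_CKPT": str(liax_ckpt),
        "WAVLM_CKPT": str(wavlm_ckpt),
    })
    return env


def run_inference(repo_dir, env):
    """Run infer.py from the cloned repository with the given environment."""
    subprocess.run([sys.executable, "infer.py"], cwd=repo_dir, env=env, check=True)
```

For example, `run_inference(".", build_env("hf_weights", "/path/to/lia-x.pt", "/path/to/wavlm_large_finetune.pth"))` would correspond to the shell session above when called from the root of the cloned repository.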

## Citation

```bibtex
@misc{ye2026talkert2avjointtalkingaudiovideo,
      title={Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling},
      author={Zhen Ye and Xu Tan and Aoxiong Yin and Hongzhan Lin and Guangyan Zhang and Peiwen Sun and Yiming Li and Chi-Min Chan and Wei Ye and Shikun Zhang and Wei Xue},
      year={2026},
      eprint={2604.23586},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.23586},
}
```