---
license: apache-2.0
language:
- zh
- en
tags:
- talking-head
- text-to-video
- audio-video-generation
- autoregressive
- diffusion
library_name: transformers
pipeline_tag: text-to-speech
---
# Talker-T2AV
**Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling**
[Paper (arXiv 2604.23586)](https://arxiv.org/abs/2604.23586) ·
[Code (GitHub)](https://github.com/zhenye234/Talker-T2AV) ·
[Samples](https://talker-t2av.github.io/)
This repository hosts the pretrained weights for the paper
"Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive
Diffusion Modeling".
## Contents
```
talker-t2av/
  model.safetensors        ← AR backbone (Qwen3-0.6B) + dual diffusion heads
                             + Patch Transformer Encoder + Stop Predictor
  config.json
  chat_template.jinja
  tokenizer.json
  tokenizer_config.json
whisperx-vae/
  model.ckpt               ← WhisperX-VAE audio autoencoder
                             (32-d, 25 Hz; Whisper-Large-v3 encoder + DAC backbone)
```
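After downloading, it can be useful to confirm that every file in the tree above is present before running inference. The following is a minimal sketch and not part of the repository: the helper name `missing_files` is hypothetical, and the file list is simply copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the Contents tree in this README.
EXPECTED_FILES = [
    "talker-t2av/model.safetensors",
    "talker-t2av/config.json",
    "talker-t2av/chat_template.jinja",
    "talker-t2av/tokenizer.json",
    "talker-t2av/tokenizer_config.json",
    "whisperx-vae/model.ckpt",
]


def missing_files(root):
    """Return the expected relative paths that are not regular files under `root`.

    `missing_files` is a hypothetical helper written for this README; it only
    checks for presence and does not validate file contents.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files("./hf_weights")` on a complete download should return an empty list; any path it returns is a file that still needs to be fetched.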
For the LIA-X video motion autoencoder (40-d motion, 25 Hz), the model
**code** is vendored under `lia_x/` in the GitHub repo, so only the
`lia-x.pt` weight file needs to be fetched separately from
[wyhsirius/LIA-X](https://github.com/wyhsirius/LIA-X). Likewise, the code
for the fine-tuned WavLM-Large speaker encoder ships under
`speaker_verification/`, and only the `wavlm_large_finetune.pth` weights
need to be obtained from
[Microsoft UniSpeech](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification).
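Both autoencoders run at 25 Hz, so each latent frame covers 40 ms and the audio and motion streams have the same number of frames for a given clip. The sketch below works out the resulting latent shapes from the figures quoted in this README (32-d audio, 40-d motion). The function name `latent_shapes` is hypothetical, and flooring the frame count is an assumption; the actual padding and rounding behaviour of the autoencoders is not described here.

```python
LATENT_RATE_HZ = 25  # frame rate shared by both autoencoders (from this README)
AUDIO_DIM = 32       # WhisperX-VAE latent width (from this README)
MOTION_DIM = 40      # LIA-X motion latent width (from this README)


def latent_shapes(duration_s):
    """Return ((frames, 32), (frames, 40)) for a clip lasting `duration_s` seconds.

    Assumption: the frame count is floor(duration_s * 25). The real models may
    pad or round differently.
    """
    frames = int(duration_s * LATENT_RATE_HZ)
    return (frames, AUDIO_DIM), (frames, MOTION_DIM)
```

Under this assumption, a 4-second clip corresponds to 100 audio latents of width 32 and 100 motion latents of width 40.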
## Quickstart
```bash
git clone https://github.com/zhenye234/Talker-T2AV.git
cd Talker-T2AV
# put the HF-hosted weights in place
huggingface-cli download HKUSTAudio/Talker-T2AV --local-dir ./hf_weights
export CHECKPOINT_DIR="$(pwd)/hf_weights/talker-t2av"
export WHISPERVAE_CKPT="$(pwd)/hf_weights/whisperx-vae/model.ckpt"
# the two extra weight files (their code is already vendored, so there is no need to clone the repos)
export LIAX_CKPT=/path/to/lia-x.pt
export WAVLM_CKPT=/path/to/wavlm_large_finetune.pth
python infer.py
```
See the [GitHub README](https://github.com/zhenye234/Talker-T2AV) for full
installation and reproduction instructions.
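A missing or mistyped path is an easy mistake to make with four separate exports. The sketch below is a small pre-flight check that could be run before `python infer.py`. It assumes that `infer.py` reads the four variables from the environment, as the exports in the Quickstart suggest; the helper name `check_env` is hypothetical and is not part of the repository.

```python
import os
from pathlib import Path

# Variable names taken from the Quickstart exports in this README.
REQUIRED_VARS = ["CHECKPOINT_DIR", "WHISPERVAE_CKPT", "LIAX_CKPT", "WAVLM_CKPT"]


def check_env(env=None):
    """Return a list of problems; an empty list means every path exists.

    `env` defaults to os.environ. Passing a plain dict makes the function
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).exists():
            problems.append(f"{name} points to a missing path: {value}")
    return problems
```

For example, `for problem in check_env(): print(problem)` prints one line per unset variable or missing path, and prints nothing when all four are in place.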
## Citation
```bibtex
@misc{ye2026talkert2avjointtalkingaudiovideo,
title={Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling},
author={Zhen Ye and Xu Tan and Aoxiong Yin and Hongzhan Lin and Guangyan Zhang and Peiwen Sun and Yiming Li and Chi-Min Chan and Wei Ye and Shikun Zhang and Wei Xue},
year={2026},
eprint={2604.23586},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.23586},
}
```