---
license: apache-2.0
language:
- zh
- en
tags:
- talking-head
- text-to-video
- audio-video-generation
- autoregressive
- diffusion
library_name: transformers
pipeline_tag: text-to-speech
---

# Talker-T2AV

**Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling**

[Paper (arXiv 2604.23586)](https://arxiv.org/abs/2604.23586) ·
[Code (GitHub)](https://github.com/zhenye234/Talker-T2AV) ·
[Samples](https://talker-t2av.github.io/)

This repository hosts the pretrained weights for the paper
"Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive
Diffusion Modeling".

## Contents

```
talker-t2av/
  model.safetensors            ← AR backbone (Qwen3-0.6B) + dual diffusion heads
                                  + Patch Transformer Encoder + Stop Predictor
  config.json
  chat_template.jinja
  tokenizer.json
  tokenizer_config.json

whisperx-vae/
  model.ckpt                   ← WhisperX-VAE audio autoencoder
                                  (32-d, 25 Hz; Whisper-Large-v3 encoder + DAC backbone)
```

Two further components are needed at inference time. Their model **code**
is already vendored in the GitHub repo, so only the weight files need to be
fetched separately. The LIA-X video motion autoencoder (40-d motion, 25 Hz)
has its code under `lia_x/`; download the `lia-x.pt` weight file from
[wyhsirius/LIA-X](https://github.com/wyhsirius/LIA-X). The fine-tuned
WavLM-Large speaker encoder has its code under `speaker_verification/`;
obtain the `wavlm_large_finetune.pth` weights from
[Microsoft UniSpeech](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification).

## Quickstart

```bash
git clone https://github.com/zhenye234/Talker-T2AV.git
cd Talker-T2AV

# Download the HF-hosted weights and point the environment variables at them
huggingface-cli download HKUSTAudio/Talker-T2AV --local-dir ./hf_weights
export CHECKPOINT_DIR="$(pwd)/hf_weights/talker-t2av"
export WHISPERVAE_CKPT="$(pwd)/hf_weights/whisperx-vae/model.ckpt"

# The two extra weight files are downloaded separately (their code is already vendored, so there is no need to clone those repos)
export LIAX_CKPT=/path/to/lia-x.pt
export WAVLM_CKPT=/path/to/wavlm_large_finetune.pth

python infer.py
```
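
Since `infer.py` depends on four separately exported paths, it can help to confirm that each one resolves to a real file before launching inference. The following is a minimal sketch of such a preflight check. The `check_weights` function is a hypothetical helper and is not part of the Talker-T2AV repo; it only assumes the environment variable names from the Quickstart and the `model.safetensors` and `config.json` files listed under Contents.

```bash
# Hypothetical preflight helper (not part of the Talker-T2AV repo).
# Returns 0 when every expected weight file exists, otherwise the
# number of missing files, and reports each missing path on stderr.
check_weights() {
  missing=0
  # CHECKPOINT_DIR is a directory; the other three variables are file paths.
  for f in "${CHECKPOINT_DIR:-}/model.safetensors" \
           "${CHECKPOINT_DIR:-}/config.json" \
           "${WHISPERVAE_CKPT:-}" \
           "${LIAX_CKPT:-}" \
           "${WAVLM_CKPT:-}"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}
```

With the function defined in the current shell, `check_weights && python infer.py` would start inference only when all five files are present. An unset variable expands to an empty path, so it is reported as missing rather than silently skipped.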

See the [GitHub README](https://github.com/zhenye234/Talker-T2AV) for full
installation and reproduction instructions.

## Citation

```bibtex
@misc{ye2026talkert2avjointtalkingaudiovideo,
      title={Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling},
      author={Zhen Ye and Xu Tan and Aoxiong Yin and Hongzhan Lin and Guangyan Zhang and Peiwen Sun and Yiming Li and Chi-Min Chan and Wei Ye and Shikun Zhang and Wei Xue},
      year={2026},
      eprint={2604.23586},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.23586},
}
```