---
license: apache-2.0
language:
- zh
- en
tags:
- talking-head
- text-to-video
- audio-video-generation
- autoregressive
- diffusion
library_name: transformers
pipeline_tag: text-to-speech
---

# Talker-T2AV

**Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling**

[Paper (arXiv 2604.23586)](https://arxiv.org/abs/2604.23586) · [Code (GitHub)](https://github.com/zhenye234/Talker-T2AV) · [Samples](https://talker-t2av.github.io/)

This repository hosts the pretrained weights for the paper "Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling".

## Contents

```
talker-t2av/
  model.safetensors        ← AR backbone (Qwen3-0.6B) + dual diffusion heads + Patch Transformer Encoder + Stop Predictor
  config.json
  chat_template.jinja
  tokenizer.json
  tokenizer_config.json
whisperx-vae/
  model.ckpt               ← WhisperX-VAE audio autoencoder (32-d, 25 Hz; Whisper-Large-v3 encoder + DAC backbone)
```

For the LIA-X video motion autoencoder (40-d motion, 25 Hz), the model **code** is vendored under `lia_x/` in the GitHub repo; only the `lia-x.pt` weight file needs to be fetched separately from [wyhsirius/LIA-X](https://github.com/wyhsirius/LIA-X). The WavLM-Large fine-tuned speaker encoder (`wavlm_large_finetune.pth`) similarly ships its code under `speaker_verification/`; only the `.pth` weights need to be obtained from [Microsoft UniSpeech](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification).
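
Because the weights come from three different places, it can help to confirm that every path resolves before launching inference. The following is a minimal sketch and is not part of the repository. It assumes the four environment variable names exported in the Quickstart on this card; `EXPECTED` and `find_missing_weights` are hypothetical names introduced here for illustration, and the directory-versus-file expectations are inferred from the file layout above.

```python
import os
from pathlib import Path

# Environment variables exported in this card's Quickstart, mapped to the kind
# of path each is expected to hold. The dir/file split is an assumption based
# on the file layout listed under "Contents".
EXPECTED = {
    "CHECKPOINT_DIR": "dir",    # folder holding model.safetensors, config.json, ...
    "WHISPERVAE_CKPT": "file",  # whisperx-vae/model.ckpt
    "LIAX_CKPT": "file",        # lia-x.pt, fetched separately
    "WAVLM_CKPT": "file",       # wavlm_large_finetune.pth, fetched separately
}


def find_missing_weights(env=None):
    """Return {variable: reason} for each weight path that is unset or absent."""
    env = os.environ if env is None else env
    problems = {}
    for name, kind in EXPECTED.items():
        value = env.get(name)
        if not value:
            problems[name] = "not set"
            continue
        path = Path(value).expanduser()
        exists = path.is_dir() if kind == "dir" else path.is_file()
        if not exists:
            problems[name] = f"{kind} not found: {path}"
    return problems
```

Calling `find_missing_weights()` with no argument reads the current process environment and returns an empty dict when all four paths are in place, so it can serve as a quick guard before running `python infer.py`.
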
## Quickstart

```bash
git clone https://github.com/zhenye234/Talker-T2AV.git
cd Talker-T2AV

# put the HF-hosted weights in place
huggingface-cli download HKUSTAudio/Talker-T2AV --local-dir ./hf_weights
export CHECKPOINT_DIR="$(pwd)/hf_weights/talker-t2av"
export WHISPERVAE_CKPT="$(pwd)/hf_weights/whisperx-vae/model.ckpt"

# the two extra weight files (code already vendored; no need to clone the repos)
export LIAX_CKPT=/path/to/lia-x.pt
export WAVLM_CKPT=/path/to/wavlm_large_finetune.pth

python infer.py
```

See the [GitHub README](https://github.com/zhenye234/Talker-T2AV) for full installation and reproduction instructions.

## Citation

```bibtex
@misc{ye2026talkert2avjointtalkingaudiovideo,
      title={Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling},
      author={Zhen Ye and Xu Tan and Aoxiong Yin and Hongzhan Lin and Guangyan Zhang and Peiwen Sun and Yiming Li and Chi-Min Chan and Wei Ye and Shikun Zhang and Wei Xue},
      year={2026},
      eprint={2604.23586},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.23586},
}
```