
	---
	license: apache-2.0
	language:
	- zh
	- en
	tags:
	- talking-head
	- text-to-video
	- audio-video-generation
	- autoregressive
	- diffusion
	library_name: transformers
	pipeline_tag: text-to-speech
	---

	# Talker-T2AV

	Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

	[Paper (arXiv 2604.23586)](https://arxiv.org/abs/2604.23586) ·
	[Code (GitHub)](https://github.com/zhenye234/Talker-T2AV) ·
	[Samples](https://talker-t2av.github.io/)

	This repository hosts the pretrained weights for the paper
	"Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive
	Diffusion Modeling".

	## Contents

	```
	talker-t2av/
	  model.safetensors        ← AR backbone (Qwen3-0.6B) + dual diffusion heads
	                             + Patch Transformer Encoder + Stop Predictor
	  config.json
	  chat_template.jinja
	  tokenizer.json
	  tokenizer_config.json

	whisperx-vae/
	  model.ckpt               ← WhisperX-VAE audio autoencoder
	                             (32-d, 25 Hz; Whisper-Large-v3 encoder + DAC backbone)
	```
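
`model.safetensors` bundles several components in one file. To see how they are grouped, the tensor names can be listed without loading the weights. The following is a minimal sketch, assuming the `safetensors` and `numpy` packages are installed and that the checkpoint's tensor names are dot-separated module paths (an assumption; this model card does not describe the naming). Both helper names are hypothetical and are not part of the repository.

```python
import os
from collections import Counter


def count_by_prefix(tensor_names):
    """Count tensors per top-level prefix (the text before the first dot)."""
    return Counter(name.split(".", 1)[0] for name in tensor_names)


def checkpoint_tensor_names(path):
    """List tensor names in a .safetensors file without loading the weights."""
    # Imported lazily so the counting helper works without the package.
    from safetensors import safe_open

    with safe_open(path, framework="np") as f:
        return list(f.keys())


if __name__ == "__main__":
    # Assumed location: the --local-dir used in the Quickstart.
    path = "hf_weights/talker-t2av/model.safetensors"
    if os.path.isfile(path):
        counts = count_by_prefix(checkpoint_tensor_names(path))
        for prefix, n in sorted(counts.items()):
            print(f"{prefix}: {n} tensors")
```

The printed prefixes depend entirely on how the authors named their modules, so treat the output as a quick inventory rather than a specification.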

	Two additional weight files are not hosted here. The model code for the
	LIA-X video motion autoencoder (40-d motion, 25 Hz) is vendored under
	`lia_x/` in the GitHub repo, so only the `lia-x.pt` weight file needs to
	be fetched separately from
	[wyhsirius/LIA-X](https://github.com/wyhsirius/LIA-X). Likewise, the code
	for the WavLM-Large fine-tuned speaker encoder ships under
	`speaker_verification/`, and only its weights
	(`wavlm_large_finetune.pth`) need to be obtained from
	[Microsoft UniSpeech](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification).
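
Both latent streams run at 25 Hz, so one audio latent frame and one motion latent frame each cover the same 40 ms, and a clip nominally yields the same number of frames in both streams. The arithmetic can be sketched as follows. The function name is hypothetical, and any padding, patching, or grouping done inside the model is not reflected here.

```python
AUDIO_DIM = 32       # WhisperX-VAE latent width (from the Contents listing)
MOTION_DIM = 40      # LIA-X motion latent width
LATENT_RATE_HZ = 25  # both latent streams run at 25 Hz


def latent_shapes(duration_s):
    """Return nominal (frames, dim) shapes for the audio and motion latents."""
    frames = round(duration_s * LATENT_RATE_HZ)
    return (frames, AUDIO_DIM), (frames, MOTION_DIM)


audio_shape, motion_shape = latent_shapes(4.0)
print(audio_shape, motion_shape)  # (100, 32) (100, 40)
```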

	## Quickstart

	```bash
	git clone https://github.com/zhenye234/Talker-T2AV.git
	cd Talker-T2AV

	# put the HF-hosted weights in place
	huggingface-cli download HKUSTAudio/Talker-T2AV --local-dir ./hf_weights
	export CHECKPOINT_DIR="$(pwd)/hf_weights/talker-t2av"
	export WHISPERVAE_CKPT="$(pwd)/hf_weights/whisperx-vae/model.ckpt"

	# the two extra weight files (code already vendored — no need to clone the repos)
	export LIAX_CKPT=/path/to/lia-x.pt
	export WAVLM_CKPT=/path/to/wavlm_large_finetune.pth

	python infer.py
	```
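
The Quickstart passes four paths to `infer.py` through environment variables, and a missing or mistyped path is an easy mistake to make. A small pre-flight check could look like the sketch below. It assumes these four variables are the only paths required, which the Quickstart implies but does not state; `unresolved_weights` is a hypothetical helper, not part of the repository.

```python
import os

# Environment variables exported in the Quickstart.
WEIGHT_VARS = ["CHECKPOINT_DIR", "WHISPERVAE_CKPT", "LIAX_CKPT", "WAVLM_CKPT"]


def unresolved_weights(env=os.environ):
    """Map each variable that is unset or points at a missing path to a reason."""
    problems = {}
    for name in WEIGHT_VARS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not os.path.exists(value):
            problems[name] = f"path does not exist: {value}"
    return problems


if __name__ == "__main__":
    for name, reason in unresolved_weights().items():
        print(f"{name}: {reason}")
```

Running it in the same shell before `python infer.py` prints one line per problem and nothing when all four paths resolve.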

	See the [GitHub README](https://github.com/zhenye234/Talker-T2AV) for full
	installation and reproduction instructions.

	## Citation

	```bibtex
	@misc{ye2026talkert2avjointtalkingaudiovideo,
	  title={Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling},
	  author={Zhen Ye and Xu Tan and Aoxiong Yin and Hongzhan Lin and Guangyan Zhang and Peiwen Sun and Yiming Li and Chi-Min Chan and Wei Ye and Shikun Zhang and Wei Xue},
	  year={2026},
	  eprint={2604.23586},
	  archivePrefix={arXiv},
	  primaryClass={cs.CV},
	  url={https://arxiv.org/abs/2604.23586},
	}
	```