---
license: cc-by-nc-4.0
datasets:
- xg-chu/UniLSTalkDataset
language:
- en
---

# UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

       

Xuangeng Chu\*¹, Ruicong Liu\*¹†, Yifei Huang¹, Yun Liu², Yichen Peng³, Bo Zheng²

¹Shanda AI Research Tokyo, The University of Tokyo; ²Shanda AI Research Tokyo; ³Institute of Science Tokyo

\*Equal contribution, †Corresponding author

UniLS generates diverse and natural listening and speaking motions from audio.

## Installation

### Clone the project

```
git clone --recurse-submodules git@github.com:xg-chu/UniLS.git
cd UniLS
```

### Build environment

```
conda env create -f environment.yml
conda activate unils
```

Or install manually:

```
pip install torch torchvision torchaudio
pip install accelerate transformers peft einops omegaconf lmdb tqdm scipy wandb
```

### Pretrained Models

Download the pretrained models from [HuggingFace](https://huggingface.co/xg-chu/UniLS).

### Data

Download the dataset from [UniLS-Talk Dataset](https://huggingface.co/datasets/xg-chu/UniLSTalkDataset).

## Training

UniLS follows a three-stage training pipeline:

**Stage 1: Motion Codec (VAE)**

```
python train.py -c unils_codec
```

**Stage 2: Audio-Free Autoregressive Generator**

Set `VAE_PATH` in the config file to the Stage 1 checkpoint, then run:

```
python train.py -c unils_freegen
```

**Stage 3: Audio-Conditioned LoRA Fine-tuning**

Set `PRETRAIN_PATH` in the config file to the Stage 2 checkpoint, then run:

```
python train.py -c unils_loragen
```

## Evaluation

Run evaluation with multi-GPU support via Accelerate:

```
accelerate launch eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5
```

You can also pass an external dataset config to override the checkpoint's dataset:

```
accelerate launch eval.py -r /path/to/checkpoint --dataset configs/dataset.yaml
```

## Inference

### From Dataset

Generate visualizations from the dataset:

```
python infer_dataset.py -r /path/to/checkpoint --clip_length 20 --tau 1.0 --cfg 1.5 --num_samples 32
```

- `--resume_path, -r`: Path to the trained model checkpoint.
- `--dataset`: Path to a dataset YAML config (optional, uses checkpoint config by default).
- `--clip_length`: Duration of the generated clip in seconds (default: 20).
- `--tau`: Temperature for sampling (default: 1.0).
- `--cfg`: Classifier-free guidance scale (default: 1.5).
- `--num_samples, -n`: Number of samples to generate (default: 32).
- `--dump_dir, -d`: Output directory (default: `./render_results`).

### From Audio Files

Generate visualizations directly from audio files, supporting one or two speakers:

```
# Single speaker
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav

# Two speakers (dyadic conversation)
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav
```

- `--resume_path, -r`: Path to the trained model checkpoint.
- `--audio, -a`: Path to speaker 0 audio file.
- `--audio2`: Path to speaker 1 audio file (optional; if omitted, only speaker 0 motion is generated).
- `--tau`: Temperature for sampling (default: 1.0).
- `--cfg`: Classifier-free guidance scale (default: 1.5).
- `--dump_dir, -d`: Output directory (default: `./render_results`).

## Acknowledgements

Parts of our work are built on FLAME. We also thank the following projects:

- **FLAME**: https://flame.is.tue.mpg.de
- **EMICA**: https://github.com/radekd91/inferno

## Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{chu2025unils,
  title={UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking},
  author={Xuangeng Chu and Ruicong Liu and Yifei Huang and Yun Liu and Yichen Peng and Bo Zheng},
  year={2025},
  eprint={2512.09327},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.09327},
}
```
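
## Appendix: Sampling Parameters

The `--tau` and `--cfg` options of the evaluation and inference scripts are described above only as a sampling temperature and a classifier-free guidance scale. The sketch below shows how these two quantities are commonly combined when sampling a discrete token from logits. It is a generic illustration and not the UniLS implementation: the function names, the toy logits, and the assumption that guidance blends an audio-conditioned prediction with an audio-free one are all hypothetical.

```python
import math
import random


def guided_logits(cond, uncond, cfg_scale):
    """Blend conditional and unconditional logits (classifier-free guidance)."""
    # cfg_scale = 1.0 returns the conditional logits unchanged; larger values
    # push the result further away from the unconditional prediction.
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


def softmax_with_temperature(logits, tau):
    """Turn logits into probabilities after dividing by the temperature tau."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [x / tau for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(cond, uncond, tau=1.0, cfg_scale=1.5, rng=random):
    """Draw one token index from the guided, temperature-scaled distribution."""
    probs = softmax_with_temperature(guided_logits(cond, uncond, cfg_scale), tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    cond = [2.0, 0.5, -1.0]   # hypothetical audio-conditioned logits
    uncond = [1.0, 1.0, 0.0]  # hypothetical audio-free logits
    print(guided_logits(cond, uncond, cfg_scale=1.5))
    print(sample_token(cond, uncond, tau=1.0, cfg_scale=1.5, rng=random.Random(0)))
```

Under these assumptions, lowering `tau` concentrates probability on the highest-scoring tokens and gives less varied output, while raising `cfg_scale` above 1.0 strengthens the influence of the conditioning signal relative to the unconditional prediction.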