---
license: cc-by-nc-4.0
datasets:
- xg-chu/UniLSTalkDataset
language:
- en
---
<h1 align="center"><b>UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking</b></h1>
<h3 align="center">
<a href='https://arxiv.org/abs/2512.09327'><img src='https://img.shields.io/badge/ArXiv-PDF-red'></a>
<a href='https://xg-chu.site/project_unils/'><img src='https://img.shields.io/badge/Project-Page-blue'></a>
<a href='https://huggingface.co/xg-chu/UniLS'><img src='https://img.shields.io/badge/HuggingFace-Weights-yellow'></a>
<a href='https://huggingface.co/datasets/xg-chu/UniLSTalkDataset'><img src='https://img.shields.io/badge/HuggingFace-Dataset-yellow'></a>
</h3>
<h5 align="center">
<a href="https://xg-chu.site">Xuangeng Chu</a><sup>*1</sup> 
<a href="https://ruicongliu.github.io">Ruicong Liu</a><sup>*1†</sup> 
<a href="https://hyf015.github.io">Yifei Huang</a><sup>1</sup> 
<a href="https://scholar.google.com/citations?user=5mbpi0kAAAAJ&hl=zh-TW">Yun Liu</a><sup>2</sup> 
<a href="https://puckikk1202.github.io">Yichen Peng</a><sup>3</sup> 
<a href="http://www.bozheng-lab.com">Bo Zheng</a><sup>2</sup>
<br>
<sup>1</sup>Shanda AI Research Tokyo, The University of Tokyo,
<sup>2</sup>Shanda AI Research Tokyo,
<sup>3</sup>Institute of Science Tokyo
<br>
<sup>*</sup>Equal contribution,
<sup>†</sup>Corresponding author
</h5>
<div align="center">
<b>
UniLS generates diverse and natural listening and speaking motions from audio.
</b>
</div>
## Installation
### Clone the project
```bash
git clone --recurse-submodules git@github.com:xg-chu/UniLS.git
cd UniLS
```
### Build environment
```bash
conda env create -f environment.yml
conda activate unils
```
Or install manually:
```bash
pip install torch torchvision torchaudio
pip install accelerate transformers peft einops omegaconf lmdb tqdm scipy wandb
```
### Pretrained Models
Download the pretrained models from [HuggingFace](https://huggingface.co/xg-chu/UniLS).
### Data
Download the dataset from [UniLS-Talk Dataset](https://huggingface.co/datasets/xg-chu/UniLSTalkDataset).
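If you prefer scripting the downloads, both repos can be fetched with the `huggingface_hub` client. A minimal sketch, assuming `huggingface_hub` is installed; the function name and local directories are illustrative choices, not paths the code expects:

```python
from huggingface_hub import snapshot_download

def fetch_unils(weights_dir="./pretrained/UniLS", data_dir="./datasets/UniLSTalk"):
    """Download the UniLS weights and the UniLS-Talk dataset.

    Directory defaults are illustrative; the repo ids are the official ones.
    """
    # Pretrained UniLS weights (model repo)
    snapshot_download(repo_id="xg-chu/UniLS", local_dir=weights_dir)
    # UniLS-Talk dataset (dataset repo)
    snapshot_download(
        repo_id="xg-chu/UniLSTalkDataset",
        repo_type="dataset",
        local_dir=data_dir,
    )
```

Calling `fetch_unils()` pulls both repos into the given directories; the `huggingface-cli download` command works equally well.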
## Training
UniLS follows a three-stage training pipeline:
**Stage 1: Motion Codec (VAE)**
```bash
python train.py -c unils_codec
```
**Stage 2: Audio-Free Autoregressive Generator**
Set `VAE_PATH` in the config file to the Stage 1 checkpoint path, then run:
```bash
python train.py -c unils_freegen
```
**Stage 3: Audio-Conditioned LoRA Fine-tuning**
Set `PRETRAIN_PATH` in the config file to the Stage 2 checkpoint path, then run:
```bash
python train.py -c unils_loragen
```
## Evaluation
Run evaluation with multi-GPU support via Accelerate:
```bash
accelerate launch eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5
```
You can also pass an external dataset config to override the checkpoint's dataset:
```bash
accelerate launch eval.py -r /path/to/checkpoint --dataset configs/dataset.yaml
```
## Inference
### From Dataset
Generate visualizations from the dataset:
```bash
python infer_dataset.py -r /path/to/checkpoint --clip_length 20 --tau 1.0 --cfg 1.5 --num_samples 32
```
- `--resume_path, -r`: Path to the trained model checkpoint.
- `--dataset`: Path to a dataset YAML config (optional, uses checkpoint config by default).
- `--clip_length`: Duration of the generated clip in seconds (default: 20).
- `--tau`: Temperature for sampling (default: 1.0).
- `--cfg`: Classifier-free guidance scale (default: 1.5).
- `--num_samples, -n`: Number of samples to generate (default: 32).
- `--dump_dir, -d`: Output directory (default: `./render_results`).
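For intuition, `--tau` and `--cfg` typically act on the autoregressive logits as sketched below. This is an illustrative sketch of how temperature and classifier-free guidance combine in token samplers generally, not the exact UniLS implementation; the function and variable names are ours:

```python
import numpy as np

def sample_next_token(cond_logits, uncond_logits, tau=1.0, cfg=1.5, rng=None):
    """One step of classifier-free-guided, temperature-scaled sampling."""
    if rng is None:
        rng = np.random.default_rng()
    # CFG: push conditional logits away from the unconditional ones;
    # cfg=1.0 reduces to plain conditional sampling.
    logits = uncond_logits + cfg * (cond_logits - uncond_logits)
    # Temperature: tau < 1 sharpens the distribution, tau > 1 flattens it.
    logits = logits / tau
    # Softmax with max-subtraction for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

Higher `--cfg` makes the motion follow the audio condition more strictly at the cost of diversity; higher `--tau` increases motion diversity at the cost of stability.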
### From Audio Files
Generate visualizations directly from audio files, supporting one or two speakers:
```bash
# Single speaker
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav
# Two speakers (dyadic conversation)
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav
```
- `--resume_path, -r`: Path to the trained model checkpoint.
- `--audio, -a`: Path to speaker 0 audio file.
- `--audio2`: Path to speaker 1 audio file (optional; if omitted, only speaker 0 motion is generated).
- `--tau`: Temperature for sampling (default: 1.0).
- `--cfg`: Classifier-free guidance scale (default: 1.5).
- `--dump_dir, -d`: Output directory (default: `./render_results`).
## Acknowledgements
Part of our work is built on FLAME. We also thank the following projects:
- **FLAME**: https://flame.is.tue.mpg.de
- **EMICA**: https://github.com/radekd91/inferno
## Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{chu2025unils,
  title={UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking},
  author={Xuangeng Chu and Ruicong Liu and Yifei Huang and Yun Liu and Yichen Peng and Bo Zheng},
  year={2025},
  eprint={2512.09327},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.09327},
}
```