---
license: cc-by-nc-4.0
datasets:
- xg-chu/UniLSTalkDataset
language:
- en
---
<h1 align="center"><b>UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking</b></h1>
<h3 align="center">
<a href='https://arxiv.org/abs/2512.09327'><img src='https://img.shields.io/badge/ArXiv-PDF-red'></a> &nbsp;
<a href='https://xg-chu.site/project_unils/'><img src='https://img.shields.io/badge/Project-Page-blue'></a> &nbsp;
<a href='https://huggingface.co/xg-chu/UniLS'><img src='https://img.shields.io/badge/HuggingFace-Weights-yellow'></a> &nbsp;
<a href='https://huggingface.co/datasets/xg-chu/UniLSTalkDataset'><img src='https://img.shields.io/badge/HuggingFace-Dataset-yellow'></a> &nbsp;
</h3>

<h5 align="center">
<a href="https://xg-chu.site">Xuangeng Chu</a><sup>*1</sup>&emsp;
<a href="https://ruicongliu.github.io">Ruicong Liu</a><sup>*1&dagger;</sup>&emsp;
<a href="https://hyf015.github.io">Yifei Huang</a><sup>1</sup>&emsp;
<a href="https://scholar.google.com/citations?user=5mbpi0kAAAAJ&hl=zh-TW">Yun Liu</a><sup>2</sup>&emsp;
<a href="https://puckikk1202.github.io">Yichen Peng</a><sup>3</sup>&emsp;
<a href="http://www.bozheng-lab.com">Bo Zheng</a><sup>2</sup>
<br>
<sup>1</sup>Shanda AI Research Tokyo, The University of Tokyo,
<sup>2</sup>Shanda AI Research Tokyo,
<sup>3</sup>Institute of Science Tokyo
<br>
<sup>*</sup>Equal contribution,
<sup>&dagger;</sup>Corresponding author
</h5>

<div align="center">
<b>
UniLS generates diverse and natural listening and speaking motions from audio.
</b>
</div>
37
+
38
+ ## Installation
39
+ ### Clone the project
40
+ ```
41
+ git clone --recurse-submodules git@github.com:xg-chu/UniLS.git
42
+ cd UniLS
43
+ ```

### Build environment
```
conda env create -f environment.yml
conda activate unils
```
Or install manually:
```
pip install torch torchvision torchaudio
pip install accelerate transformers peft einops omegaconf lmdb tqdm scipy wandb
```

### Pretrained Models
Download the pretrained models from [HuggingFace](https://huggingface.co/xg-chu/UniLS).

### Data
Download the dataset from [UniLS-Talk Dataset](https://huggingface.co/datasets/xg-chu/UniLSTalkDataset).

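If you prefer scripting the downloads, both repos can be fetched with `huggingface_hub`. This is a minimal sketch, not part of the repo; the local target directories are arbitrary placeholders:

```python
from huggingface_hub import snapshot_download

def fetch_unils_assets(weights_dir="./pretrained/UniLS",
                       data_dir="./data/UniLSTalkDataset"):
    """Download the UniLS weights and the UniLS-Talk dataset from the Hub."""
    # Pretrained weights live in a model repo.
    snapshot_download(repo_id="xg-chu/UniLS", local_dir=weights_dir)
    # The dataset repo needs repo_type="dataset".
    snapshot_download(
        repo_id="xg-chu/UniLSTalkDataset",
        repo_type="dataset",
        local_dir=data_dir,
    )
```

Call `fetch_unils_assets()` once; `snapshot_download` caches files, so re-runs are cheap.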
## Training

UniLS follows a three-stage training pipeline:

**Stage 1: Motion Codec (VAE)**
```
python train.py -c unils_codec
```

**Stage 2: Audio-Free Autoregressive Generator**

Set `VAE_PATH` in the config file to the Stage 1 checkpoint, then run:
```
python train.py -c unils_freegen
```

**Stage 3: Audio-Conditioned LoRA Fine-Tuning**

Set `PRETRAIN_PATH` in the config file to the Stage 2 checkpoint, then run:
```
python train.py -c unils_loragen
```

## Evaluation
Run evaluation with multi-GPU support via Accelerate:
```
accelerate launch eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5
```
You can also pass an external dataset config to override the checkpoint's dataset:
```
accelerate launch eval.py -r /path/to/checkpoint --dataset configs/dataset.yaml
```

## Inference

### From Dataset
Generate visualizations from the dataset:
```
python infer_dataset.py -r /path/to/checkpoint --clip_length 20 --tau 1.0 --cfg 1.5 --num_samples 32
```
- `--resume_path, -r`: Path to the trained model checkpoint.
- `--dataset`: Path to a dataset YAML config (optional; uses the checkpoint's config by default).
- `--clip_length`: Duration of the generated clip in seconds (default: 20).
- `--tau`: Sampling temperature (default: 1.0).
- `--cfg`: Classifier-free guidance scale (default: 1.5).
- `--num_samples, -n`: Number of samples to generate (default: 32).
- `--dump_dir, -d`: Output directory (default: `./render_results`).

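`--tau` and `--cfg` are the standard temperature and classifier-free-guidance controls for autoregressive sampling. A NumPy sketch of what they do to a single token draw (the logits and codebook size are made up for illustration; this is not code from the repo):

```python
import numpy as np

def sample_token(cond_logits, uncond_logits, tau=1.0, cfg=1.5, seed=0):
    """One guided sampling step: cfg blends conditional and unconditional
    logits, tau rescales them before the softmax (lower tau = less diverse)."""
    guided = uncond_logits + cfg * (cond_logits - uncond_logits)
    z = guided / tau
    probs = np.exp(z - z.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.default_rng(seed).choice(len(probs), p=probs))

# Made-up logits for a 4-token motion codebook.
cond = np.array([2.0, 0.5, -1.0, 0.0])
uncond = np.array([0.5, 0.5, 0.0, 0.0])
token = sample_token(cond, uncond, tau=1.0, cfg=1.5)
```

With `cfg=1.0` the unconditional term cancels and this reduces to plain temperature sampling; values above 1.0 push samples toward the audio-conditioned distribution.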
### From Audio Files
Generate visualizations directly from audio files, with support for one or two speakers:
```
# Single speaker
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav

# Two speakers (dyadic conversation)
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav
```
- `--resume_path, -r`: Path to the trained model checkpoint.
- `--audio, -a`: Path to the speaker 0 audio file.
- `--audio2`: Path to the speaker 1 audio file (optional; if omitted, only speaker 0 motion is generated).
- `--tau`: Sampling temperature (default: 1.0).
- `--cfg`: Classifier-free guidance scale (default: 1.5).
- `--dump_dir, -d`: Output directory (default: `./render_results`).

## Acknowledgements

Part of our work builds on FLAME. We also thank the following projects:
- **FLAME**: https://flame.is.tue.mpg.de
- **EMICA**: https://github.com/radekd91/inferno

## Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{chu2025unils,
    title={UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking},
    author={Xuangeng Chu and Ruicong Liu and Yifei Huang and Yun Liu and Yichen Peng and Bo Zheng},
    year={2025},
    eprint={2512.09327},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2512.09327},
}
```