<h2 align="center"<strong>MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space</strong></h2>
  <p align="center">
    <a href='https://li-xingxiao.github.io/homepage/' target='_blank'>Lixing Xiao</a><sup>1</sup>
    ·
    <a href='https://shunlinlu.github.io/' target='_blank'>Shunlin Lu</a> <sup>2</sup>
    ·
    <a href='https://phj128.github.io/' target='_blank'>Huaijin Pi</a><sup>3</sup>
    ·
    <a href='https://vankouf.github.io/' target='_blank'>Ke Fan</a><sup>4</sup>
    ·
    <a href='https://liangpan99.github.io/' target='_blank'>Liang Pan</a><sup>3</sup>
    ·
    <a href='mailto:yueezhou7@gmail.com' target='_blank'>Yueer Zhou</a><sup>1</sup>
    ·
    <a href='https://dblp.org/pid/120/4362.html/' target='_blank'>Ziyong Feng</a><sup>5</sup>
    ·
    <br>
    <a href='https://www.xzhou.me/' target='_blank'>Xiaowei Zhou</a><sup>1</sup>
    ·
    <a href='https://pengsida.net/' target='_blank'>Sida Peng</a><sup>1†</sup>
    ·
     <a href='https://wangjingbo1219.github.io/' target='_blank'>Jingbo Wang</a><sup>6</sup>
    <br>
    <br>
    <sup>1</sup>Zhejiang University  <sup>2</sup>The Chinese University of Hong Kong, Shenzhen  <sup>3</sup>The University of Hong Kong  <br><sup>4</sup>Shanghai Jiao Tong University  <sup>5</sup>DeepGlint  <sup>6</sup>Shanghai AI Lab
    <br>
    <strong>ICCV 2025</strong>
    
  </p>
<p align="center">
  <a href='https://arxiv.org/abs/2503.15451'>
    <img src='https://img.shields.io/badge/Arxiv-2503.15451-A42C25?style=flat&logo=arXiv&logoColor=A42C25'></a>
  <a href='https://arxiv.org/pdf/2503.15451'>
    <img src='https://img.shields.io/badge/Paper-PDF-blue?style=flat&logo=arXiv&logoColor=blue'></a>
  <a href='https://zju3dv.github.io/MotionStreamer/'>
    <img src='https://img.shields.io/badge/Project-Page-green?style=flat&logo=Google%20chrome&logoColor=green'></a>
  <a href='https://huggingface.co/datasets/lxxiao/272-dim-HumanML3D'>
    <img src='https://img.shields.io/badge/Data-Download-yellow?style=flat&logo=huggingface&logoColor=yellow'></a>
</p>

<img width="1385" alt="image" src="assets/teaser.jpg"/>

## 🔥 News

- **[2025-06]** MotionStreamer has been accepted to ICCV 2025! 🎉
  
## TODO List

- [x] Release the processing script for the 272-dim motion representation.
- [x] Release the processed 272-dim motion representation of the [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset (for academic use only).
- [x] Release the training code and checkpoint of our [TMR](https://github.com/Mathux/TMR)-based motion evaluator trained on the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset.
- [x] Release the training and evaluation code as well as the checkpoint of the Causal TAE.
- [x] Release the training code of the original motion generation model and the streaming generation model (MotionStreamer).
- [x] Release the checkpoint and demo inference code of the original motion generation model.
- [ ] Release the complete code for MotionStreamer.

## 🏃 Motion Representation
For details on how to obtain the 272-dim motion representation, as well as other useful tools (e.g., visualization and conversion to BVH format), please refer to our [GitHub repo](https://github.com/Li-xingXiao/272-dim-Motion-Representation).

## Installation

### 🐍 Python Virtual Environment
```sh
conda env create -f environment.yaml
conda activate mgpt
```

### 🤗 Hugging Face Mirror
All of our models and data are hosted on Hugging Face. If Hugging Face is not directly accessible from your network, you can route downloads through the HF-Mirror endpoint:
```sh
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
```
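
If you prefer fetching from Python rather than the CLI, here is a minimal sketch using `huggingface_hub` (the repo ID is the HumanML3D dataset repo used in the next section; adapt it to whichever model or dataset you need):
```python
import os

# Route all huggingface_hub traffic through the HF-Mirror endpoint;
# this must be set before the first download call.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# Example: download the processed 272-dim HumanML3D dataset.
snapshot_download(
    repo_id="lxxiao/272-dim-HumanML3D",
    repo_type="dataset",
    local_dir="./humanml3d_272",
)
```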

## 📥 Data Preparation
To facilitate researchers, we provide the processed 272-dim motion representation of:
> HumanML3D dataset at [this link](https://huggingface.co/datasets/lxxiao/272-dim-HumanML3D).

> BABEL dataset at [this link](https://huggingface.co/datasets/lxxiao/272-dim-BABEL).

❗️❗️❗️ The processed data is for academic purposes only. Make sure you read through the [AMASS License](https://amass.is.tue.mpg.de/license.html) before use.

1. Download the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset as follows:
```bash
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-HumanML3D --local-dir ./humanml3d_272
cd ./humanml3d_272
unzip texts.zip
unzip motion_data.zip
```
The dataset is organized as:
```
./humanml3d_272
  ├── mean_std
      ├── Mean.npy
      ├── Std.npy
  ├── split
      ├── train.txt
      ├── val.txt
      ├── test.txt
  ├── texts
      ├── 000000.txt
      ...
  ├── motion_data
      ├── 000000.npy
      ...
```
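
As a quick sanity check after unzipping, the sketch below loads one clip and applies the dataset's Mean/Std statistics. This mirrors standard z-score normalization for this representation, but it is an illustrative assumption, not the repo's exact data-loading code:
```python
import numpy as np

root = "./humanml3d_272"

# Per-dimension statistics shipped with the dataset (see the tree above).
mean = np.load(f"{root}/mean_std/Mean.npy")
std = np.load(f"{root}/mean_std/Std.npy")

# One motion clip in the 272-dim representation: (num_frames, 272).
motion = np.load(f"{root}/motion_data/000000.npy")
print("motion shape:", motion.shape)

# Z-score normalization, as typically applied before training.
normalized = (motion - mean) / (std + 1e-8)
```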

2. Download the processed 272-dim [BABEL](https://babel.is.tue.mpg.de/) dataset as follows:
```bash
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL --local-dir ./babel_272
cd ./babel_272
unzip texts.zip
unzip motion_data.zip
```
The dataset is organized as:
```
./babel_272
  ├── t2m_babel_mean_std
      ├── Mean.npy
      ├── Std.npy
  ├── split
      ├── train.txt
      ├── val.txt
  ├── texts
      ├── 000000.txt
      ...
  ├── motion_data
      ├── 000000.npy
      ...
```

3. Download the processed streaming 272-dim [BABEL](https://babel.is.tue.mpg.de/) dataset as follows:
```bash
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL-stream --local-dir ./babel_272_stream
cd ./babel_272_stream
unzip train_stream.zip
unzip train_stream_text.zip
unzip val_stream.zip
unzip val_stream_text.zip
```
The dataset is organized as:
```
./babel_272_stream
  ├── train_stream
      ├── seq1.npy
      ...
  ├── train_stream_text
      ├── seq1.txt
      ...
  ├── val_stream
      ├── seq1.npy
      ...
  ├── val_stream_text
      ├── seq1.txt
      ...
```
> NOTE: We process the original BABEL dataset to support training of streaming motion generation. For example, if a motion sequence A is annotated as four subsequences (A1, A2, A3, A4) in the BABEL dataset, with text descriptions (A1_t, A2_t, A3_t, A4_t), then our BABEL-stream is constructed from overlapping pairs:

> seq1: (A1, A2) --- seq1_text: `A1_t*A2_t#A1_length`

> seq2: (A2, A3) --- seq2_text: `A2_t*A3_t#A2_length`

> seq3: (A3, A4) --- seq3_text: `A3_t*A4_t#A3_length`

> Here, `*` and `#` are separator symbols, and `A1_length` denotes the number of frames in subsequence A1.
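
For reference, a minimal sketch of how one of these text annotations could be split back into its parts, based only on the format described above (function and variable names are illustrative, not part of the repo):
```python
def parse_stream_text(annotation: str):
    """Split 'A1_t*A2_t#A1_length' into (first_text, second_text, first_length)."""
    first_text, rest = annotation.split("*", 1)      # '*' separates the two descriptions
    second_text, first_length = rest.rsplit("#", 1)  # '#' separates off the frame count
    return first_text, second_text, int(first_length)

# Example with a made-up annotation pair:
t1, t2, n = parse_stream_text("a person walks forward*a person turns around#120")
print(t1, "|", t2, "|", n)  # a person walks forward | a person turns around | 120
```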

## 🚀 Training
1. Train our [TMR](https://github.com/Mathux/TMR)-based motion evaluator on the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset:
    ```bash
    bash TRAIN_evaluator_272.sh
    ```
    > After training for 100 epochs, the checkpoint will be stored at:
    ``Evaluator_272/experiments/temos/EXP1/checkpoints/``.

    ⬇️ We provide the evaluator checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Evaluator_272); download it as follows:
    ```bash
    python humanml3d_272/prepare/download_evaluator_ckpt.py
    ```
    > The downloaded checkpoint will be stored at: ``Evaluator_272/``.
2. Train the Causal TAE:
    ```bash
    bash TRAIN_causal_TAE.sh ${NUM_GPUS}
    ```
    > e.g., if you have 8 GPUs, run: `bash TRAIN_causal_TAE.sh 8`

    > The checkpoint will be stored at:
    ``Experiments/causal_TAE_t2m_272/``

    > TensorBoard visualization:
    ```bash
    tensorboard --logdir='Experiments/causal_TAE_t2m_272'
    ```

    ⬇️ We provide the Causal TAE checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Causal_TAE); download it as follows:
    ```bash
    python humanml3d_272/prepare/download_Causal_TAE_t2m_272_ckpt.py
    ```

3. Train the text-to-motion model:
    > We provide scripts to train the original text-to-motion generation model with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by the Causal TAE trained in the first stage.
    
    3.1 Get motion latents:
   ```bash
   python get_latent.py --resume-pth Causal_TAE/net_last.pth --latent_dir humanml3d_272/t2m_latents
   ```
    3.2 Download the [sentence-T5-XXL model](https://huggingface.co/sentence-transformers/sentence-t5-xxl/tree/main) from Hugging Face:
   ```bash
   huggingface-cli download --resume-download sentence-transformers/sentence-t5-xxl --local-dir sentencet5-xxl/
   ```
    3.3 Train the text-to-motion generation model:
   ```bash
   bash TRAIN_t2m.sh ${NUM_GPUS}
   ```
    > e.g., if you have 8 GPUs, run: `bash TRAIN_t2m.sh 8`

    > The checkpoint will be stored at:
    ``Experiments/t2m_model/``

    > TensorBoard visualization:
    ```bash
    tensorboard --logdir='Experiments/t2m_model'
    ```

    ⬇️ We provide the text-to-motion model checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Experiments/t2m_model); download it as follows:
    ```bash
    python humanml3d_272/prepare/download_t2m_model_ckpt.py
    ```

4. Train the streaming motion generation model (MotionStreamer):
    > We provide scripts to train the streaming motion generation model (MotionStreamer) with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by a Causal TAE. Note that this stage requires a new Causal TAE trained on both HumanML3D-272 and BABEL-272 data.
    
    4.1 Train a Causal TAE using both HumanML3D-272 and BABEL-272 data:
    ```bash
    bash TRAIN_causal_TAE.sh ${NUM_GPUS} t2m_babel_272
    ```
    > e.g., if you have 8 GPUs, run: `bash TRAIN_causal_TAE.sh 8 t2m_babel_272`

    > The checkpoint will be stored at:
    ``Experiments/causal_TAE_t2m_babel_272/``

    > TensorBoard visualization:
    ```bash
    tensorboard --logdir='Experiments/causal_TAE_t2m_babel_272'
    ```

    ⬇️ We provide the Causal TAE checkpoint trained on both HumanML3D-272 and BABEL-272 data on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Causal_TAE_t2m_babel); download it as follows:
    ```bash
    python humanml3d_272/prepare/download_Causal_TAE_t2m_babel_272_ckpt.py
    ```

    4.2 Get motion latents of both HumanML3D-272 and the processed BABEL-272-stream dataset:
   ```bash
   python get_latent.py --resume-pth Causal_TAE_t2m_babel/net_last.pth --latent_dir babel_272_stream/t2m_babel_latents --dataname t2m_babel_272
   ``` 

    4.3 Train MotionStreamer model:
   ```bash
   bash TRAIN_motionstreamer.sh ${NUM_GPUS}
   ```
   > e.g., if you have 8 GPUs, run: `bash TRAIN_motionstreamer.sh 8`

   > The checkpoint will be stored at:
    ``Experiments/motionstreamer_model/``

    > TensorBoard visualization:
    ```bash
    tensorboard --logdir='Experiments/motionstreamer_model'
    ```

## 📍 Evaluation

1. Evaluate the metrics of the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset:
    ```bash
    bash EVAL_GT.sh
    ```
    (FID, R@1, R@2, R@3, Diversity, and MM-Dist (Matching Score) are reported.)

2. Evaluate the metrics of Causal TAE:
    ```bash
    bash EVAL_causal_TAE.sh
    ```
    (FID and MPJPE (mm) are reported; see the sketch below.)
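
    > For reference, MPJPE (Mean Per-Joint Position Error) is the average Euclidean distance between reconstructed and ground-truth joint positions. A minimal numpy sketch (illustrative only; the meters-to-millimeters scaling assumes joints are stored in meters, which may differ from the repo's internal convention):
    ```python
    import numpy as np

    def mpjpe_mm(pred: np.ndarray, gt: np.ndarray) -> float:
        """Mean Per-Joint Position Error in millimeters.

        pred, gt: joint positions of shape (num_frames, num_joints, 3), in meters.
        """
        assert pred.shape == gt.shape
        # Euclidean distance per joint per frame, averaged over all of them.
        per_joint_error = np.linalg.norm(pred - gt, axis=-1)
        return float(per_joint_error.mean() * 1000.0)  # meters -> millimeters
    ```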

3. Evaluate the metrics of the text-to-motion model:
    ```bash
    bash EVAL_t2m.sh
    ```
    (FID, R@1, R@2, R@3, Diversity, and MM-Dist (Matching Score) are reported.)


## 🎬 Demo Inference

1. Inference of the text-to-motion model:
    > [Option 1] Recover from joint positions
    ```bash
    python demo_t2m.py --text 'a person is walking like a mummy.' --mode pos --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth
    ```
    > [Option 2] Recover from joint rotations
    ```bash
    python demo_t2m.py --text 'a person is walking like a mummy.' --mode rot --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth
    ```
    > In our 272-dim representation, Inverse Kinematics (IK) is not needed.
    > For further conversion to BVH format, please refer to [this repo](https://github.com/Li-xingXiao/272-dim-Motion-Representation?tab=readme-ov-file#6-representation_272-to-bvh-conversion-optional) (Step 6: Representation_272 to BVH conversion). The resulting BVH animation can be visualized and edited in [Blender](https://www.blender.org/features/animation/). A batch-inference sketch follows below.
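
    > To run the demo over several prompts in one go, here is a small driver sketch (the prompt list is made up; checkpoints and flags match the commands above):
    ```python
    import subprocess

    prompts = [
        "a person is walking like a mummy.",
        "a person walks forward and then sits down.",
    ]

    for text in prompts:
        # Same invocation as Option 1 above; switch --mode to 'rot' to
        # recover the motion from joint rotations instead of positions.
        subprocess.run(
            [
                "python", "demo_t2m.py",
                "--text", text,
                "--mode", "pos",
                "--resume-pth", "Causal_TAE/net_last.pth",
                "--resume-trans", "Experiments/t2m_model/latest.pth",
            ],
            check=True,
        )
    ```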




## 🌹 Acknowledgement
This repository builds upon the following awesome datasets and projects:
- [272-dim-Motion-Representation](https://github.com/Li-xingXiao/272-dim-Motion-Representation)
- [AMASS](https://amass.is.tue.mpg.de/index.html)
- [HumanML3D](https://github.com/EricGuo5513/HumanML3D)
- [T2M-GPT](https://github.com/Mael-zys/T2M-GPT)
- [TMR](https://github.com/Mathux/TMR)
- [OpenTMA](https://github.com/LinghaoChan/OpenTMA)
- [Sigma-VAE](https://github.com/orybkin/sigma-vae-pytorch)
- [Scamo](https://github.com/shunlinlu/ScaMo_code)

## 🤝🏼 Citation
If our project is helpful for your research, please consider citing:
```
@article{xiao2025motionstreamer,
  title={MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
  author={Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
  journal={arXiv preprint arXiv:2503.15451},
  year={2025}
}
```

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=zju3dv/MotionStreamer&type=Date)](https://www.star-history.com/#zju3dv/MotionStreamer&Date)