<h2 align="center"<strong>MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space</strong></h2>
  <p align="center">
    <a href='https://li-xingxiao.github.io/homepage/' target='_blank'>Lixing Xiao</a><sup>1</sup>
    ·
    <a href='https://shunlinlu.github.io/' target='_blank'>Shunlin Lu</a> <sup>2</sup>
    ·
    <a href='https://phj128.github.io/' target='_blank'>Huaijin Pi</a><sup>3</sup>
    ·
    <a href='https://vankouf.github.io/' target='_blank'>Ke Fan</a><sup>4</sup>
    ·
    <a href='https://liangpan99.github.io/' target='_blank'>Liang Pan</a><sup>3</sup>
    ·
    <a href='mailto:yueezhou7@gmail.com' target='_blank'>Yueer Zhou</a><sup>1</sup>
    ·
    <a href='https://dblp.org/pid/120/4362.html/' target='_blank'>Ziyong Feng</a><sup>5</sup>
    ·
    <br>
    <a href='https://www.xzhou.me/' target='_blank'>Xiaowei Zhou</a><sup>1</sup>
    ·
    <a href='https://pengsida.net/' target='_blank'>Sida Peng</a><sup>1†</sup>
    ·
     <a href='https://wangjingbo1219.github.io/' target='_blank'>Jingbo Wang</a><sup>6</sup>
    <br>
    <br>
    <sup>1</sup>Zhejiang University  <sup>2</sup>The Chinese University of Hong Kong, Shenzhen  <sup>3</sup>The University of Hong Kong  <br><sup>4</sup>Shanghai Jiao Tong University  <sup>5</sup>DeepGlint  <sup>6</sup>Shanghai AI Lab
    <br>
    <strong>ICCV 2025</strong>
    
  </p>
<p align="center">
  <a href='https://arxiv.org/abs/2503.15451'>
    <img src='https://img.shields.io/badge/Arxiv-2503.15451-A42C25?style=flat&logo=arXiv&logoColor=A42C25'></a>
  <a href='https://arxiv.org/pdf/2503.15451'>
    <img src='https://img.shields.io/badge/Paper-PDF-blue?style=flat&logo=arXiv&logoColor=blue'></a>
  <a href='https://zju3dv.github.io/MotionStreamer/'>
    <img src='https://img.shields.io/badge/Project-Page-green?style=flat&logo=Google%20chrome&logoColor=green'></a>
  <a href='https://huggingface.co/datasets/lxxiao/272-dim-HumanML3D'>
    <img src='https://img.shields.io/badge/Data-Download-yellow?style=flat&logo=huggingface&logoColor=yellow'></a>
</p>

<img width="1385" alt="image" src="assets/teaser.jpg"/>

## 🔥 News

- **[2025-06]** MotionStreamer has been accepted to ICCV 2025! 🎉
  
## TODO List

- [x] Release the processing script for the 272-dim motion representation.
- [x] Release the processed 272-dim motion representation of the [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset (for academic use only).
- [x] Release the training code and checkpoint of our [TMR](https://github.com/Mathux/TMR)-based motion evaluator trained on the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset.
- [x] Release the training and evaluation code as well as the checkpoint of the Causal TAE.
- [x] Release the training code of the original motion generation model and the streaming generation model (MotionStreamer).
- [x] Release the checkpoint and demo inference code of the original motion generation model.
- [ ] Release the complete code for MotionStreamer.

## 🏃 Motion Representation
For details on how to obtain the 272-dim motion representation, as well as other useful tools (e.g., visualization and conversion to BVH format), please refer to our [GitHub repo](https://github.com/Li-xingXiao/272-dim-Motion-Representation).

## Installation

### 🐍 Python Virtual Environment
```sh
conda env create -f environment.yaml
conda activate mgpt
```

### 🤗 Hugging Face Mirror
All of our models and data are hosted on Hugging Face. If Hugging Face is not directly accessible from your network, you can route downloads through the HF-Mirror endpoint:
```sh
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
```
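
If you prefer fetching from Python rather than the CLI, here is a minimal sketch using `huggingface_hub` (the repo ID is the HumanML3D dataset repo used in the next section; adapt it to whichever model or dataset you need):
```python
import os

# Route all huggingface_hub traffic through the HF-Mirror endpoint;
# this must be set before the first download call.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# Example: download the processed 272-dim HumanML3D dataset.
snapshot_download(
    repo_id="lxxiao/272-dim-HumanML3D",
    repo_type="dataset",
    local_dir="./humanml3d_272",
)
```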

## 📥 Data Preparation
To facilitate researchers, we provide the processed 272-dim motion representation of:
> HumanML3D dataset at [this link](https://huggingface.co/datasets/lxxiao/272-dim-HumanML3D).

> BABEL dataset at [this link](https://huggingface.co/datasets/lxxiao/272-dim-BABEL).

❗️❗️❗️ The processed data is for academic purposes only. Make sure you read through the [AMASS License](https://amass.is.tue.mpg.de/license.html) before use.

1. Download the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset as follows:
```bash
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-HumanML3D --local-dir ./humanml3d_272
cd ./humanml3d_272
unzip texts.zip
unzip motion_data.zip
```
The dataset is organized as:
```
./humanml3d_272
  ├── mean_std
      ├── Mean.npy
      ├── Std.npy
  ├── split
      ├── train.txt
      ├── val.txt
      ├── test.txt
  ├── texts
      ├── 000000.txt
      ...
  ├── motion_data
      ├── 000000.npy
      ...
```
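
As a quick sanity check after unzipping, the sketch below loads one clip and applies the dataset's Mean/Std statistics. This mirrors standard z-score normalization for this representation, but it is an illustrative assumption, not the repo's exact data-loading code:
```python
import numpy as np

root = "./humanml3d_272"

# Per-dimension statistics shipped with the dataset (see the tree above).
mean = np.load(f"{root}/mean_std/Mean.npy")
std = np.load(f"{root}/mean_std/Std.npy")

# One motion clip in the 272-dim representation: (num_frames, 272).
motion = np.load(f"{root}/motion_data/000000.npy")
print("motion shape:", motion.shape)

# Z-score normalization, as typically applied before training.
normalized = (motion - mean) / (std + 1e-8)
```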

2. Download the processed 272-dim [BABEL](https://babel.is.tue.mpg.de/) dataset as follows:
```bash
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL --local-dir ./babel_272
cd ./babel_272
unzip texts.zip
unzip motion_data.zip
```
The dataset is organized as:
```
./babel_272
  ├── t2m_babel_mean_std
      ├── Mean.npy
      ├── Std.npy
  ├── split
      ├── train.txt
      ├── val.txt
  ├── texts
      ├── 000000.txt
      ...
  ├── motion_data
      ├── 000000.npy
      ...
```

3. Download the processed streaming 272-dim [BABEL](https://babel.is.tue.mpg.de/) dataset as follows:
```bash
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL-stream --local-dir ./babel_272_stream
cd ./babel_272_stream
unzip train_stream.zip
unzip train_stream_text.zip
unzip val_stream.zip
unzip val_stream_text.zip
```
The dataset is organized as:
```
./babel_272_stream
  ├── train_stream
      ├── seq1.npy
      ...
  ├── train_stream_text
      ├── seq1.txt
      ...
  ├── val_stream
      ├── seq1.npy
      ...
  ├── val_stream_text
      ├── seq1.txt
      ...
```
> NOTE: We process the original BABEL dataset to support training of streaming motion generation. For example, if a motion sequence A is annotated as four subsequences (A1, A2, A3, A4) in the BABEL dataset, with text descriptions (A1_t, A2_t, A3_t, A4_t), then our BABEL-stream is constructed from overlapping pairs:

> seq1: (A1, A2) --- seq1_text: `A1_t*A2_t#A1_length`

> seq2: (A2, A3) --- seq2_text: `A2_t*A3_t#A2_length`

> seq3: (A3, A4) --- seq3_text: `A3_t*A4_t#A3_length`

> Here, `*` and `#` are separator symbols, and `A1_length` denotes the number of frames in subsequence A1.
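
For reference, a minimal sketch of how one of these text annotations could be split back into its parts, based only on the format described above (function and variable names are illustrative, not part of the repo):
```python
def parse_stream_text(annotation: str):
    """Split 'A1_t*A2_t#A1_length' into (first_text, second_text, first_length)."""
    first_text, rest = annotation.split("*", 1)      # '*' separates the two descriptions
    second_text, first_length = rest.rsplit("#", 1)  # '#' separates off the frame count
    return first_text, second_text, int(first_length)

# Example with a made-up annotation pair:
t1, t2, n = parse_stream_text("a person walks forward*a person turns around#120")
print(t1, "|", t2, "|", n)  # a person walks forward | a person turns around | 120
```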

## 🚀 Training
1. Train our [TMR](https://github.com/Mathux/TMR)-based motion evaluator on the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset:
    ```bash
    bash TRAIN_evaluator_272.sh
    ```
    > After training for 100 epochs, the checkpoint will be stored at:
    ``Evaluator_272/experiments/temos/EXP1/checkpoints/``.

    ⬇️ We provide the evaluator checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Evaluator_272); download it as follows:
    ```bash
    python humanml3d_272/prepare/download_evaluator_ckpt.py
    ```
    > The downloaded checkpoint will be stored at: ``Evaluator_272/``.
2. Train the Causal TAE:
    ```bash
    bash TRAIN_causal_TAE.sh ${NUM_GPUS}
    ```
    > e.g., if you have 8 GPUs, run: `bash TRAIN_causal_TAE.sh 8`

    > The checkpoint will be stored at:
    ``Experiments/causal_TAE_t2m_272/``

    > TensorBoard visualization:
    ```bash
    tensorboard --logdir='Experiments/causal_TAE_t2m_272'
    ```

    ⬇️ We provide the Causal TAE checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Causal_TAE); download it as follows:
    ```bash
    python humanml3d_272/prepare/download_Causal_TAE_t2m_272_ckpt.py
    ```

3. Train the text-to-motion model:
    > We provide scripts to train the original text-to-motion generation model with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by the Causal TAE trained in the first stage.
    
    3.1 Get motion latents:
   ```bash
   python get_latent.py --resume-pth Causal_TAE/net_last.pth --latent_dir humanml3d_272/t2m_latents
   ```
    3.2 Download the [sentence-T5-XXL model](https://huggingface.co/sentence-transformers/sentence-t5-xxl/tree/main) from Hugging Face:
   ```bash
   huggingface-cli download --resume-download sentence-transformers/sentence-t5-xxl --local-dir sentencet5-xxl/
   ```
    3.3 Train the text-to-motion generation model:
   ```bash
   bash TRAIN_t2m.sh ${NUM_GPUS}
   ```
    > e.g., if you have 8 GPUs, run: `bash TRAIN_t2m.sh 8`

    > The checkpoint will be stored at:
    ``Experiments/t2m_model/``

    > TensorBoard visualization:
    ```bash
    tensorboard --logdir='Experiments/t2m_model'
    ```

    ⬇️ We provide the text-to-motion model checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Experiments/t2m_model); download it as follows:
    ```bash
    python humanml3d_272/prepare/download_t2m_model_ckpt.py
    ```

4. Train the streaming motion generation model (MotionStreamer):
    > We provide scripts to train the streaming motion generation model (MotionStreamer) with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by a Causal TAE. Note that this stage requires a new Causal TAE trained on both HumanML3D-272 and BABEL-272 data.
    
    4.1 Train a Causal TAE using both HumanML3D-272 and BABEL-272 data:
    ```bash
    bash TRAIN_causal_TAE.sh ${NUM_GPUS} t2m_babel_272
    ```
    > e.g., if you have 8 GPUs, run: `bash TRAIN_causal_TAE.sh 8 t2m_babel_272`

    > The checkpoint will be stored at:
    ``Experiments/causal_TAE_t2m_babel_272/``

    > TensorBoard visualization:
    ```bash
    tensorboard --logdir='Experiments/causal_TAE_t2m_babel_272'
    ```

    ⬇️ We provide the Causal TAE checkpoint trained on both HumanML3D-272 and BABEL-272 data on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Causal_TAE_t2m_babel); download it as follows:
    ```bash
    python humanml3d_272/prepare/download_Causal_TAE_t2m_babel_272_ckpt.py
    ```

    4.2 Get motion latents of both HumanML3D-272 and the processed BABEL-272-stream dataset:
   ```bash
   python get_latent.py --resume-pth Causal_TAE_t2m_babel/net_last.pth --latent_dir babel_272_stream/t2m_babel_latents --dataname t2m_babel_272
   ``` 

    4.3 Train MotionStreamer model:
   ```bash
   bash TRAIN_motionstreamer.sh ${NUM_GPUS}
   ```
   > e.g., if you have 8 GPUs, run: `bash TRAIN_motionstreamer.sh 8`

   > The checkpoint will be stored at:
    ``Experiments/motionstreamer_model/``

    > TensorBoard visualization:
    ```bash
    tensorboard --logdir='Experiments/motionstreamer_model'
    ```

## 📍 Evaluation

1. Evaluate the metrics of the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset:
    ```bash
    bash EVAL_GT.sh
    ```
    (FID, R@1, R@2, R@3, Diversity, and MM-Dist (Matching Score) are reported.)

2. Evaluate the metrics of Causal TAE:
    ```bash
    bash EVAL_causal_TAE.sh
    ```
    (FID and MPJPE (mm) are reported; see the sketch below.)
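
    > For reference, MPJPE (Mean Per-Joint Position Error) is the average Euclidean distance between reconstructed and ground-truth joint positions. A minimal numpy sketch (illustrative only; the meters-to-millimeters scaling assumes joints are stored in meters, which may differ from the repo's internal convention):
    ```python
    import numpy as np

    def mpjpe_mm(pred: np.ndarray, gt: np.ndarray) -> float:
        """Mean Per-Joint Position Error in millimeters.

        pred, gt: joint positions of shape (num_frames, num_joints, 3), in meters.
        """
        assert pred.shape == gt.shape
        # Euclidean distance per joint per frame, averaged over all of them.
        per_joint_error = np.linalg.norm(pred - gt, axis=-1)
        return float(per_joint_error.mean() * 1000.0)  # meters -> millimeters
    ```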

3. Evaluate the metrics of the text-to-motion model:
    ```bash
    bash EVAL_t2m.sh
    ```
    (FID, R@1, R@2, R@3, Diversity, and MM-Dist (Matching Score) are reported.)


## 🎬 Demo Inference

1. Inference of the text-to-motion model:
    > [Option 1] Recover from joint positions
    ```bash
    python demo_t2m.py --text 'a person is walking like a mummy.' --mode pos --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth
    ```
    > [Option 2] Recover from joint rotations
    ```bash
    python demo_t2m.py --text 'a person is walking like a mummy.' --mode rot --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth
    ```
    > In our 272-dim representation, Inverse Kinematics (IK) is not needed.
    > For further conversion to BVH format, please refer to [this repo](https://github.com/Li-xingXiao/272-dim-Motion-Representation?tab=readme-ov-file#6-representation_272-to-bvh-conversion-optional) (Step 6: Representation_272 to BVH conversion). The resulting BVH animation can be visualized and edited in [Blender](https://www.blender.org/features/animation/). A batch-inference sketch follows below.
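
    > To run the demo over several prompts in one go, here is a small driver sketch (the prompt list is made up; checkpoints and flags match the commands above):
    ```python
    import subprocess

    prompts = [
        "a person is walking like a mummy.",
        "a person walks forward and then sits down.",
    ]

    for text in prompts:
        # Same invocation as Option 1 above; switch --mode to 'rot' to
        # recover the motion from joint rotations instead of positions.
        subprocess.run(
            [
                "python", "demo_t2m.py",
                "--text", text,
                "--mode", "pos",
                "--resume-pth", "Causal_TAE/net_last.pth",
                "--resume-trans", "Experiments/t2m_model/latest.pth",
            ],
            check=True,
        )
    ```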




## 🌹 Acknowledgement
This repository builds upon the following awesome datasets and projects:
- [272-dim-Motion-Representation](https://github.com/Li-xingXiao/272-dim-Motion-Representation)
- [AMASS](https://amass.is.tue.mpg.de/index.html)
- [HumanML3D](https://github.com/EricGuo5513/HumanML3D)
- [T2M-GPT](https://github.com/Mael-zys/T2M-GPT)
- [TMR](https://github.com/Mathux/TMR)
- [OpenTMA](https://github.com/LinghaoChan/OpenTMA)
- [Sigma-VAE](https://github.com/orybkin/sigma-vae-pytorch)
- [Scamo](https://github.com/shunlinlu/ScaMo_code)

## 🤝🏼 Citation
If our project is helpful for your research, please consider citing:
```
@article{xiao2025motionstreamer,
  title={MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
  author={Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
  journal={arXiv preprint arXiv:2503.15451},
  year={2025}
}
```

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=zju3dv/MotionStreamer&type=Date)](https://www.star-history.com/#zju3dv/MotionStreamer&Date)