| <h2 align="center"><strong>MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space</strong></h2> | |
| <p align="center"> | |
| <a href='https://li-xingxiao.github.io/homepage/' target='_blank'>Lixing Xiao</a><sup>1</sup> | |
| · | |
| <a href='https://shunlinlu.github.io/' target='_blank'>Shunlin Lu</a> <sup>2</sup> | |
| · | |
| <a href='https://phj128.github.io/' target='_blank'>Huaijin Pi</a><sup>3</sup> | |
| · | |
| <a href='https://vankouf.github.io/' target='_blank'>Ke Fan</a><sup>4</sup> | |
| · | |
| <a href='https://liangpan99.github.io/' target='_blank'>Liang Pan</a><sup>3</sup> | |
| · | |
| <a href='mailto:yueezhou7@gmail.com' target='_blank'>Yueer Zhou</a><sup>1</sup> | |
| · | |
| <a href='https://dblp.org/pid/120/4362.html/' target='_blank'>Ziyong Feng</a><sup>5</sup> | |
| · | |
| <br> | |
| <a href='https://www.xzhou.me/' target='_blank'>Xiaowei Zhou</a><sup>1</sup> | |
| · | |
| <a href='https://pengsida.net/' target='_blank'>Sida Peng</a><sup>1†</sup> | |
| · | |
| <a href='https://wangjingbo1219.github.io/' target='_blank'>Jingbo Wang</a><sup>6</sup> | |
| <br> | |
| <br> | |
| <sup>1</sup>Zhejiang University <sup>2</sup>The Chinese University of Hong Kong, Shenzhen <sup>3</sup>The University of Hong Kong <br><sup>4</sup>Shanghai Jiao Tong University <sup>5</sup>DeepGlint <sup>6</sup>Shanghai AI Lab | |
| <br> | |
| <strong>ICCV 2025</strong> | |
| </p> | |
| <p align="center"> | |
| <a href='https://arxiv.org/abs/2503.15451'> | |
| <img src='https://img.shields.io/badge/Arxiv-2503.15451-A42C25?style=flat&logo=arXiv&logoColor=A42C25'></a> | |
| <a href='https://arxiv.org/pdf/2503.15451'> | |
| <img src='https://img.shields.io/badge/Paper-PDF-blue?style=flat&logo=arXiv&logoColor=blue'></a> | |
| <a href='https://zju3dv.github.io/MotionStreamer/'> | |
| <img src='https://img.shields.io/badge/Project-Page-green?style=flat&logo=Google%20chrome&logoColor=green'></a> | |
| <a href='https://huggingface.co/datasets/lxxiao/272-dim-HumanML3D'> | |
| <img src='https://img.shields.io/badge/Data-Download-yellow?style=flat&logo=huggingface&logoColor=yellow'></a> | |
| </p> | |
| <img width="1385" alt="image" src="assets/teaser.jpg"/> | |
| ## 🔥 News | |
| - **[2025-06]** MotionStreamer has been accepted to ICCV 2025! 🎉 | |
| ## TODO List | |
| - [x] Release the processing script of 272-dim motion representation. | |
| - [x] Release the processed 272-dim motion representation of the [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset. For academic use only. | |
| - [x] Release the training code and checkpoint of our [TMR](https://github.com/Mathux/TMR)-based motion evaluator trained on the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset. | |
| - [x] Release the training and evaluation code as well as checkpoint of Causal TAE. | |
| - [x] Release the training code of original motion generation model and streaming generation model (MotionStreamer). | |
| - [x] Release the checkpoint and demo inference code of original motion generation model. | |
| - [ ] Release complete code for MotionStreamer. | |
| ## 🏃 Motion Representation | |
| For details on how to obtain the 272-dim motion representation, along with other useful tools (e.g., visualization and conversion to BVH format), please refer to our [GitHub repo](https://github.com/Li-xingXiao/272-dim-Motion-Representation). | |
| ## Installation | |
| ### 🐍 Python Virtual Environment | |
| ```sh | |
| conda env create -f environment.yaml | |
| conda activate mgpt | |
| ``` | |
| ### 🤗 Hugging Face Mirror | |
| All of our models and data are hosted on Hugging Face. If Hugging Face is not directly accessible from your network, you can use the HF-Mirror endpoint as follows: | |
| ```sh | |
| pip install -U huggingface_hub | |
| export HF_ENDPOINT=https://hf-mirror.com | |
| ``` | |
| ## 📥 Data Preparation | |
| For convenience, we provide the processed 272-dim motion representation of the following datasets: | |
| > - HumanML3D dataset at [this link](https://huggingface.co/datasets/lxxiao/272-dim-HumanML3D). | |
| > - BABEL dataset at [this link](https://huggingface.co/datasets/lxxiao/272-dim-BABEL). | |
| ❗️❗️❗️ The processed data is solely for academic purposes. Make sure you read through the [AMASS License](https://amass.is.tue.mpg.de/license.html). | |
| 1. Download the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset as follows: | |
| ```bash | |
| huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-HumanML3D --local-dir ./humanml3d_272 | |
| cd ./humanml3d_272 | |
| unzip texts.zip | |
| unzip motion_data.zip | |
| ``` | |
| The dataset is organized as: | |
| ``` | |
| ./humanml3d_272 | |
| ├── mean_std | |
| │   ├── Mean.npy | |
| │   └── Std.npy | |
| ├── split | |
| │   ├── train.txt | |
| │   ├── val.txt | |
| │   └── test.txt | |
| ├── texts | |
| │   ├── 000000.txt | |
| │   ... | |
| └── motion_data | |
|     ├── 000000.npy | |
|     ... | |
| ``` | |
| 2. Download the processed 272-dim [BABEL](https://babel.is.tue.mpg.de/) dataset as follows: | |
| ```bash | |
| huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL --local-dir ./babel_272 | |
| cd ./babel_272 | |
| unzip texts.zip | |
| unzip motion_data.zip | |
| ``` | |
| The dataset is organized as: | |
| ``` | |
| ./babel_272 | |
| ├── t2m_babel_mean_std | |
| │   ├── Mean.npy | |
| │   └── Std.npy | |
| ├── split | |
| │   ├── train.txt | |
| │   └── val.txt | |
| ├── texts | |
| │   ├── 000000.txt | |
| │   ... | |
| └── motion_data | |
|     ├── 000000.npy | |
|     ... | |
| ``` | |
| 3. Download the processed streaming 272-dim [BABEL](https://babel.is.tue.mpg.de/) dataset as follows: | |
| ```bash | |
| huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL-stream --local-dir ./babel_272_stream | |
| cd ./babel_272_stream | |
| unzip train_stream.zip | |
| unzip train_stream_text.zip | |
| unzip val_stream.zip | |
| unzip val_stream_text.zip | |
| ``` | |
| The dataset is organized as: | |
| ``` | |
| ./babel_272_stream | |
| ├── train_stream | |
| │   ├── seq1.npy | |
| │   ... | |
| ├── train_stream_text | |
| │   ├── seq1.txt | |
| │   ... | |
| ├── val_stream | |
| │   ├── seq1.npy | |
| │   ... | |
| └── val_stream_text | |
|     ├── seq1.txt | |
|     ... | |
| ``` | |
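To illustrate how the files in the HumanML3D-272 layout from step 1 fit together, here is a minimal sketch that reads a split file and z-normalizes one clip with the provided statistics. It assumes each `motion_data/*.npy` file holds an array of shape `(num_frames, 272)` and that `Mean.npy` and `Std.npy` hold per-dimension statistics of shape `(272,)`. The helper names `load_normalized_motion` and `read_split` and the `eps` guard are hypothetical and not part of this repository; the actual data loaders may differ.

```python
from pathlib import Path

import numpy as np


def load_normalized_motion(root, motion_id, eps=1e-8):
    """Load one clip and z-normalize it (hypothetical helper, not from the repo)."""
    root = Path(root)
    mean = np.load(root / "mean_std" / "Mean.npy")  # assumed shape: (272,)
    std = np.load(root / "mean_std" / "Std.npy")    # assumed shape: (272,)
    motion = np.load(root / "motion_data" / f"{motion_id}.npy")  # assumed (T, 272)
    # Broadcasting applies the per-dimension statistics to every frame.
    return (motion - mean) / (std + eps)


def read_split(root, split="train"):
    """Return the motion IDs listed in split/<split>.txt, assuming one ID per line."""
    lines = (Path(root) / "split" / f"{split}.txt").read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]
```

For example, `load_normalized_motion("./humanml3d_272", "000000")` would return the normalized clip stored in `motion_data/000000.npy`, provided the shape assumptions above hold.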
| > NOTE: We process the original BABEL dataset to support training of streaming motion generation. For example, suppose a motion sequence A is annotated in BABEL as subsequences (A1, A2, A3, A4), with text descriptions (A1_t, A2_t, A3_t, A4_t). | |
| > BABEL-stream is then constructed as: | |
| > seq1: (A1, A2) --- seq1_text: (A1_t*A2_t#A1_length) | |
| > seq2: (A2, A3) --- seq2_text: (A2_t*A3_t#A2_length) | |
| > seq3: (A3, A4) --- seq3_text: (A3_t*A4_t#A3_length) | |
| > Here, `*` and `#` are separator symbols, and A1_length is the number of frames in subsequence A1. | |
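The annotation format described in the note can be parsed with a few lines of Python. The sketch below is illustrative only: `parse_stream_text` and `StreamAnnotation` are hypothetical names, not part of this repository, and the code assumes each text entry has exactly the form `first_text*second_text#length`, with no `*` inside the first caption and no `#` after the length.

```python
from typing import NamedTuple


class StreamAnnotation(NamedTuple):
    first_text: str
    second_text: str
    first_length: int  # number of frames in the first subsequence


def parse_stream_text(line: str) -> StreamAnnotation:
    """Parse 'A1_t*A2_t#A1_length' (hypothetical helper, not from the repo)."""
    # Split on the last '#' so that only the trailing frame count is cut off.
    texts, sep, length = line.strip().rpartition("#")
    if not sep:
        raise ValueError(f"missing '#': {line!r}")
    # Split on the first '*' to separate the two captions.
    first, sep, second = texts.partition("*")
    if not sep:
        raise ValueError(f"missing '*': {line!r}")
    return StreamAnnotation(first.strip(), second.strip(), int(length))


ann = parse_stream_text("a person walks forward*a person sits down#120")
print(ann.first_text, "|", ann.second_text, "|", ann.first_length)
```

Given the construction above, the first `first_length` frames of the paired `.npy` sequence would presumably correspond to the first caption and the remaining frames to the second.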
| ## 🚀 Training | |
| 1. Train our [TMR](https://github.com/Mathux/TMR)-based motion evaluator on the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset: | |
| ```bash | |
| bash TRAIN_evaluator_272.sh | |
| ``` | |
| >After training for 100 epochs, the checkpoint will be stored at: | |
| ``Evaluator_272/experiments/temos/EXP1/checkpoints/``. | |
| ⬇️ We provide the evaluator checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Evaluator_272); download it as follows: | |
| ```bash | |
| python humanml3d_272/prepare/download_evaluator_ckpt.py | |
| ``` | |
| >The downloaded checkpoint will be stored at: ``Evaluator_272/``. | |
| 2. Train the Causal TAE: | |
| ```bash | |
| bash TRAIN_causal_TAE.sh ${NUM_GPUS} | |
| ``` | |
| > For example, with 8 GPUs, run: `bash TRAIN_causal_TAE.sh 8` | |
| > The checkpoint will be stored at: | |
| ``Experiments/causal_TAE_t2m_272/`` | |
| > Tensorboard visualization: | |
| ```bash | |
| tensorboard --logdir='Experiments/causal_TAE_t2m_272' | |
| ``` | |
| ⬇️ We provide the Causal TAE checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Causal_TAE); download it as follows: | |
| ```bash | |
| python humanml3d_272/prepare/download_Causal_TAE_t2m_272_ckpt.py | |
| ``` | |
| 3. Train the text-to-motion model: | |
| > We provide scripts to train the original text-to-motion generation model with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by the Causal TAE (the first training stage, Step 2 above). | |
| 3.1 Get motion latents: | |
| ```bash | |
| python get_latent.py --resume-pth Causal_TAE/net_last.pth --latent_dir humanml3d_272/t2m_latents | |
| ``` | |
| 3.2 Download [sentence-T5-XXL model](https://huggingface.co/sentence-transformers/sentence-t5-xxl/tree/main) on Hugging Face: | |
| ```bash | |
| huggingface-cli download --resume-download sentence-transformers/sentence-t5-xxl --local-dir sentencet5-xxl/ | |
| ``` | |
| 3.3 Train the text-to-motion generation model: | |
| ```bash | |
| bash TRAIN_t2m.sh ${NUM_GPUS} | |
| ``` | |
| > For example, with 8 GPUs, run: `bash TRAIN_t2m.sh 8` | |
| > The checkpoint will be stored at: | |
| ``Experiments/t2m_model/`` | |
| > Tensorboard visualization: | |
| ```bash | |
| tensorboard --logdir='Experiments/t2m_model' | |
| ``` | |
| ⬇️ We provide the text-to-motion model checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Experiments/t2m_model); download it as follows: | |
| ```bash | |
| python humanml3d_272/prepare/download_t2m_model_ckpt.py | |
| ``` | |
| 4. Train the streaming motion generation model (MotionStreamer): | |
| > We provide scripts to train the streaming motion generation model (MotionStreamer) with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by the Causal TAE. This requires training a new Causal TAE on both HumanML3D-272 and BABEL-272 data. | |
| 4.1 Train a Causal TAE using both HumanML3D-272 and BABEL-272 data: | |
| ```bash | |
| bash TRAIN_causal_TAE.sh ${NUM_GPUS} t2m_babel_272 | |
| ``` | |
| > For example, with 8 GPUs, run: `bash TRAIN_causal_TAE.sh 8 t2m_babel_272` | |
| > The checkpoint will be stored at: | |
| ``Experiments/causal_TAE_t2m_babel_272/`` | |
| > Tensorboard visualization: | |
| ```bash | |
| tensorboard --logdir='Experiments/causal_TAE_t2m_babel_272' | |
| ``` | |
| ⬇️ We provide the Causal TAE checkpoint trained on both HumanML3D-272 and BABEL-272 data on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Causal_TAE_t2m_babel); download it as follows: | |
| ```bash | |
| python humanml3d_272/prepare/download_Causal_TAE_t2m_babel_272_ckpt.py | |
| ``` | |
| 4.2 Get motion latents of both HumanML3D-272 and the processed BABEL-272-stream dataset: | |
| ```bash | |
| python get_latent.py --resume-pth Causal_TAE_t2m_babel/net_last.pth --latent_dir babel_272_stream/t2m_babel_latents --dataname t2m_babel_272 | |
| ``` | |
| 4.3 Train the MotionStreamer model: | |
| ```bash | |
| bash TRAIN_motionstreamer.sh ${NUM_GPUS} | |
| ``` | |
| > For example, with 8 GPUs, run: `bash TRAIN_motionstreamer.sh 8` | |
| > The checkpoint will be stored at: | |
| ``Experiments/motionstreamer_model/`` | |
| > Tensorboard visualization: | |
| ```bash | |
| tensorboard --logdir='Experiments/motionstreamer_model' | |
| ``` | |
| ## 📍 Evaluation | |
| 1. Evaluate the metrics of the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset: | |
| ```bash | |
| bash EVAL_GT.sh | |
| ``` | |
| ( FID, R@1, R@2, R@3, Diversity and MM-Dist (Matching Score) are reported. ) | |
| 2. Evaluate the metrics of Causal TAE: | |
| ```bash | |
| bash EVAL_causal_TAE.sh | |
| ``` | |
| ( FID and MPJPE (mm) are reported. ) | |
| 3. Evaluate the metrics of the text-to-motion model: | |
| ```bash | |
| bash EVAL_t2m.sh | |
| ``` | |
| ( FID, R@1, R@2, R@3, Diversity and MM-Dist (Matching Score) are reported. ) | |
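Of the metrics listed above, MPJPE (mean per-joint position error) is the simplest to state: it is the Euclidean distance between predicted and ground-truth joint positions, averaged over all joints and frames. The NumPy sketch below is only an illustration of that definition. It assumes joint positions have already been recovered as arrays of shape `(num_frames, num_joints, 3)` in meters; the function name `mpjpe_mm` and the meter-to-millimeter conversion are assumptions, and the repository's evaluation script may differ (for example, in alignment or units).

```python
import numpy as np


def mpjpe_mm(pred, gt):
    """Mean per-joint position error in millimeters (illustrative sketch).

    pred, gt: arrays of shape (num_frames, num_joints, 3), assumed to be in meters.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Euclidean distance per joint and frame, shape (num_frames, num_joints).
    per_joint_error = np.linalg.norm(pred - gt, axis=-1)
    # Average over joints and frames, then convert meters to millimeters.
    return float(per_joint_error.mean() * 1000.0)
```

With this definition, a reconstruction whose every joint is offset by 1 cm along a single axis would score 10 mm.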
| ## 🎬 Demo Inference | |
| 1. Run inference with the text-to-motion model: | |
| > [Option1] Recover from joint position | |
| ```bash | |
| python demo_t2m.py --text 'a person is walking like a mummy.' --mode pos --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth | |
| ``` | |
| > [Option2] Recover from joint rotation | |
| ```bash | |
| python demo_t2m.py --text 'a person is walking like a mummy.' --mode rot --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth | |
| ``` | |
| > In our 272-dim representation, Inverse Kinematics (IK) is not needed. | |
| > For further conversion to BVH format, please refer to [this repo](https://github.com/Li-xingXiao/272-dim-Motion-Representation?tab=readme-ov-file#6-representation_272-to-bvh-conversion-optional) (Step 6: Representation_272 to BVH conversion). The resulting BVH animation can be visualized and edited in [Blender](https://www.blender.org/features/animation/). | |
| ## 🌹 Acknowledgement | |
| This repository builds upon the following awesome datasets and projects: | |
| - [272-dim-Motion-Representation](https://github.com/Li-xingXiao/272-dim-Motion-Representation) | |
| - [AMASS](https://amass.is.tue.mpg.de/index.html) | |
| - [HumanML3D](https://github.com/EricGuo5513/HumanML3D) | |
| - [T2M-GPT](https://github.com/Mael-zys/T2M-GPT) | |
| - [TMR](https://github.com/Mathux/TMR) | |
| - [OpenTMA](https://github.com/LinghaoChan/OpenTMA) | |
| - [Sigma-VAE](https://github.com/orybkin/sigma-vae-pytorch) | |
| - [Scamo](https://github.com/shunlinlu/ScaMo_code) | |
| ## 🤝🏼 Citation | |
| If our project is helpful for your research, please consider citing: | |
| ``` | |
| @article{xiao2025motionstreamer, | |
| title={MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space}, | |
| author={Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo}, | |
| journal={arXiv preprint arXiv:2503.15451}, | |
| year={2025} | |
| } | |
| ``` | |
| ## Star History | |
| [Star History Chart](https://www.star-history.com/#zju3dv/MotionStreamer&Date) | |