motion-stream / README.md

Initial upload of MotionStreamer code, excluding large extracted data and output folders.

0e267a7 verified 3 months ago

13.5 kB


	<h2 align="center"<strong>MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space</strong></h2>
	<p align="center">
	<a href='https://li-xingxiao.github.io/homepage/' target='_blank'>Lixing Xiao</a><sup>1</sup>
	·
	<a href='https://shunlinlu.github.io/' target='_blank'>Shunlin Lu</a> <sup>2</sup>
	·
	<a href='https://phj128.github.io/' target='_blank'>Huaijin Pi</a><sup>3</sup>
	·
	<a href='https://vankouf.github.io/' target='_blank'>Ke Fan</a><sup>4</sup>
	·
	<a href='https://liangpan99.github.io/' target='_blank'>Liang Pan</a><sup>3</sup>
	·
	<a href='https://yueezhou7@gmail.com' target='_blank'>Yueer Zhou</a><sup>1</sup>
	·
	<a href='https://dblp.org/pid/120/4362.html/' target='_blank'>Ziyong Feng</a><sup>5</sup>
	·
	<br>
	<a href='https://www.xzhou.me/' target='_blank'>Xiaowei Zhou</a><sup>1</sup>
	·
	<a href='https://pengsida.net/' target='_blank'>Sida Peng</a><sup>1†</sup>
	·
	<a href='https://wangjingbo1219.github.io/' target='_blank'>Jingbo Wang</a><sup>6</sup>
	<br>
	<br>
	<sup>1</sup>Zhejiang University <sup>2</sup>The Chinese University of Hong Kong, Shenzhen <sup>3</sup>The University of Hong Kong <br><sup>4</sup>Shanghai Jiao Tong University <sup>5</sup>DeepGlint <sup>6</sup>Shanghai AI Lab
	<br>
	<strong>ICCV 2025</strong>

	</p>
	</p>
	<p align="center">
	<a href='https://arxiv.org/abs/2503.15451'>
	<img src='https://img.shields.io/badge/Arxiv-2503.15451-A42C25?style=flat&logo=arXiv&logoColor=A42C25'></a>
	<a href='https://arxiv.org/pdf/2503.15451'>
	<img src='https://img.shields.io/badge/Paper-PDF-blue?style=flat&logo=arXiv&logoColor=blue'></a>
	<a href='https://zju3dv.github.io/MotionStreamer/'>
	<img src='https://img.shields.io/badge/Project-Page-green?style=flat&logo=Google%20chrome&logoColor=green'></a>
	<a href='https://huggingface.co/datasets/lxxiao/272-dim-HumanML3D'>
	<img src='https://img.shields.io/badge/Data-Download-yellow?style=flat&logo=huggingface&logoColor=yellow'></a>
	</p>

	<img width="1385" alt="image" src="assets/teaser.jpg"/>

	## 🔥 News

	- [2025-06] MotionStreamer has been accepted to ICCV 2025! 🎉

	## TODO List

	- [x] Release the processing script of 272-dim motion representation.
	- [x] Release the processed 272-dim Motion Representation of [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset. Only for academic usage.
	- [x] Release the training code and checkpoint of our [TMR](https://github.com/Mathux/TMR)-based motion evaluator trained on the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset.
	- [x] Release the training and evaluation code as well as checkpoint of Causal TAE.
	- [x] Release the training code of original motion generation model and streaming generation model (MotionStreamer).
	- [x] Release the checkpoint and demo inference code of original motion generation model.
	- [ ] Release complete code for MotionStreamer.

	## 🏃 Motion Representation
	For more details of how to obtain the 272-dim motion representation, as well as other useful tools (e.g., Visualization and Conversion to BVH format), please refer to our [GitHub repo](https://github.com/Li-xingXiao/272-dim-Motion-Representation).

	## Installation

	### 🐍 Python Virtual Environment
	```sh
	conda env create -f environment.yaml
	conda activate mgpt
	```

	### 🤗 Hugging Face Mirror
	Since all of our models and data are available on Hugging Face, if Hugging Face is not directly accessible, you can use the HF-mirror tools following:
	```sh
	pip install -U huggingface_hub
	export HF_ENDPOINT=https://hf-mirror.com
	```

	## 📥 Data Preparation
	To facilitate researchers, we provide the processed 272-dim Motion Representation of:
	> HumanML3D dataset at [this link](https://huggingface.co/datasets/lxxiao/272-dim-HumanML3D).

	> BABEL dataset at [this link](https://huggingface.co/datasets/lxxiao/272-dim-BABEL).

	❗️❗️❗️ The processed data is solely for academic purposes. Make sure you read through the [AMASS License](https://amass.is.tue.mpg.de/license.html).

	1. Download the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset following:
	```bash
	huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-HumanML3D --local-dir ./humanml3d_272
	cd ./humanml3d_272
	unzip texts.zip
	unzip motion_data.zip
	```
	The dataset is organized as:
	```
	./humanml3d_272
	├── mean_std
	├── Mean.npy
	├── Std.npy
	├── split
	├── train.txt
	├── val.txt
	├── test.txt
	├── texts
	├── 000000.txt
	...
	├── motion_data
	├── 000000.npy
	...
	```

	2. Download the processed 272-dim [BABEL](https://babel.is.tue.mpg.de/) dataset following:
	```bash
	huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL --local-dir ./babel_272
	cd ./babel_272
	unzip texts.zip
	unzip motion_data.zip
	```
	The dataset is organized as:
	```
	./babel_272
	├── t2m_babel_mean_std
	├── Mean.npy
	├── Std.npy
	├── split
	├── train.txt
	├── val.txt
	├── texts
	├── 000000.txt
	...
	├── motion_data
	├── 000000.npy
	...
	```

	3. Download the processed streaming 272-dim [BABEL](https://babel.is.tue.mpg.de/) dataset following:
	```bash
	huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL-stream --local-dir ./babel_272_stream
	cd ./babel_272_stream
	unzip train_stream.zip
	unzip train_stream_text.zip
	unzip val_stream.zip
	unzip val_stream_text.zip
	```
	The dataset is organized as:
	```
	./babel_272_stream
	├── train_stream
	├── seq1.npy
	...
	├── train_stream_text
	├── seq1.txt
	...
	├── val_stream
	├── seq1.npy
	...
	├── val_stream_text
	├── seq1.txt
	...
	```
	> NOTE: We process the original BABEL dataset to support training of streaming motion generation. e.g. If there is a motion sequence A, annotated as (A1, A2, A3, A4) in BABEL dataset, each subsequence has text description: (A1_t, A2_t, A3_t, A4_t).

	> Then, our BABEL-stream is constructed as:

	> seq1: (A1, A2) --- seq1_text: (A1_t*A2_t#A1_length)

	> seq2: (A2, A3) --- seq2_text: (A2_t*A3_t#A2_length)

	> seq3: (A3, A4) --- seq3_text: (A3_t*A4_t#A3_length)

	> Here, * and # is separation symbol, A1_length means the number of frames of subsequence A1.

	## 🚀 Training
	1. Train our [TMR](https://github.com/Mathux/TMR)-based motion evaluator on the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset:
	```bash
	bash TRAIN_evaluator_272.sh
	```
	>After training for 100 epochs, the checkpoint will be stored at:
	``Evaluator_272/experiments/temos/EXP1/checkpoints/``.

	⬇️ We provide the evaluator checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Evaluator_272), download it following:
	```bash
	python humanml3d_272/prepare/download_evaluator_ckpt.py
	```
	>The downloaded checkpoint will be stored at: ``Evaluator_272/``.
	2. Train the Causal TAE:
	```bash
	bash TRAIN_causal_TAE.sh ${NUM_GPUS}
	```
	> e.g., if you have 8 GPUs, run: bash TRAIN_causal_TAE.sh 8

	> The checkpoint will be stored at:
	``Experiments/causal_TAE_t2m_272/``

	> Tensorboard visualization:
	```bash
	tensorboard --logdir='Experiments/causal_TAE_t2m_272'
	```

	⬇️ We provide the Causal TAE checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Causal_TAE), download it following:
	```bash
	python humanml3d_272/prepare/download_Causal_TAE_t2m_272_ckpt.py
	```

	3. Train text to motion model:
	> We provide scripts to train the original text to motion generation model with llama blocks, Two-Forward strategy and QK-Norm, using the motion latents encoded by the Causal TAE (trained in the first stage).

	3.1 Get motion latents:
	```bash
	python get_latent.py --resume-pth Causal_TAE/net_last.pth --latent_dir humanml3d_272/t2m_latents
	```
	3.2 Download [sentence-T5-XXL model](https://huggingface.co/sentence-transformers/sentence-t5-xxl/tree/main) on Hugging Face:
	```bash
	huggingface-cli download --resume-download sentence-transformers/sentence-t5-xxl --local-dir sentencet5-xxl/
	```
	3.3 Train text to motion generation model:
	```bash
	bash TRAIN_t2m.sh ${NUM_GPUS}
	```
	> e.g., if you have 8 GPUs, run: bash TRAIN_t2m.sh 8

	> The checkpoint will be stored at:
	``Experiments/t2m_model/``

	> Tensorboard visualization:
	```bash
	tensorboard --logdir='Experiments/t2m_model'
	```

	⬇️ We provide the text to motion model checkpoint on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Experiments/t2m_model), download it following:
	```bash
	python humanml3d_272/prepare/download_t2m_model_ckpt.py
	```

	4. Train streaming motion generation model (MotionStreamer):
	> We provide scripts to train the streaming motion generation model (MotionStreamer) with llama blocks, Two-Forward strategy and QK-Norm, using the motion latents encoded by the Causal TAE (need to train a new Causal TAE using both HumanML3D-272 and BABEL-272 data).

	4.1 Train a Causal TAE using both HumanML3D-272 and BABEL-272 data:
	```bash
	bash TRAIN_causal_TAE.sh ${NUM_GPUS} t2m_babel_272
	```
	> e.g., if you have 8 GPUs, run: bash TRAIN_causal_TAE.sh 8 t2m_babel_272

	> The checkpoint will be stored at:
	``Experiments/causal_TAE_t2m_babel_272/``

	> Tensorboard visualization:
	```bash
	tensorboard --logdir='Experiments/causal_TAE_t2m_babel_272'
	```

	⬇️ We provide the Causal TAE checkpoint trained using both HumanML3D-272 and BABEL-272 data on [Hugging Face](https://huggingface.co/lxxiao/MotionStreamer/tree/main/Causal_TAE_t2m_babel), download it following:
	```bash
	python humanml3d_272/prepare/download_Causal_TAE_t2m_babel_272_ckpt.py
	```

	4.2 Get motion latents of both HumanML3D-272 and the processed BABEL-272-stream dataset:
	```bash
	python get_latent.py --resume-pth Causal_TAE_t2m_babel/net_last.pth --latent_dir babel_272_stream/t2m_babel_latents --dataname t2m_babel_272
	```

	4.3 Train MotionStreamer model:
	```bash
	bash TRAIN_motionstreamer.sh ${NUM_GPUS}
	```
	> e.g., if you have 8 GPUs, run: bash TRAIN_motionstreamer.sh 8

	> The checkpoint will be stored at:
	``Experiments/motionstreamer_model/``

	> Tensorboard visualization:
	```bash
	tensorboard --logdir='Experiments/motionstreamer_model'
	```

	## 📍 Evaluation

	1. Evaluate the metrics of the processed 272-dim [HumanML3D](https://github.com/EricGuo5513/HumanML3D) dataset:
	```bash
	bash EVAL_GT.sh
	```
	( FID, R@1, R@2, R@3, Diversity and MM-Dist (Matching Score) are reported. )

	2. Evaluate the metrics of Causal TAE:
	```bash
	bash EVAL_causal_TAE.sh
	```
	( FID and MPJPE (mm) are reported. )

	3. Evaluate the metrics of text to motion model:
	```bash
	bash EVAL_t2m.sh
	```
	( FID, R@1, R@2, R@3, Diversity and MM-Dist (Matching Score) are reported. )


	## 🎬 Demo Inference

	1. Inference of text to motion model:
	> [Option1] Recover from joint position
	```bash
	python demo_t2m.py --text 'a person is walking like a mummy.' --mode pos --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth
	```
	> [Option2] Recover from joint rotation
	```bash
	python demo_t2m.py --text 'a person is walking like a mummy.' --mode rot --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth
	```
	> In our 272-dim representation, Inverse Kinematics (IK) is not needed.
	> For further conversion to BVH format, please refer to [this repo](https://github.com/Li-xingXiao/272-dim-Motion-Representation?tab=readme-ov-file#6-representation_272-to-bvh-conversion-optional) (Step 6: Representation_272 to BVH conversion). The BVH format of motion animation can be visualizd and edited in [Blender](https://www.blender.org/features/animation/).




	## 🌹 Acknowledgement
	This repository builds upon the following awesome datasets and projects:
	- [272-dim-Motion-Representation](https://github.com/Li-xingXiao/272-dim-Motion-Representation)
	- [AMASS](https://amass.is.tue.mpg.de/index.html)
	- [HumanML3D](https://github.com/EricGuo5513/HumanML3D)
	- [T2M-GPT](https://github.com/Mael-zys/T2M-GPT)
	- [TMR](https://github.com/Mathux/TMR)
	- [OpenTMA](https://github.com/LinghaoChan/OpenTMA)
	- [Sigma-VAE](https://github.com/orybkin/sigma-vae-pytorch)
	- [Scamo](https://github.com/shunlinlu/ScaMo_code)

	## 🤝🏼 Citation
	If our project is helpful for your research, please consider citing :
	```
	@article{xiao2025motionstreamer,
	title={MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
	author={Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
	journal={arXiv preprint arXiv:2503.15451},
	year={2025}
	}
	```

	## Star History

	[![Star History Chart](https://api.star-history.com/svg?repos=zju3dv/MotionStreamer&type=Date)](https://www.star-history.com/#zju3dv/MotionStreamer&Date)