Update README.md

825daaa verified about 16 hours ago

5.55 kB

	# PRISM-TP2M-1.4B

	<p align="center">
	<a href="https://arxiv.org/abs/2603.08590">
	<img src="https://img.shields.io/badge/Paper-ArXiv-B31B1B?style=for-the-badge&logo=arxiv" alt="Paper"/>
	</a>
	<a href="https://github.com/ZeyuLing/PRISM">
	<img src="https://img.shields.io/badge/Code-GitHub-181717?style=for-the-badge&logo=github" alt="GitHub"/>
	</a>
	<a href="https://www.youtube.com/watch?v=3PBFpYcwGIM">
	<img src="https://img.shields.io/badge/Demo-YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="YouTube"/>
	</a>
	</p>

	<p align="center"><b>PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition</b></p>
	<p align="center"><b>Zeyu Ling</b>, <b>Qing Shuai</b>, <b>Teng Zhang</b>, <b>Shiyang Li</b>, <b>Bo Han</b>, <b>Changqing Zou</b></p>

	---

	## Abstract

	Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts.

	We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time × joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space, without modifying the generator, substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.

	---

	## Demo

	<p align="center">
	<a href="https://www.youtube.com/watch?v=3PBFpYcwGIM">
	<img src="https://img.youtube.com/vi/3PBFpYcwGIM/maxresdefault.jpg" alt="PRISM Demo Video" width="720"/>
	</a>
	</p>
	<p align="center">
	<a href="https://www.youtube.com/watch?v=3PBFpYcwGIM">
	<img src="https://img.shields.io/badge/▶%20Play%20Demo-YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Play Demo"/>
	</a>
	</p>

	---

	## Model Details

	\| \| \|
	\|---\|---\|
	\| Architecture \| Flow-matching DiT transformer with causal spatio-temporal Motion VAE \|
	\| Text encoder \| UMT5 (T5-style) \|
	\| Parameters \| ~1.4B \|
	\| Output \| SMPL/SMPL-X body parameters (22 joints, rotation_6d, 30 fps) \|

	---

	## Download

	This Hugging Face repo contains the pretrained weights. Download to use locally:

	```bash
	pip install huggingface_hub
	huggingface-cli download ZeyuLing/PRISM-TP2M-1.4B --local-dir pretrained_models/prism_1.4b
	```

	Or in Python:

	```python
	from huggingface_hub import snapshot_download
	snapshot_download("ZeyuLing/PRISM-TP2M-1.4B", local_dir="pretrained_models/prism_1.4b")
	```

	For full inference scripts and setup, use the [GitHub repository](https://github.com/ZeyuLing/PRISM) (designed to run inside [versatilemotion](https://github.com/ZeyuLing/versatilemotion)).

	---

	## Usage

	Load from local checkpoint:

	```python
	from mmotion.pipelines.prism_from_pretrained import load_prism_pipeline_from_pretrained

	pipe = load_prism_pipeline_from_pretrained("path/to/pretrained_models/prism_1.4b")
	```

	Text-to-Motion (single segment):

	```python
	smplx_dict = pipe(
	prompts="A person walks forward and waves.",
	negative_prompt="",
	num_frames_per_segment=129,
	num_joints=23,
	guidance_scale=5.0,
	)
	```

	Sequential multi-segment:

	```python
	smplx_dict = pipe(
	prompts=["A person waves.", "A person walks.", "A person bows."],
	num_frames_per_segment=[97, 129, 97],
	guidance_scale=5.0,
	)
	```

	Pose-conditioned (TP2M):

	```python
	smplx_dict = pipe(
	prompts="The person stands up and walks.",
	first_frame_motion_path="/path/to/first_frame.npz",
	num_frames_per_segment=129,
	guidance_scale=5.0,
	)
	```

	---

	## Requirements

	- Python ≥ 3.9
	- PyTorch (CUDA recommended)
	- transformers, diffusers, einops, mmengine
	- SMPL/SMPL-X body model (for full mesh rendering)

	---

	## Citation

	```bibtex
	@article{ling2026prism,
	title={PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition},
	author={Ling, Zeyu and Shuai, Qing and Zhang, Teng and Li, Shiyang and Han, Bo and Zou, Changqing},
	journal={arXiv preprint arXiv:2603.08590},
	year={2026},
	url={https://arxiv.org/abs/2603.08590}
	}
	```

	Links: [Paper](https://arxiv.org/abs/2603.08590) · [Code](https://github.com/ZeyuLing/PRISM) · [Demo](https://www.youtube.com/watch?v=3PBFpYcwGIM)

	---

	## License

	See the [main repository](https://github.com/ZeyuLing/PRISM) for license terms.