# PRISM-TP2M-1.4B
<p align="center">
<a href="https://arxiv.org/abs/2603.08590">
<img src="https://img.shields.io/badge/Paper-ArXiv-B31B1B?style=for-the-badge&logo=arxiv" alt="Paper"/>
</a>
<a href="https://github.com/ZeyuLing/PRISM">
<img src="https://img.shields.io/badge/Code-GitHub-181717?style=for-the-badge&logo=github" alt="GitHub"/>
</a>
<a href="https://www.youtube.com/watch?v=3PBFpYcwGIM">
<img src="https://img.shields.io/badge/Demo-YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="YouTube"/>
</a>
</p>
<p align="center"><b>PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition</b></p>
<p align="center"><b>Zeyu Ling</b>, <b>Qing Shuai</b>, <b>Teng Zhang</b>, <b>Shiyang Li</b>, <b>Bo Han</b>, <b>Changqing Zou</b></p>
---
## Abstract
Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts.
We present PRISM, addressing each challenge with a dedicated contribution. **(1) A joint-factorized motion latent space**: each body joint occupies its own token, forming a structured 2D grid (time × joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space, without modifying the generator, substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. **(2) Noise-free condition injection**: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art results on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.
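
To make the noise-free condition injection concrete, here is a minimal sketch in plain Python. It is not the PRISM implementation: the function names, the scalar-per-token latents, and the linear flow-matching interpolation `x_t = (1 - t) * x_0 + t * eps` are assumptions chosen for illustration. It builds a (time × joints) timestep grid in which conditioning frames are pinned to timestep 0 and therefore stay clean, while every other token receives the sampled noise level. Under the same assumptions, segment chaining would reuse the final frames of one segment as the conditioning frames of the next.

```python
import random

def build_timestep_grid(num_frames, num_joints, t, cond_frames):
    """Return a (time x joints) grid of per-token timesteps.

    Tokens in conditioning frames get timestep 0.0 (clean); every other
    token gets the shared noise level t. Hypothetical helper, not PRISM's API.
    """
    cond = set(cond_frames)
    return [
        [0.0 if f in cond else t for _ in range(num_joints)]
        for f in range(num_frames)
    ]

def noise_tokens(clean, timesteps, rng):
    """Interpolate each token toward Gaussian noise by its own timestep.

    Assumes a linear flow-matching path: x_t = (1 - t) * x_0 + t * eps.
    Latents are scalars here only to keep the sketch short.
    """
    return [
        [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x, t in zip(row, t_row)]
        for row, t_row in zip(clean, timesteps)
    ]

rng = random.Random(0)
clean = [[float(f * 10 + j) for j in range(3)] for f in range(4)]  # 4 frames x 3 joints
grid = build_timestep_grid(num_frames=4, num_joints=3, t=0.8, cond_frames=[0])
noisy = noise_tokens(clean, grid, rng)

print(grid[0])               # conditioning frame: every timestep is 0.0
print(noisy[0] == clean[0])  # True: tokens at timestep 0 are left untouched
```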
---
## Demo
<p align="center">
<a href="https://www.youtube.com/watch?v=3PBFpYcwGIM">
<img src="https://img.youtube.com/vi/3PBFpYcwGIM/maxresdefault.jpg" alt="PRISM Demo Video" width="720"/>
</a>
</p>
<p align="center">
<a href="https://www.youtube.com/watch?v=3PBFpYcwGIM">
<img src="https://img.shields.io/badge/▶%20Play%20Demo-YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Play Demo"/>
</a>
</p>
---
## Model Details
| Property | Value |
|---|---|
| **Architecture** | Flow-matching DiT transformer with causal spatio-temporal Motion VAE |
| **Text encoder** | UMT5 (T5-style) |
| **Parameters** | ~1.4B |
| **Output** | SMPL/SMPL-X body parameters (22 joints, rotation_6d, 30 fps) |
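
The `rotation_6d` output can be turned into 3×3 rotation matrices with Gram-Schmidt orthogonalization, the usual construction for the 6D rotation representation. The sketch below is dependency-free and makes one assumption worth checking against the PRISM code: that the six numbers are two 3-vectors stored back to back, which become the first two rows of the matrix once orthonormalized (libraries differ on rows versus columns). The function name is hypothetical.

```python
import math

def rotation_6d_to_matrix(d6):
    """Convert one 6D rotation (a1, a2) to a 3x3 rotation matrix.

    Assumption: d6[:3] and d6[3:] are the two vectors that, once
    orthonormalized, become the first two rows of the matrix.
    """
    a1, a2 = d6[:3], d6[3:]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    def normalize(u):
        n = math.sqrt(dot(u, u))
        return [x / n for x in u]

    b1 = normalize(a1)
    # Remove the component of a2 along b1, then normalize (Gram-Schmidt).
    proj = dot(b1, a2)
    b2 = normalize([y - proj * x for x, y in zip(b1, a2)])
    # The third row is the cross product b1 x b2, giving a right-handed frame.
    b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    return [b1, b2, b3]

# The identity rotation is encoded as the first two rows of the identity matrix.
print(rotation_6d_to_matrix([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
```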
---
## Download
This Hugging Face repo contains the pretrained weights. Download them for local use:
```bash
pip install huggingface_hub
huggingface-cli download ZeyuLing/PRISM-TP2M-1.4B --local-dir pretrained_models/prism_1.4b
```
Or in Python:
```python
from huggingface_hub import snapshot_download
snapshot_download("ZeyuLing/PRISM-TP2M-1.4B", local_dir="pretrained_models/prism_1.4b")
```
For full inference scripts and setup, use the [GitHub repository](https://github.com/ZeyuLing/PRISM) (designed to run inside [versatilemotion](https://github.com/ZeyuLing/versatilemotion)).
---
## Usage
**Load from a local checkpoint:**
```python
from mmotion.pipelines.prism_from_pretrained import load_prism_pipeline_from_pretrained
pipe = load_prism_pipeline_from_pretrained("path/to/pretrained_models/prism_1.4b")
```
**Text-to-Motion (single segment):**
```python
smplx_dict = pipe(
    prompts="A person walks forward and waves.",
    negative_prompt="",
    num_frames_per_segment=129,
    num_joints=23,
    guidance_scale=5.0,
)
```
**Sequential multi-segment:**
```python
smplx_dict = pipe(
    prompts=["A person waves.", "A person walks.", "A person bows."],
    num_frames_per_segment=[97, 129, 97],
    guidance_scale=5.0,
)
```
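
As a quick sanity check on clip lengths, the per-segment frame counts above can be converted to seconds using the 30 fps output rate listed under Model Details. This sketch assumes the segments are simply concatenated; whether consecutive segments share boundary frames is not stated here, so treat the total as approximate. The helper name is hypothetical.

```python
FPS = 30  # output frame rate from the Model Details table

def segment_durations(num_frames_per_segment, fps=FPS):
    """Return the duration in seconds of each segment."""
    return [n / fps for n in num_frames_per_segment]

frames = [97, 129, 97]
durations = segment_durations(frames)
print([round(d, 2) for d in durations])  # [3.23, 4.3, 3.23]
print(round(sum(frames) / FPS, 2))       # 10.77 (assumes no shared frames)
```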
**Pose-conditioned (TP2M):**
```python
smplx_dict = pipe(
    prompts="The person stands up and walks.",
    first_frame_motion_path="/path/to/first_frame.npz",
    num_frames_per_segment=129,
    guidance_scale=5.0,
)
```
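
The layout expected inside `first_frame.npz` is not documented on this page, so it can help to inspect a candidate file before passing it to the pipeline. The sketch below uses NumPy's `np.load` to list the arrays in an `.npz` archive with their shapes and dtypes. The helper name and the example key are made up for the demonstration and are not PRISM's schema.

```python
import os
import tempfile

import numpy as np

def describe_npz(path):
    """Return {array_name: (shape, dtype_string)} for every array in an .npz file."""
    with np.load(path) as data:
        return {name: (data[name].shape, str(data[name].dtype)) for name in data.files}

# Demo on a throwaway file with a hypothetical key (not PRISM's actual schema).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_first_frame.npz")
    np.savez(demo_path, example_pose=np.zeros((1, 22, 6), dtype=np.float32))
    summary = describe_npz(demo_path)

print(summary)
```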
---
## Requirements
- Python ≥ 3.9
- PyTorch (CUDA recommended)
- transformers, diffusers, einops, mmengine
- SMPL/SMPL-X body model (for full mesh rendering)
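
Before running inference, a short standard-library check can confirm that the interpreter and the packages listed above are available. This is a convenience sketch, not part of the PRISM codebase: the function name is hypothetical, `torch` is the import name for PyTorch, and no minimum package versions are checked because none are given here.

```python
import importlib.util
import sys

REQUIRED = ["torch", "transformers", "diffusers", "einops", "mmengine"]

def check_environment(packages=REQUIRED, min_python=(3, 9)):
    """Report whether Python is new enough and which packages are importable."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report

for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'MISSING'}")
```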
---
## Citation
```bibtex
@article{ling2026prism,
title={PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition},
author={Ling, Zeyu and Shuai, Qing and Zhang, Teng and Li, Shiyang and Han, Bo and Zou, Changqing},
journal={arXiv preprint arXiv:2603.08590},
year={2026},
url={https://arxiv.org/abs/2603.08590}
}
```
**Links:** [Paper](https://arxiv.org/abs/2603.08590) · [Code](https://github.com/ZeyuLing/PRISM) · [Demo](https://www.youtube.com/watch?v=3PBFpYcwGIM)
---
## License
See the [main repository](https://github.com/ZeyuLing/PRISM) for license terms.