---
license: apache-2.0
tags:
- text-to-motion
- motion-generation
- diffusion-forcing
- humanml3d
- computer-animation
library_name: transformers
pipeline_tag: other
---
# FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
<div align="center">
**A state-of-the-art text-to-motion generation model based on Latent Diffusion Forcing**
[Paper](https://arxiv.org/abs/2512.03520) | [Github](https://github.com/ShandaAI/FloodDiffusion) | [Project Page](https://shandaai.github.io/FloodDiffusion/)
</div>
## Overview
We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.
## Model Architecture
The model consists of three main components:
1. **Text Encoder**: UMT5-XXL encoder for text feature extraction
2. **Latent Diffusion Model**: Transformer-based diffusion model operating in latent space
3. **VAE Decoder**: 1D convolutional VAE for decoding latent features to motion sequences
**Technical Specifications:**
- Input: Natural language text
- Output: Motion sequences in two formats:
  - 263-dimensional HumanML3D features (default)
  - 22×3 joint coordinates (optional, with EMA smoothing support)
- Latent dimension: 4
- Upsampling factor: 4× (VAE decoder)
- Frame rate: 20 FPS
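The 4× upsampling factor and the 20 FPS frame rate together fix the relationship between latent tokens, output frames, and clip duration. A minimal sketch of that arithmetic follows; the helper names are illustrative and are not part of the model's API.

```python
import math

UPSAMPLING_FACTOR = 4  # frames produced per latent token by the VAE decoder
FPS = 20               # output frame rate


def frames_for_tokens(num_tokens: int) -> int:
    """Approximate number of output frames for a given number of latent tokens."""
    return num_tokens * UPSAMPLING_FACTOR


def seconds_for_tokens(num_tokens: int) -> float:
    """Approximate clip duration in seconds for a given number of latent tokens."""
    return frames_for_tokens(num_tokens) / FPS


def tokens_for_seconds(seconds: float) -> int:
    """Smallest number of latent tokens that covers at least `seconds`."""
    return math.ceil(seconds * FPS / UPSAMPLING_FACTOR)
```

For example, `tokens_for_seconds(12)` gives the `length=60` used throughout this card.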
## Installation
### Prerequisites
- Python 3.8+
- CUDA-capable GPU with 16GB+ VRAM (recommended)
- 16GB+ system RAM
### Dependencies
**Step 1: Install basic dependencies**
```bash
pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
```
**Step 2: Install Flash Attention (Required)**
Flash Attention requires CUDA and may need to be compiled during installation:
```bash
pip install flash-attn --no-build-isolation
```
**Note:** Flash attention is **required** for this model. If installation fails, please refer to the [official flash-attention installation guide](https://github.com/Dao-AILab/flash-attention#installation-and-features).
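To confirm that both installation steps succeeded, a quick check along the following lines may help. It is a sketch that only tests whether each package can be located for import (the helper name `missing_modules` is illustrative); it does not verify that Flash Attention was built against a matching CUDA version.

```python
import importlib.util

# Import names for the packages installed in Steps 1 and 2.
REQUIRED_MODULES = [
    "torch", "transformers", "huggingface_hub", "lightning",
    "diffusers", "omegaconf", "ftfy", "numpy", "flash_attn",
]


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```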
## Quick Start
### Basic Usage
```python
from transformers import AutoModel

# Load model (trust_remote_code is required for the custom model code)
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords) with EMA smoothing (alpha: 0.0-1.0)
motion_joints = model("a person walking forward", length=60, output_joints=True, smoothing_alpha=0.5)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)
```
### Batch Generation
```python
# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly",
    "a person jumping up and down"
]
lengths = [60, 50, 40]  # Different lengths for each motion
motions = model(texts, length=lengths)
for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")
```
### Multi-Text Motion Transitions
```python
# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)
# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")
```
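Because `text_end` holds cumulative positions, it can be easier to think in terms of how many latent tokens each prompt should last. The sketch below builds the three arguments from per-prompt durations. The helper `build_transition` is hypothetical, and it assumes the total `length` should equal the last endpoint, as in the example above.

```python
from itertools import accumulate


def build_transition(segments):
    """Turn (prompt, num_latent_tokens) pairs into text, length, and text_end arguments."""
    if not segments:
        raise ValueError("at least one (prompt, num_latent_tokens) pair is required")
    prompts = [prompt for prompt, _ in segments]
    # Running total of tokens: the position where each prompt stops applying.
    endpoints = list(accumulate(tokens for _, tokens in segments))
    return {"text": [prompts], "length": [endpoints[-1]], "text_end": [endpoints]}


kwargs = build_transition([("walk forward", 40), ("turn around", 40), ("run back", 40)])
print(kwargs["text_end"])  # [[40, 80, 120]]
# motion = model(**kwargs)  # requires the model loaded in Basic Usage
```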
## API Reference
### `model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False, smoothing_alpha=1.0)`
Generate motion sequences from text descriptions.
**Parameters:**
- **text** (`str`, `List[str]`, or `List[List[str]]`): Text description(s)
  - Single string: Generate one motion
  - List of strings: Batch generation
  - Nested list: Multiple text prompts per motion (for transitions)
- **length** (`int` or `List[int]`, default=60): Number of latent tokens to generate
  - Output frames ≈ `length × 4` (due to VAE upsampling)
  - Example: `length=60` → ~240 frames (~12 seconds at 20 FPS)
- **text_end** (`List[int]` or `List[List[int]]`, optional): Latent token positions for text transitions
  - Only used when `text` is a nested list
  - Specifies when to switch between different text descriptions
  - **IMPORTANT**: Must have the same length as the corresponding text list
  - Example: `text=[["walk", "turn", "sit"]]` requires `text_end=[[20, 40, 60]]` (3 endpoints for 3 texts)
  - Must be in ascending order
- **num_denoise_steps** (`int`, optional): Number of denoising iterations
  - Higher values produce better quality but slower generation
  - Recommended range: 10-50
- **output_joints** (`bool`, default=False): Output format selector
  - `False`: Returns 263-dimensional HumanML3D features
  - `True`: Returns 22×3 joint coordinates for direct visualization
- **smoothing_alpha** (`float`, default=1.0): EMA smoothing factor for joint positions (only used when `output_joints=True`)
  - `1.0`: No smoothing (default)
  - `0.5`: Medium smoothing (recommended for smoother animations)
  - `0.0`: Maximum smoothing
  - Range: 0.0 to 1.0
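To make the meaning of `smoothing_alpha` concrete, here is a sketch of a standard exponential moving average over the time axis, in which `alpha` weights the current frame. This is an assumption about how the option behaves, chosen to match the endpoints listed above (`1.0` leaves the input unchanged, `0.0` holds the first frame); the model's internal implementation may differ, and `ema_smooth` is not part of its API.

```python
import numpy as np


def ema_smooth(joints: np.ndarray, alpha: float) -> np.ndarray:
    """Exponential moving average along axis 0 of a (frames, 22, 3) array."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0.0, 1.0]")
    smoothed = np.array(joints, dtype=np.float64)  # copy; the first frame is kept as-is
    for t in range(1, len(smoothed)):
        # alpha weights the current frame; (1 - alpha) weights the running average.
        smoothed[t] = alpha * joints[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed
```

Under this convention, lower values of `alpha` suppress frame-to-frame jitter more strongly but also make the motion lag behind the raw output.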
**Returns:**
- Single motion:
  - `output_joints=False`: `numpy.ndarray` of shape `(frames, 263)`
  - `output_joints=True`: `numpy.ndarray` of shape `(frames, 22, 3)`
- Batch: `List[numpy.ndarray]` with shapes as above
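Since a single prompt returns one array while a batch returns a list, downstream code may want to treat both cases uniformly. The following sketch does that and reports the frame count and duration of each motion; `describe_motions` is an illustrative helper, not part of the model's API.

```python
import numpy as np

FPS = 20  # frame rate from the technical specifications


def describe_motions(result):
    """Return (num_frames, seconds, per_frame_shape) for each motion in a model result."""
    # A single generation yields one array; a batch yields a list of arrays.
    motions = [result] if isinstance(result, np.ndarray) else list(result)
    return [(m.shape[0], m.shape[0] / FPS, m.shape[1:]) for m in motions]
```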
**Example:**
```python
# Single generation (263-dim features)
motion = model("walk forward", length=60) # Returns (240, 263)
# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True) # Returns (240, 22, 3)
# Batch generation
motions = model(["walk", "run"], length=[60, 50]) # Returns list of 2 arrays
# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (240, 263)
```
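The two constraints on `text_end` (one endpoint per prompt, ascending order) can be checked before calling the model. A minimal sketch for a single motion follows; `validate_text_end` is a hypothetical helper, and it reads "ascending" as strictly increasing, which is an assumption.

```python
def validate_text_end(texts, text_end):
    """Check one motion's prompt list against its transition endpoints."""
    if len(texts) != len(text_end):
        raise ValueError(f"expected {len(texts)} endpoints, got {len(text_end)}")
    # Compare each endpoint with its successor.
    if any(later <= earlier for earlier, later in zip(text_end, text_end[1:])):
        raise ValueError("text_end must be in ascending order")


validate_text_end(["walk", "turn"], [30, 60])  # passes silently
```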
## Update History
- **2025/12/8**: Added EMA smoothing option for joint positions during rendering
## Citation
If you use this model in your research, please cite:
```bibtex
@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}
```
## Troubleshooting
### Common Issues
**ImportError with trust_remote_code:**
```python
# Solution: Add trust_remote_code=True
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True  # Required!
)
```
**Out of Memory:**
```python
# Solution: Generate shorter sequences
motion = model("walk", length=30) # Shorter = less memory
```
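For batch generation, another option is to split the batch into smaller chunks. The sketch below assumes that peak memory grows with batch size, which has not been measured for this model; `generate_in_chunks` is a hypothetical helper that works with any callable following the batch interface described in the API Reference.

```python
def generate_in_chunks(generate, texts, lengths, chunk_size=2):
    """Call `generate` on slices of the batch and concatenate the results."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    motions = []
    for start in range(0, len(texts), chunk_size):
        stop = start + chunk_size
        # Each call sees at most `chunk_size` prompts.
        motions.extend(generate(texts[start:stop], length=lengths[start:stop]))
    return motions


# motions = generate_in_chunks(model, texts, lengths)  # requires the loaded model
```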
**Slow first load:**
The first load downloads ~14GB of model files and may take 5-30 minutes depending on connection speed. Subsequent loads reuse the cached files and skip the download.
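If it is more convenient to fetch the files ahead of time (for example, before moving to a GPU machine with limited connectivity), `huggingface_hub.snapshot_download` can populate the local cache. This is a sketch that assumes every file the model needs is hosted in the `ShandaAI/FloodDiffusion` repository; the wrapper name `prefetch_model` is illustrative.

```python
REPO_ID = "ShandaAI/FloodDiffusion"


def prefetch_model(repo_id: str = REPO_ID) -> str:
    """Download the repository into the Hugging Face cache and return its local path."""
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


# local_path = prefetch_model()  # uncomment to start the download
```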
**Module import errors:**
Ensure all dependencies are installed:
```bash
pip install lightning diffusers omegaconf ftfy numpy
```