---
license: apache-2.0
tags:
- text-to-motion
- motion-generation
- diffusion-forcing
- humanml3d
- computer-animation
library_name: transformers
pipeline_tag: other
---

# FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

<div align="center">

**A state-of-the-art text-to-motion generation model based on Latent Diffusion Forcing**

[Paper](https://arxiv.org/abs/2512.03520) | [GitHub](https://github.com/ShandaAI/FloodDiffusion) | [Project Page](https://shandaai.github.io/FloodDiffusion/)

</div>

## Overview

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.

## Model Architecture

The model consists of three main components:

1. **Text Encoder**: UMT5-XXL encoder for text feature extraction
2. **Latent Diffusion Model**: Transformer-based diffusion model operating in latent space
3. **VAE Decoder**: 1D convolutional VAE for decoding latent features to motion sequences

**Technical Specifications:**
- Input: Natural language text
- Output: Motion sequences in two formats:
  - 263-dimensional HumanML3D features (default)
  - 22×3 joint coordinates (optional, with EMA smoothing support)
- Latent dimension: 4
- Upsampling factor: 4× (VAE decoder)
- Frame rate: 20 FPS
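
The specs above imply a simple mapping from latent length to output frames and clip duration. A minimal sketch (hypothetical helper names, not part of the model API), assuming the 4× upsampling factor and 20 FPS stated here:

```python
UPSAMPLE = 4  # VAE decoder upsampling factor
FPS = 20      # output frame rate

def latent_to_frames(length: int) -> int:
    """Approximate number of output motion frames for a latent length."""
    return length * UPSAMPLE

def latent_to_seconds(length: int) -> float:
    """Approximate clip duration in seconds."""
    return latent_to_frames(length) / FPS

print(latent_to_frames(60))   # 240 frames
print(latent_to_seconds(60))  # 12.0 seconds
```

This is the same arithmetic used in the Quick Start examples, where `length=60` yields roughly 240 frames.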

## Installation

### Prerequisites

- Python 3.8+
- CUDA-capable GPU with 16GB+ VRAM (recommended)
- 16GB+ system RAM

### Dependencies

**Step 1: Install basic dependencies**

```bash
pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
```

**Step 2: Install Flash Attention (Required)**

Flash attention requires CUDA and may need compilation:

```bash
pip install flash-attn --no-build-isolation
```

**Note:** Flash attention is **required** for this model. If installation fails, please refer to the [official flash-attention installation guide](https://github.com/Dao-AILab/flash-attention#installation-and-features).

## Quick Start

### Basic Usage

```python
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords) with EMA smoothing
motion_joints = model("a person walking forward", length=60, output_joints=True, smoothing_alpha=0.5)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)
```

### Batch Generation

```python
# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly", 
    "a person jumping up and down"
]
lengths = [60, 50, 40]  # Different lengths for each motion

motions = model(texts, length=lengths)

for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")
```

### Multi-Text Motion Transitions

```python
# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)

# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")
```

## API Reference

### `model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False, smoothing_alpha=1.0)`

Generate motion sequences from text descriptions.

**Parameters:**

- **text** (`str`, `List[str]`, or `List[List[str]]`): Text description(s)
  - Single string: Generate one motion
  - List of strings: Batch generation
  - Nested list: Multiple text prompts per motion (for transitions)

- **length** (`int` or `List[int]`, default=60): Number of latent tokens to generate
  - Output frames ≈ `length × 4` (due to VAE upsampling)
  - Example: `length=60` → ~240 frames (~12 seconds at 20 FPS)

- **text_end** (`List[int]` or `List[List[int]]`, optional): Latent token positions for text transitions
  - Only used when `text` is a nested list
  - Specifies when to switch between different text descriptions
  - **IMPORTANT**: Must have the same length as the corresponding text list
    - Example: `text=[["walk", "turn", "sit"]]` requires `text_end=[[20, 40, 60]]` (3 endpoints for 3 texts)
  - Must be in ascending order

- **num_denoise_steps** (`int`, optional): Number of denoising iterations
  - Higher values produce better quality but slower generation
  - Recommended range: 10-50

- **output_joints** (`bool`, default=False): Output format selector
  - `False`: Returns 263-dimensional HumanML3D features
  - `True`: Returns 22×3 joint coordinates for direct visualization

- **smoothing_alpha** (`float`, default=1.0): EMA smoothing factor for joint positions (only used when `output_joints=True`)
  - `1.0`: No smoothing (default)
  - `0.5`: Medium smoothing (recommended for smoother animations)
  - `0.0`: Maximum smoothing
  - Range: 0.0 to 1.0

**Returns:**
- Single motion: 
  - `output_joints=False`: `numpy.ndarray` of shape `(frames, 263)`
  - `output_joints=True`: `numpy.ndarray` of shape `(frames, 22, 3)`
- Batch: `List[numpy.ndarray]` with shapes as above
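
The `text_end` constraints above (one endpoint per prompt, ascending order) can be sanity-checked before calling the model. A small hypothetical helper, not part of the model API:

```python
def check_text_end(texts, text_end):
    """Validate text_end against the nested text prompts.

    Raises ValueError if any inner list of endpoints does not match the
    number of prompts, or if the endpoints are not in ascending order.
    """
    for prompts, ends in zip(texts, text_end):
        if len(prompts) != len(ends):
            raise ValueError(
                f"text_end needs {len(prompts)} endpoints, got {len(ends)}"
            )
        if any(b <= a for a, b in zip(ends, ends[1:])):
            raise ValueError("text_end must be in ascending order")

# 3 prompts, 3 ascending endpoints: passes silently
check_text_end([["walk", "turn", "sit"]], [[20, 40, 60]])
```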

**Example:**
```python
# Single generation (263-dim features)
motion = model("walk forward", length=60)  # Returns (240, 263)

# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True)  # Returns (240, 22, 3)

# Batch generation
motions = model(["walk", "run"], length=[60, 50])  # Returns list of 2 arrays

# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (240, 263)
```
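
The `smoothing_alpha` semantics above follow a standard exponential moving average: `alpha=1.0` passes joints through unchanged, and smaller values smooth more. A minimal sketch of that behavior (an illustration of EMA over the time axis, not necessarily the model's exact implementation):

```python
import numpy as np

def ema_smooth(joints: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Exponential moving average over the time (first) axis.

    out[t] = alpha * joints[t] + (1 - alpha) * out[t - 1]
    alpha=1.0 leaves the sequence unchanged; alpha near 0 smooths heavily,
    matching the smoothing_alpha range described above.
    """
    out = joints.astype(float).copy()
    for t in range(1, len(out)):
        out[t] = alpha * joints[t] + (1.0 - alpha) * out[t - 1]
    return out

# Example: smooth a (frames, 22, 3) array of joint positions
raw = np.random.randn(240, 22, 3)
smoothed = ema_smooth(raw, alpha=0.5)
assert smoothed.shape == raw.shape
```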

## Update History

- **2025/12/8**: Added EMA smoothing option for joint positions during rendering

## Citation

If you use this model in your research, please cite:

```bibtex
@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Cai, Yiyi and Wu, Yuhan and Li, Kunhang and Zhou, You and Zheng, Bo and Liu, Haiyang},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}
```

## Troubleshooting

### Common Issues

**ImportError with trust_remote_code:**
```python
# Solution: Add trust_remote_code=True
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True  # Required!
)
```

**Out of Memory:**
```python
# Solution: Generate shorter sequences
motion = model("walk", length=30)  # Shorter = less memory
```

**Slow first load:**
The first load downloads ~14GB of model files and may take 5-30 minutes depending on internet speed. Subsequent loads read from the local cache and are much faster.

**Module import errors:**
Ensure all dependencies are installed:
```bash
pip install lightning diffusers omegaconf ftfy numpy
```