---
license: apache-2.0
tags:
- text-to-motion
- motion-generation
- diffusion-forcing
- humanml3d
- computer-animation
library_name: transformers
pipeline_tag: other
---

# FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

<div align="center">

**A TINY version of the original FloodDiffusion**

[Paper](https://arxiv.org/abs/2512.03520) | [Github](https://github.com/ShandaAI/FloodDiffusion) | [Project Page](https://shandaai.github.io/FloodDiffusion/)

</div>

## Installation

### Prerequisites

- Python 3.8+
- CUDA-capable GPU with 16GB+ VRAM (recommended)
- 16GB+ system RAM

### Dependencies

**Step 1: Install basic dependencies**

```bash
pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
```
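
After installing, you can sanity-check that PyTorch sees a suitable GPU (a minimal sketch; the 16GB figure mirrors the recommendation above):

```python
import torch

# Confirm a CUDA-capable GPU is visible to PyTorch
assert torch.cuda.is_available(), "No CUDA GPU detected"

# Report total VRAM of the first GPU (16GB+ is recommended)
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
```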

**Step 2: Install Flash Attention (Required)**

Flash Attention requires CUDA and may need to be compiled from source:

```bash
pip install flash-attn --no-build-isolation
```

**Note:** Flash attention is **required** for this model. If installation fails, please refer to the [official flash-attention installation guide](https://github.com/Dao-AILab/flash-attention#installation-and-features).
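
A quick way to verify the install is to import the package (the version attribute is standard for the package, though treat this as an assumption):

```python
# If this import fails, flash-attn did not build correctly
import flash_attn
print(flash_attn.__version__)
```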

## Quick Start

### Basic Usage

```python
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusionTiny",
    trust_remote_code=True
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords) with EMA smoothing (alpha in 0.0-1.0)
motion_joints = model("a person walking forward", length=60, output_joints=True, smoothing_alpha=0.5)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)
```
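
Since the outputs are plain NumPy arrays (see the API reference below), persisting them for later playback is straightforward; a minimal sketch (the file names are arbitrary):

```python
import numpy as np

# Save the generated features/joints from the example above
np.save("motion_features.npy", motion)       # (~240, 263)
np.save("motion_joints.npy", motion_joints)  # (~240, 22, 3)
```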

### Batch Generation

```python
# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly", 
    "a person jumping up and down"
]
lengths = [60, 50, 40]  # Different lengths for each motion

motions = model(texts, length=lengths)

for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")
```

### Multi-Text Motion Transitions

```python
# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)

# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")
```
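
Because `text_end` is specified in latent tokens while the output is in frames, multiplying each endpoint by the ~4× VAE upsampling factor gives the approximate frame at which each action ends (a sketch based on the token-to-frame relationship documented below):

```python
# text_end is in latent tokens; output frames ≈ tokens × 4
prompts = ["walk forward", "turn around", "run back"]
text_end = [40, 80, 120]

for prompt, end_tok in zip(prompts, text_end):
    print(f"{prompt!r} ends near frame {end_tok * 4}")
# 'walk forward' ends near frame 160, 'turn around' near 320, ...
```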

## API Reference

### `model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False, smoothing_alpha=1.0)`

Generate motion sequences from text descriptions.

**Parameters:**

- **text** (`str`, `List[str]`, or `List[List[str]]`): Text description(s)
  - Single string: Generate one motion
  - List of strings: Batch generation
  - Nested list: Multiple text prompts per motion (for transitions)

- **length** (`int` or `List[int]`, default=60): Number of latent tokens to generate
  - Output frames ≈ `length × 4` (due to VAE upsampling)
  - Example: `length=60` → ~240 frames (~12 seconds at 20 FPS); a duration-to-length helper is sketched after the examples below

- **text_end** (`List[int]` or `List[List[int]]`, optional): Latent token positions for text transitions
  - Only used when `text` is a nested list
  - Specifies when to switch between different text descriptions
  - **IMPORTANT**: Must have the same length as the corresponding text list
    - Example: `text=[["walk", "turn", "sit"]]` requires `text_end=[[20, 40, 60]]` (3 endpoints for 3 texts)
  - Must be in ascending order

- **num_denoise_steps** (`int`, optional): Number of denoising iterations
  - Higher values produce better quality but slower generation
  - Recommended range: 10-50

- **output_joints** (`bool`, default=False): Output format selector
  - `False`: Returns 263-dimensional HumanML3D features
  - `True`: Returns 22×3 joint coordinates for direct visualization

- **smoothing_alpha** (`float`, default=1.0): EMA smoothing factor for joint positions (only used when `output_joints=True`)
  - `1.0`: No smoothing (default)
  - `0.5`: Medium smoothing (recommended for smoother animations)
  - `0.0`: Maximum smoothing
  - Range: 0.0 to 1.0
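
For intuition, this behavior matches a conventional per-frame EMA, where `alpha=1.0` passes joints through unchanged and smaller values weight history more heavily. The sketch below is an illustrative assumption, not the model's exact code:

```python
import numpy as np

def ema_smooth(joints: np.ndarray, alpha: float) -> np.ndarray:
    """Illustrative EMA: out[t] = alpha * x[t] + (1 - alpha) * out[t-1]."""
    out = joints.astype(float).copy()
    for t in range(1, len(out)):
        out[t] = alpha * joints[t] + (1 - alpha) * out[t - 1]
    return out
```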

**Returns:**
- Single motion: 
  - `output_joints=False`: `numpy.ndarray` of shape `(frames, 263)`
  - `output_joints=True`: `numpy.ndarray` of shape `(frames, 22, 3)`
- Batch: `List[numpy.ndarray]` with shapes as above

**Example:**
```python
# Single generation (263-dim features)
motion = model("walk forward", length=60)  # Returns (240, 263)

# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True)  # Returns (240, 22, 3)

# Batch generation
motions = model(["walk", "run"], length=[60, 50])  # Returns list of 2 arrays

# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (240, 263)
```
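
Since output frames ≈ `length × 4` at roughly 20 FPS, a small helper (hypothetical, built on those documented approximations) can turn a target duration into a `length` argument:

```python
def seconds_to_length(seconds: float, fps: int = 20, upsample: int = 4) -> int:
    """Convert a target duration to latent tokens (frames ≈ length × 4)."""
    return max(1, round(seconds * fps / upsample))

# ~12 seconds of motion → length=60 → ~240 frames
motion = model("a person dancing", length=seconds_to_length(12.0))
```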

## Citation

If you use this model in your research, please cite:

```bibtex
@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Cai, Yiyi and Wu, Yuhan and Li, Kunhang and Zhou, You and Zheng, Bo and Liu, Haiyang},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}
```