---
license: apache-2.0
tags:
- text-to-motion
- motion-generation
- diffusion-forcing
- humanml3d
- computer-animation
library_name: transformers
pipeline_tag: other
---
# FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
<div align="center">
**A TINY version of the original FloodDiffusion**
[Paper](https://arxiv.org/abs/2512.03520) | [GitHub](https://github.com/ShandaAI/FloodDiffusion) | [Project Page](https://shandaai.github.io/FloodDiffusion/)
</div>
## Installation
### Prerequisites
- Python 3.8+
- CUDA-capable GPU with 16GB+ VRAM (recommended)
- 16GB+ system RAM
### Dependencies
**Step 1: Install basic dependencies**
```bash
pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
```
**Step 2: Install Flash Attention (Required)**
Flash Attention requires CUDA and may need to be compiled from source during installation:
```bash
pip install flash-attn --no-build-isolation
```
**Note:** Flash attention is **required** for this model. If installation fails, please refer to the [official flash-attention installation guide](https://github.com/Dao-AILab/flash-attention#installation-and-features).
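Before loading the model, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper name, not part of this model's API, and it only checks that the packages can be found, not that they work with your GPU.

```python
import importlib.util
import sys


def check_environment():
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    # Import names of the packages installed in Steps 1 and 2.
    # Note: the flash-attn package is imported as `flash_attn`.
    for module in ("torch", "transformers", "flash_attn"):
        report[module] = importlib.util.find_spec(module) is not None
    return report


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If `flash_attn` is reported as missing, revisit Step 2 before calling `from_pretrained`.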
## Quick Start
### Basic Usage
```python
from transformers import AutoModel

# Load the model (the repository ships custom code, hence trust_remote_code=True)
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusionTiny",
    trust_remote_code=True,
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords)
# with EMA smoothing (smoothing_alpha in the range 0.0-1.0)
motion_joints = model(
    "a person walking forward",
    length=60,
    output_joints=True,
    smoothing_alpha=0.5,
)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)
```
### Batch Generation
```python
# Generate multiple motions in a single call
texts = [
    "a person walking forward",
    "a person running quickly",
    "a person jumping up and down",
]
lengths = [60, 50, 40]  # One length (in latent tokens) per text

motions = model(texts, length=lengths)

for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")
```
### Multi-Text Motion Transitions
```python
# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]],  # Transition points in latent tokens
)

# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")
```
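Because `text_end` is given in latent tokens, it is easy to lose track of where each action falls in the output. The sketch below converts transition points to approximate frame ranges and durations, using the `length × 4` frame ratio and the 20 FPS rate stated in the API reference. `segment_schedule` is a hypothetical helper, and it assumes each prompt runs from the previous endpoint up to its own endpoint.

```python
FRAMES_PER_TOKEN = 4  # output frames ≈ length × 4 (VAE upsampling)
FPS = 20              # frame rate used in the API reference example


def segment_schedule(texts, text_end):
    """Return (text, start_frame, end_frame, seconds) for each prompt.

    Frame indices are approximate, since the token-to-frame ratio is approximate.
    """
    schedule = []
    start_token = 0
    for text, end_token in zip(texts, text_end):
        start_frame = start_token * FRAMES_PER_TOKEN
        end_frame = end_token * FRAMES_PER_TOKEN
        seconds = (end_frame - start_frame) / FPS
        schedule.append((text, start_frame, end_frame, seconds))
        start_token = end_token
    return schedule


for text, start, end, seconds in segment_schedule(
    ["walk forward", "turn around", "run back"], [40, 80, 120]
):
    print(f"{text}: frames {start}-{end} (~{seconds:.1f} s)")
```

For the example above, each of the three actions covers roughly 160 frames, or about 8 seconds.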
## API Reference
### `model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False, smoothing_alpha=1.0)`
Generate motion sequences from text descriptions.
**Parameters:**
- **text** (`str`, `List[str]`, or `List[List[str]]`): Text description(s)
- Single string: Generate one motion
- List of strings: Batch generation
- Nested list: Multiple text prompts per motion (for transitions)
- **length** (`int` or `List[int]`, default=60): Number of latent tokens to generate
- Output frames ≈ `length × 4` (due to VAE upsampling)
- Example: `length=60` → ~240 frames (~12 seconds at 20 FPS)
- **text_end** (`List[int]` or `List[List[int]]`, optional): Latent token positions for text transitions
- Only used when `text` is a nested list
- Specifies when to switch between different text descriptions
- **IMPORTANT**: Must have the same length as the corresponding text list
- Example: `text=[["walk", "turn", "sit"]]` requires `text_end=[[20, 40, 60]]` (3 endpoints for 3 texts)
- Must be in ascending order
- **num_denoise_steps** (`int`, optional): Number of denoising iterations
- Higher values improve quality at the cost of slower generation
- Recommended range: 10-50
- **output_joints** (`bool`, default=False): Output format selector
- `False`: Returns 263-dimensional HumanML3D features
- `True`: Returns 22×3 joint coordinates for direct visualization
- **smoothing_alpha** (`float`, default=1.0): EMA smoothing factor for joint positions (only used when `output_joints=True`)
- `1.0`: No smoothing (default)
- `0.5`: Medium smoothing (recommended for smoother animations)
- `0.0`: Maximum smoothing
- Range: 0.0 to 1.0
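The constraints on `text_end` listed above (one endpoint per text, ascending order) can be checked before calling the model. This is a minimal sketch, not part of the model's API: `validate_text_end` is a hypothetical helper, and treating "ascending" as strictly ascending is an assumption.

```python
def validate_text_end(texts, text_end):
    """Raise ValueError if text_end does not match the documented constraints.

    texts:    list of prompts for ONE motion, e.g. ["walk", "turn", "sit"]
    text_end: list of latent-token endpoints, e.g. [20, 40, 60]
    """
    if len(texts) != len(text_end):
        raise ValueError(
            f"text_end has {len(text_end)} entries but there are {len(texts)} texts"
        )
    # Assumption: endpoints must be strictly increasing, so that every
    # prompt covers at least one latent token.
    for previous, current in zip(text_end, text_end[1:]):
        if current <= previous:
            raise ValueError(f"text_end must be ascending, got {text_end}")


validate_text_end(["walk", "turn", "sit"], [20, 40, 60])  # passes silently
```

For a batch with nested lists, the helper can be applied to each `(texts, text_end)` pair in turn.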
**Returns:**
- Single motion:
- `output_joints=False`: `numpy.ndarray` of shape `(frames, 263)`
- `output_joints=True`: `numpy.ndarray` of shape `(frames, 22, 3)`
- Batch: `List[numpy.ndarray]` with shapes as above
**Example:**
```python
# Single generation (263-dim features)
motion = model("walk forward", length=60)  # Returns shape (~240, 263)

# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True)  # Returns shape (~240, 22, 3)

# Batch generation
motions = model(["walk", "run"], length=[60, 50])  # Returns list of 2 arrays

# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]],
)  # Returns list with 1 array of shape (~240, 263)
```
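To build intuition for `smoothing_alpha`, the sketch below applies a standard exponential moving average to an array shaped like the joint output, `(frames, 22, 3)`. It is an illustration of the general technique under the documented convention (`1.0` means no smoothing, `0.0` means maximum smoothing); the model's internal implementation may differ, and `ema_smooth` is a hypothetical name.

```python
import numpy as np


def ema_smooth(joints, alpha):
    """Apply an exponential moving average along the frame axis.

    joints: array of shape (frames, 22, 3)
    alpha:  weight of the current frame, in [0.0, 1.0]
    """
    joints = np.asarray(joints, dtype=np.float64)
    smoothed = np.empty_like(joints)
    smoothed[0] = joints[0]  # the first frame has no history to blend with
    for t in range(1, len(joints)):
        smoothed[t] = alpha * joints[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed


# Stand-in data with the same shape as a length=60 joint output
rng = np.random.default_rng(0)
fake_joints = rng.normal(size=(240, 22, 3))
print(ema_smooth(fake_joints, alpha=0.5).shape)
```

With `alpha=1.0` the output equals the input, and with `alpha=0.0` every frame collapses to the first one, which is why intermediate values such as `0.5` are the practical choice.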
## Citation
If you use this model in your research, please cite:
```bibtex
@article{cai2025flooddiffusion,
title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
journal={arXiv preprint arXiv:2512.03520},
year={2025}
}
```