---
license: apache-2.0
tags:
- text-to-motion
- motion-generation
- diffusion-forcing
- humanml3d
- computer-animation
library_name: transformers
pipeline_tag: other
---

# FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

<div align="center">

**A state-of-the-art text-to-motion generation model based on Latent Diffusion Forcing**

[Paper](https://arxiv.org/abs/2512.03520) | [Github](https://github.com/ShandaAI/FloodDiffusion) | [Project Page](https://shandaai.github.io/FloodDiffusion/)

</div>

## Overview

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.

## Model Architecture

The model consists of three main components:

1. **Text Encoder**: UMT5-XXL encoder for text feature extraction
2. **Latent Diffusion Model**: Transformer-based diffusion model operating in latent space
3. **VAE Decoder**: 1D convolutional VAE for decoding latent features to motion sequences

**Technical Specifications:**
- Input: Natural language text
- Output: Motion sequences in two formats:
  - 263-dimensional HumanML3D features (default)
  - 22×3 joint coordinates (optional, with EMA smoothing support)
- Latent dimension: 4
- Upsampling factor: 4× (VAE decoder)
- Frame rate: 20 FPS
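
The relationship between latent tokens, output frames, and clip duration follows from the upsampling factor and frame rate listed above. The following is a minimal sketch of that arithmetic. The helper names are illustrative and not part of the model's API, and real frame counts may differ slightly, since this card reports them as approximate.

```python
UPSAMPLE_FACTOR = 4  # VAE decoder upsampling, from the specifications above
FPS = 20             # output frame rate, from the specifications above


def tokens_to_frames(length: int) -> int:
    """Approximate number of output frames for `length` latent tokens."""
    return length * UPSAMPLE_FACTOR


def tokens_to_seconds(length: int) -> float:
    """Approximate clip duration in seconds for `length` latent tokens."""
    return tokens_to_frames(length) / FPS


def seconds_to_tokens(seconds: float) -> int:
    """Latent token count that most closely matches a target duration."""
    return max(1, round(seconds * FPS / UPSAMPLE_FACTOR))


print(tokens_to_frames(60))    # 240
print(tokens_to_seconds(60))   # 12.0
print(seconds_to_tokens(6.0))  # 30
```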

## Installation

### Prerequisites

- Python 3.8+
- CUDA-capable GPU with 16GB+ VRAM (recommended)
- 16GB+ system RAM

### Dependencies

**Step 1: Install basic dependencies**

```bash
pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
```

**Step 2: Install Flash Attention (Required)**

Flash Attention requires CUDA and may need to compile from source during installation:

```bash
pip install flash-attn --no-build-isolation
```

**Note:** Flash attention is **required** for this model. If installation fails, please refer to the [official flash-attention installation guide](https://github.com/Dao-AILab/flash-attention#installation-and-features).

## Quick Start

### Basic Usage

```python
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords) with EMA smoothing (smoothing_alpha in 0.0-1.0)
motion_joints = model("a person walking forward", length=60, output_joints=True, smoothing_alpha=0.5)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)
```

### Batch Generation

```python
# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly", 
    "a person jumping up and down"
]
lengths = [60, 50, 40]  # Different lengths for each motion

motions = model(texts, length=lengths)

for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")
```

### Multi-Text Motion Transitions

```python
# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)

# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")
```
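
When transitions are planned in seconds rather than latent tokens, the `text_end` values can be derived by accumulating segment durations. The sketch below assumes, based on the examples in this card, that `text_end` holds cumulative end positions and that the final value equals `length`. `plan_transitions` is a hypothetical helper, not part of the model's API.

```python
from itertools import accumulate
from typing import Dict, List, Tuple

TOKENS_PER_SECOND = 20 / 4  # 20 FPS divided by the 4x VAE upsampling factor


def plan_transitions(segments: List[Tuple[str, float]]) -> Dict[str, list]:
    """Turn (prompt, seconds) pairs into text, length, and text_end arguments."""
    prompts = [prompt for prompt, _ in segments]
    tokens = [max(1, round(seconds * TOKENS_PER_SECOND)) for _, seconds in segments]
    ends = list(accumulate(tokens))  # cumulative end position of each segment
    # Wrap in outer lists to match the nested (batch of one) call format.
    return {"text": [prompts], "length": [ends[-1]], "text_end": [ends]}


kwargs = plan_transitions([("walk forward", 8), ("turn around", 8), ("run back", 8)])
print(kwargs["text_end"])  # [[40, 80, 120]]
print(kwargs["length"])    # [120]
# motion = model(**kwargs)  # requires the loaded model from Basic Usage
```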

## API Reference

### `model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False, smoothing_alpha=1.0)`

Generate motion sequences from text descriptions.

**Parameters:**

- **text** (`str`, `List[str]`, or `List[List[str]]`): Text description(s)
  - Single string: Generate one motion
  - List of strings: Batch generation
  - Nested list: Multiple text prompts per motion (for transitions)

- **length** (`int` or `List[int]`, default=60): Number of latent tokens to generate
  - Output frames ≈ `length × 4` (due to VAE upsampling)
  - Example: `length=60` → ~240 frames (~12 seconds at 20 FPS)

- **text_end** (`List[int]` or `List[List[int]]`, optional): Latent token positions for text transitions
  - Only used when `text` is a nested list
  - Specifies when to switch between different text descriptions
  - **IMPORTANT**: Must have the same length as the corresponding text list
    - Example: `text=[["walk", "turn", "sit"]]` requires `text_end=[[20, 40, 60]]` (3 endpoints for 3 texts)
  - Must be in ascending order

- **num_denoise_steps** (`int`, optional): Number of denoising iterations
  - Higher values generally improve quality at the cost of slower generation
  - Recommended range: 10-50

- **output_joints** (`bool`, default=False): Output format selector
  - `False`: Returns 263-dimensional HumanML3D features
  - `True`: Returns 22×3 joint coordinates for direct visualization

- **smoothing_alpha** (`float`, default=1.0): EMA smoothing factor for joint positions (only used when `output_joints=True`)
  - `1.0`: No smoothing (default)
  - `0.5`: Medium smoothing (recommended for smoother animations)
  - `0.0`: Maximum smoothing
  - Range: 0.0 to 1.0
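
To build intuition for `smoothing_alpha`, the following is a minimal sketch of a standard exponential moving average. It illustrates the general technique and is an assumption, not the model's actual implementation; `ema_smooth` is a hypothetical name. With this formulation, `alpha=1.0` returns the input unchanged and smaller values weight earlier frames more heavily, which is consistent with the parameter description above.

```python
def ema_smooth(frames, alpha):
    """Apply y[t] = alpha * x[t] + (1 - alpha) * y[t-1] along the frame axis.

    `frames` is any sequence of per-frame values: plain floats as used here,
    or NumPy arrays of shape (22, 3) obtained by iterating over joint output.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0.0, 1.0]")
    smoothed = [frames[0]]  # the first frame has no history to blend with
    for frame in frames[1:]:
        smoothed.append(alpha * frame + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Synthetic stand-in for one coordinate of one joint across five frames.
track = [0.0, 1.0, 0.0, 1.0, 0.0]

print(ema_smooth(track, alpha=1.0))  # [0.0, 1.0, 0.0, 1.0, 0.0]
print(ema_smooth(track, alpha=0.5))  # [0.0, 0.5, 0.25, 0.625, 0.3125]
print(ema_smooth(track, alpha=0.0))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

At `alpha=0.0` every frame collapses to the first one, which is why intermediate values such as `0.5` are the practical choice for smoothing.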

**Returns:**
- Single motion: 
  - `output_joints=False`: `numpy.ndarray` of shape `(frames, 263)`
  - `output_joints=True`: `numpy.ndarray` of shape `(frames, 22, 3)`
- Batch: `List[numpy.ndarray]` with shapes as above

**Example:**
```python
# Single generation (263-dim features)
motion = model("walk forward", length=60)  # Returns array of shape (~240, 263)

# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True)  # Returns array of shape (~240, 22, 3)

# Batch generation
motions = model(["walk", "run"], length=[60, 50])  # Returns list of 2 arrays

# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (~240, 263)
```
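
The constraints on `text_end` listed under Parameters can be checked before calling the model. The following is a minimal sketch for a single (non-batched) motion. `validate_text_end` is a hypothetical helper, and the check that the last endpoint does not exceed `length` is an assumption drawn from the examples rather than a documented rule.

```python
from typing import List


def validate_text_end(texts: List[str], text_end: List[int], length: int) -> None:
    """Raise ValueError if text_end breaks the constraints described above."""
    if len(text_end) != len(texts):
        raise ValueError(
            f"text_end has {len(text_end)} endpoints but there are {len(texts)} texts"
        )
    # Documented rule: ascending order (treated here as strictly increasing).
    if any(b <= a for a, b in zip(text_end, text_end[1:])):
        raise ValueError("text_end must be in ascending order")
    # Assumption from the examples: endpoints stay within the requested length.
    if text_end and text_end[-1] > length:
        raise ValueError("last endpoint exceeds length")


validate_text_end(["walk", "turn"], [30, 60], length=60)  # passes silently

try:
    validate_text_end(["walk", "turn", "sit"], [20, 40], length=60)
except ValueError as err:
    print(err)  # text_end has 2 endpoints but there are 3 texts
```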

## Update History

- **2025/12/8**: Added EMA smoothing option for joint positions during rendering

## Citation

If you use this model in your research, please cite:

```bibtex
@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}
```

## Troubleshooting

### Common Issues

**Error when loading without `trust_remote_code`:**
```python
# Solution: Add trust_remote_code=True
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True  # Required!
)
```

**Out of Memory:**
```python
# Solution: Generate shorter sequences
motion = model("walk", length=30)  # Shorter = less memory
```

**Slow first load:**
The first load downloads ~14GB of model files and may take 5-30 minutes depending on your connection speed. Subsequent loads reuse the cached files and skip the download.

**Module import errors:**
Ensure all dependencies are installed:
```bash
pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
```
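
To see which packages are missing before reinstalling, a short check with the standard library can help. This is a sketch rather than an official diagnostic: `missing_packages` is a hypothetical helper, and the list uses the import names of the packages from the Dependencies section (for example, `flash_attn` for `flash-attn`). It only tests whether a package can be located, not whether it imports cleanly against your CUDA setup.

```python
from importlib.util import find_spec

# Import names for the packages listed under Dependencies.
REQUIRED = (
    "torch", "transformers", "huggingface_hub", "lightning",
    "diffusers", "omegaconf", "ftfy", "numpy", "flash_attn",
)


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```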