---
license: apache-2.0
tags:
- text-to-motion
- motion-generation
- diffusion-forcing
- humanml3d
- computer-animation
library_name: transformers
pipeline_tag: other
---

# FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

<div align="center">

**A state-of-the-art text-to-motion generation model based on Latent Diffusion Forcing**

[Paper](https://arxiv.org/abs/2512.03520) | [Github](https://github.com/ShandaAI/FloodDiffusion) | [Project Page](https://shandaai.github.io/FloodDiffusion/)

</div>

## Overview

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.

## Model Architecture

The model consists of three main components:

1. **Text Encoder**: UMT5-XXL encoder for text feature extraction
2. **Latent Diffusion Model**: Transformer-based diffusion model operating in latent space
3. **VAE Decoder**: 1D convolutional VAE for decoding latent features to motion sequences

**Technical Specifications:**

- Input: Natural language text
- Output: Motion sequences in two formats:
  - 263-dimensional HumanML3D features (default)
  - 22×3 joint coordinates (optional, with EMA smoothing support)
- Latent dimension: 4
- Upsampling factor: 4× (VAE decoder)
- Frame rate: 20 FPS
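
The numbers above fix a simple relationship between the requested latent length, the number of output frames, and the clip duration. A minimal sketch of that arithmetic follows; the helper names are hypothetical and are not part of the model's API, and the real frame count may differ slightly, since this card reports approximate values.

```python
UPSAMPLING_FACTOR = 4  # frames per latent token (VAE decoder, from the specs above)
FPS = 20               # output frame rate, from the specs above


def tokens_to_frames(length: int) -> int:
    """Approximate number of output frames for `length` latent tokens."""
    return length * UPSAMPLING_FACTOR


def tokens_to_seconds(length: int) -> float:
    """Approximate clip duration in seconds for `length` latent tokens."""
    return tokens_to_frames(length) / FPS


if __name__ == "__main__":
    for length in (30, 60, 120):
        print(length, tokens_to_frames(length), tokens_to_seconds(length))
```

For example, `length=60` gives about 240 frames, or about 12 seconds of motion.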

## Installation

### Prerequisites

- Python 3.8+
- CUDA-capable GPU with 16GB+ VRAM (recommended)
- 16GB+ system RAM

### Dependencies

**Step 1: Install basic dependencies**

```bash
pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
```

**Step 2: Install Flash Attention (Required)**

Flash Attention requires CUDA and may need to be compiled from source during installation:

```bash
pip install flash-attn --no-build-isolation
```

**Note:** Flash Attention is **required** for this model. If installation fails, please refer to the [official flash-attention installation guide](https://github.com/Dao-AILab/flash-attention#installation-and-features).
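
Before loading the model, it can help to confirm that every dependency is importable. The sketch below uses only the standard library; the import names are assumptions derived from the pip packages listed above (for instance, `flash-attn` is imported as `flash_attn`), and `find_missing` is a hypothetical helper rather than anything shipped with the model.

```python
import importlib.util

# Import names assumed to correspond to the pip packages listed above.
REQUIRED_MODULES = [
    "torch", "transformers", "huggingface_hub", "lightning",
    "diffusers", "omegaconf", "ftfy", "numpy", "flash_attn",
]


def find_missing(modules):
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

This only checks that a module can be located; it does not verify versions or that Flash Attention was built against the installed CUDA toolkit.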

## Quick Start

### Basic Usage

```python
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords) with EMA smoothing (alpha in 0.0-1.0)
motion_joints = model("a person walking forward", length=60, output_joints=True, smoothing_alpha=0.5)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)
```

### Batch Generation

```python
# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly",
    "a person jumping up and down"
]
lengths = [60, 50, 40]  # Different lengths for each motion

motions = model(texts, length=lengths)

for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")
```

### Multi-Text Motion Transitions

```python
# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)

# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")
```
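
Since `text_end` is expressed in latent tokens, it is often easier to plan a sequence in seconds and convert. The following is a minimal sketch under the assumptions stated elsewhere in this card (4 frames per latent token, 20 FPS); `build_text_end` is a hypothetical helper, not part of the model's API, and it rounds each segment to the nearest whole token.

```python
from itertools import accumulate

UPSAMPLING_FACTOR = 4  # frames per latent token
FPS = 20               # output frame rate


def build_text_end(durations_s):
    """Turn per-prompt durations (seconds) into cumulative latent-token endpoints."""
    tokens = [max(1, round(d * FPS / UPSAMPLING_FACTOR)) for d in durations_s]
    return list(accumulate(tokens))


prompts = ["walk forward", "turn around", "run back"]
text_end = build_text_end([8, 8, 8])  # 8 seconds per action
total_length = text_end[-1]           # total latent tokens to request

# With a loaded model, the call would then follow the pattern shown above:
# motion = model(text=[prompts], length=[total_length], text_end=[text_end])
```

With 8 seconds per action this reproduces the endpoints used in the example above (40, 80, 120).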

## API Reference

### `model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False, smoothing_alpha=1.0)`

Generate motion sequences from text descriptions.

**Parameters:**

- **text** (`str`, `List[str]`, or `List[List[str]]`): Text description(s)
  - Single string: Generate one motion
  - List of strings: Batch generation
  - Nested list: Multiple text prompts per motion (for transitions)
- **length** (`int` or `List[int]`, default=60): Number of latent tokens to generate
  - Output frames ≈ `length × 4` (due to VAE upsampling)
  - Example: `length=60` → ~240 frames (~12 seconds at 20 FPS)
- **text_end** (`List[int]` or `List[List[int]]`, optional): Latent token positions for text transitions
  - Only used when `text` is a nested list
  - Specifies the latent token at which each text description ends and the next one takes over
  - **IMPORTANT**: Must have the same number of entries as the corresponding text list
  - Example: `text=[["walk", "turn", "sit"]]` requires `text_end=[[20, 40, 60]]` (3 endpoints for 3 texts)
  - Must be in ascending order
- **num_denoise_steps** (`int`, optional): Number of denoising iterations
  - Higher values improve quality but slow down generation
  - Recommended range: 10-50
- **output_joints** (`bool`, default=False): Output format selector
  - `False`: Returns 263-dimensional HumanML3D features
  - `True`: Returns 22×3 joint coordinates for direct visualization
- **smoothing_alpha** (`float`, default=1.0): EMA smoothing factor for joint positions (only used when `output_joints=True`)
  - `1.0`: No smoothing (default)
  - `0.5`: Medium smoothing (recommended for smoother animations)
  - `0.0`: Maximum smoothing
  - Range: 0.0 to 1.0

**Returns:**

- Single motion:
  - `output_joints=False`: `numpy.ndarray` of shape `(frames, 263)`
  - `output_joints=True`: `numpy.ndarray` of shape `(frames, 22, 3)`
- Batch: `List[numpy.ndarray]` with shapes as above
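
To make the meaning of `smoothing_alpha` concrete, here is a sketch of a standard exponential moving average over frames. This is an assumption about the general technique, not the model's actual implementation: it simply matches the documented behavior that `1.0` leaves frames unchanged and smaller values smooth more. The function name is hypothetical, and it works on any sequence of frames that support arithmetic (plain floats or NumPy arrays of shape `(22, 3)`).

```python
def ema_smooth(frames, alpha):
    """Exponential moving average: y[t] = alpha * x[t] + (1 - alpha) * y[t-1]."""
    smoothed = []
    prev = None
    for frame in frames:
        # The first frame is passed through unchanged; later frames are blended
        # with the previous smoothed frame.
        prev = frame if prev is None else alpha * frame + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


if __name__ == "__main__":
    signal = [0.0, 1.0, 1.0]
    print(ema_smooth(signal, 1.0))  # alpha=1.0 keeps every frame as-is
    print(ema_smooth(signal, 0.5))  # alpha=0.5 blends each frame with its history
```

Under this formulation, `alpha=0.0` would hold the first frame for the whole clip, which is why intermediate values such as `0.5` are the practical choice.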

**Example:**

```python
# Single generation (263-dim features)
motion = model("walk forward", length=60)  # Returns (240, 263)

# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True)  # Returns (240, 22, 3)

# Batch generation
motions = model(["walk", "run"], length=[60, 50])  # Returns list of 2 arrays

# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (240, 263)
```
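
The two documented constraints on `text_end` (same number of entries as the prompt list, ascending order) are easy to check before calling the model. The helper below is a hypothetical sketch, not part of the model's API; treating equal consecutive endpoints as invalid is an assumption, since the card only says "ascending order".

```python
def check_transitions(texts, text_end):
    """Return a list of problems for one motion's prompts and endpoints (empty if none)."""
    problems = []
    if len(texts) != len(text_end):
        problems.append(
            f"expected {len(texts)} endpoints for {len(texts)} texts, got {len(text_end)}"
        )
    # Assumption: endpoints must be strictly increasing.
    if any(b <= a for a, b in zip(text_end, text_end[1:])):
        problems.append("text_end must be in ascending order")
    return problems


if __name__ == "__main__":
    print(check_transitions(["walk", "turn", "sit"], [20, 40, 60]))
    print(check_transitions(["walk", "turn"], [60, 30, 90]))
```

For batched nested inputs, the same check can be applied to each `(texts, text_end)` pair in turn.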

## Update History

- **2025/12/8**: Added EMA smoothing option for joint positions during rendering

## Citation

If you use this model in your research, please cite:

```bibtex
@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}
```

## Troubleshooting

### Common Issues

**ImportError with trust_remote_code:**

```python
from transformers import AutoModel

# Solution: Add trust_remote_code=True
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True  # Required!
)
```

**Out of Memory:**

```python
# Solution: Generate shorter sequences
motion = model("walk", length=30)  # Shorter = less memory
```

**Slow first load:**

The first load downloads ~14GB of model files and may take 5-30 minutes depending on internet speed. Subsequent loads reuse the cached files and skip the download.
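
To avoid paying the download cost during the first model call, the files can be fetched into the local cache ahead of time. The sketch below uses `snapshot_download` from `huggingface_hub`; the wrapper names are hypothetical, the ~14GB size is taken from this card, and the time estimate assumes a steady connection and decimal gigabytes.

```python
def estimate_download_minutes(size_gb: float, mbit_per_s: float) -> float:
    """Rough download time in minutes (1 GB = 8000 Mbit, steady throughput)."""
    return size_gb * 8000 / mbit_per_s / 60


def prefetch(repo_id: str) -> str:
    """Download all files of a Hub repository into the local cache and return its path."""
    # Imported lazily so the estimate above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


if __name__ == "__main__":
    print(f"~{estimate_download_minutes(14, 100):.1f} minutes at 100 Mbit/s")
    # Uncomment to fetch the files (repo id as used in Quick Start):
    # print(prefetch("ShandaAI/FloodDiffusion"))
```

Once the files are cached, `AutoModel.from_pretrained` should find them locally instead of downloading again.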

**Module import errors:**

Ensure all dependencies are installed:

```bash
pip install lightning diffusers omegaconf ftfy numpy
```