---
license: apache-2.0
tags:
- text-to-motion
- motion-generation
- diffusion-forcing
- humanml3d
- computer-animation
library_name: transformers
pipeline_tag: other
---

# FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

**A TINY version of the original FloodDiffusion**

[Paper](https://arxiv.org/abs/2512.03520) | [GitHub](https://github.com/ShandaAI/FloodDiffusion) | [Project Page](https://shandaai.github.io/FloodDiffusion/)

## Installation

### Prerequisites

- Python 3.8+
- CUDA-capable GPU with 16GB+ VRAM (recommended)
- 16GB+ system RAM

### Dependencies

**Step 1: Install basic dependencies**

```bash
pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
```

**Step 2: Install Flash Attention (Required)**

Flash Attention requires CUDA and may need to be compiled during installation:

```bash
pip install flash-attn --no-build-isolation
```

**Note:** Flash Attention is **required** for this model. If installation fails, please refer to the [official flash-attention installation guide](https://github.com/Dao-AILab/flash-attention#installation-and-features).

## Quick Start

### Basic Usage

```python
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusionTiny",
    trust_remote_code=True
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords) with EMA smoothing (alpha: 0.0-1.0)
motion_joints = model("a person walking forward", length=60, output_joints=True, smoothing_alpha=0.5)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)
```

### Batch Generation

```python
# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly",
    "a person jumping up and down"
]
lengths = [60, 50, 40]  # Different lengths for each motion

motions = model(texts, length=lengths)

for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")
```

### Multi-Text Motion Transitions

```python
# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)

# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")
```

## API Reference

### `model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False, smoothing_alpha=1.0)`

Generate motion sequences from text descriptions.

**Parameters:**

- **text** (`str`, `List[str]`, or `List[List[str]]`): Text description(s)
  - Single string: Generate one motion
  - List of strings: Batch generation
  - Nested list: Multiple text prompts per motion (for transitions)
- **length** (`int` or `List[int]`, default=60): Number of latent tokens to generate
  - Output frames ≈ `length × 4` (due to VAE upsampling)
  - Example: `length=60` → ~240 frames (~12 seconds at 20 FPS)
- **text_end** (`List[int]` or `List[List[int]]`, optional): Latent token positions for text transitions
  - Only used when `text` is a nested list
  - Specifies the latent token position at which each text description ends and the next one takes over
  - **IMPORTANT**: Must have the same length as the corresponding text list
  - Example: `text=[["walk", "turn", "sit"]]` requires `text_end=[[20, 40, 60]]` (3 endpoints for 3 texts)
  - Must be in ascending order
- **num_denoise_steps** (`int`, optional): Number of denoising iterations
  - Higher values generally improve quality at the cost of slower generation
  - Recommended range: 10-50
- **output_joints** (`bool`, default=False): Output format selector
  - `False`: Returns 263-dimensional HumanML3D features
  - `True`: Returns 22×3 joint coordinates for direct visualization
- **smoothing_alpha** (`float`, default=1.0): EMA smoothing factor for joint positions (only used when `output_joints=True`)
  - `1.0`: No smoothing (default)
  - `0.5`: Medium smoothing (recommended for smoother animations)
  - `0.0`: Maximum smoothing
  - Range: 0.0 to 1.0

**Returns:**

- Single motion:
  - `output_joints=False`: `numpy.ndarray` of shape `(frames, 263)`
  - `output_joints=True`: `numpy.ndarray` of shape `(frames, 22, 3)`
- Batch: `List[numpy.ndarray]` with shapes as above

**Example:**

```python
# Single generation (263-dim features)
motion = model("walk forward", length=60)  # Returns array of shape (~240, 263)

# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True)  # Returns array of shape (~240, 22, 3)

# Batch generation
motions = model(["walk", "run"], length=[60, 50])  # Returns list of 2 arrays

# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (~240, 263)
```

## Citation

If you use this model in your research, please cite:

```bibtex
@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}
```
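
## Appendix: Choosing `length` and `text_end`

Because `length` and `text_end` are expressed in latent tokens rather than frames or seconds, it can help to do the conversion in code. The sketch below is not part of the FloodDiffusion API: the helper names (`seconds_to_length`, `length_to_frames`, `even_text_end`, `check_text_end`) are hypothetical, and the constants are taken from the API reference (frames ≈ `length × 4`, 20 FPS), so the results are approximate. The check only covers the two documented `text_end` rules; what the model itself does with invalid input is not assumed here.

```python
from typing import List, Sequence

# Constants taken from the API reference; actual output lengths are approximate.
FRAMES_PER_TOKEN = 4  # output frames ≈ length × 4
FPS = 20              # length=60 → ~240 frames → ~12 seconds


def seconds_to_length(seconds: float) -> int:
    """Approximate number of latent tokens for a clip of the given duration."""
    return max(1, round(seconds * FPS / FRAMES_PER_TOKEN))


def length_to_frames(length: int) -> int:
    """Approximate number of output frames for a given number of latent tokens."""
    return length * FRAMES_PER_TOKEN


def even_text_end(num_texts: int, length: int) -> List[int]:
    """Split `length` tokens evenly across prompts (hypothetical helper).

    The last endpoint equals `length`, matching the examples in this README.
    """
    if not 1 <= num_texts <= length:
        raise ValueError("need 1 <= num_texts <= length")
    return [length * (i + 1) // num_texts for i in range(num_texts)]


def check_text_end(texts: Sequence[str], text_end: Sequence[int]) -> None:
    """Check the two documented rules for one motion's `text_end` list."""
    if len(text_end) != len(texts):
        raise ValueError(
            f"text_end has {len(text_end)} entries but there are {len(texts)} texts"
        )
    # "Ascending order" is interpreted here as strictly increasing (assumption).
    if any(b <= a for a, b in zip(text_end, text_end[1:])):
        raise ValueError("text_end must be in ascending order")


if __name__ == "__main__":
    texts = ["walk forward", "turn around", "run back"]
    length = seconds_to_length(24)            # 24 s at 20 FPS, 4 frames per token
    ends = even_text_end(len(texts), length)
    check_text_end(texts, ends)
    print(length, ends, length_to_frames(length))
```

The resulting values can then be passed as `length=[length]` and `text_end=[ends]` together with `text=[texts]`, in the same nested-list form used under Multi-Text Motion Transitions.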