---
license: cc-by-nc-4.0
tags:
- motion
- clip
- text-to-motion
- motion-retrieval
- multimodal
- human-motion
- motion-generation
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
datasets:
- MotionMillion
---

# MotionCLIP

A Motion-Text CLIP model trained on the MotionMillion dataset for motion-text retrieval, zero-shot motion classification, and motion understanding.

> ⚠️ **License Notice**: This model is released under **CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. **This model is for research and non-commercial use only.**

> 📋 **Body Model**: This model was trained on motion data using the **SMPL body model** (22 joints). Input motions must be in SMPL skeleton format.

## Model Description

MotionCLIP learns a joint embedding space between human motion sequences and natural language descriptions. Given a motion sequence (272-dimensional features per frame) and text descriptions, the model can:

- **Retrieve** the most relevant text for a motion (and vice versa)
- **Classify** motions in a zero-shot manner using text labels
- **Compute similarity** between motions and text descriptions

## Usage

### Installation

```bash
pip install torch transformers huggingface_hub numpy
```

### Download the Model Code

Download `motion_clip_hf.py` from this repository or copy it to your project.
### Quick Start

```python
from motion_clip_hf import MotionCLIP
import numpy as np

# Load model (auto-downloads from HuggingFace)
model = MotionCLIP.from_pretrained("khania/motion-clip")

# Encode text
text_emb = model.encode_text(["a person walks forward", "someone is running fast"])
print(f"Text embeddings: {text_emb.shape}")  # (2, 512)

# Encode motion (272-dim absolute root format, variable length)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion
motion_emb = model.encode_motion(motion)
print(f"Motion embedding: {motion_emb.shape}")  # (512,)

# Compute similarity
similarity = model.compute_similarity(motion, ["walking", "running", "jumping", "sitting"])
predicted = ["walking", "running", "jumping", "sitting"][similarity.argmax()]
print(f"Predicted action: {predicted}")
```

### Text-to-Motion Retrieval

```python
# Find the most similar motions for a text query
results = model.retrieve_motion(
    text="a person waves their hand",
    candidate_motions=[motion1, motion2, motion3],  # List of (T, 272) arrays
    top_k=3,
)
for r in results:
    print(f"#{r['rank']}: Motion {r['index']} (score: {r['score']:.4f})")
```

### Motion-to-Text Retrieval

```python
# Find the most similar texts for a motion
results = model.retrieve_text(
    motion=my_motion,  # (T, 272) array
    candidate_texts=["walking", "running", "jumping", "waving", "sitting"],
    top_k=3,
)
for r in results:
    print(f"#{r['rank']}: {r['text']} (score: {r['score']:.4f})")
```

### Zero-Shot Motion Classification

```python
# Define action categories
actions = ["walking", "running", "jumping", "sitting", "waving",
           "kicking", "punching", "dancing", "stretching", "bowing"]

# Classify a motion
similarity = model.compute_similarity(motion, actions)
predicted_action = actions[similarity.argmax()]
confidence = similarity.max()
print(f"Predicted: {predicted_action} (confidence: {confidence:.3f})")
```

## Model Architecture

| Component | Details |
|-----------|---------|
| **Motion Encoder** | 8-layer Transformer |
| **Hidden Dimension** | 768 |
| **Attention Heads** | 12 |
| **Text Encoder** | CLIP ViT-B/32 (fine-tuned) |
| **Embedding Dimension** | 512 |
| **Max Sequence Length** | 1024 frames |

## Motion Format

The model expects **272-dimensional motion features in absolute root format** based on the **SMPL body model** (22 joints).

### SMPL Body Model Requirement

This model was trained exclusively on motion data represented using the [SMPL body model](https://smpl.is.tue.mpg.de/). Your input motions must:

- Use the **SMPL skeleton** with 22 joints
- Follow the SMPL joint ordering
- Be converted to the 272-dimensional HumanML3D-style representation

If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.

### Feature Dimensions

| Dimensions | Description |
|------------|-------------|
| `[0:2]` | Root XZ velocities |
| `[2:8]` | Absolute heading rotation (6D representation) |
| `[8:74]` | Local joint positions (22 joints × 3) |
| `[74:140]` | Local joint velocities (22 joints × 3) |
| `[140:272]` | Joint rotations in 6D (22 joints × 6) |

The model automatically normalizes input motions using the bundled mean/std statistics.
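The feature layout in the table above can be expressed as a slicing helper. This is an illustrative sketch, not part of the released code; `split_motion_features` is a hypothetical name, and only the index boundaries come from the table.

```python
import numpy as np

# Hypothetical helper (not part of the released code) that splits a
# (T, 272) motion array into the named components from the table above.
def split_motion_features(motion: np.ndarray) -> dict:
    assert motion.ndim == 2 and motion.shape[1] == 272
    return {
        "root_xz_velocity":   motion[:, 0:2],      # root XZ velocities
        "heading_6d":         motion[:, 2:8],      # absolute heading rotation (6D)
        "joint_positions":    motion[:, 8:74],     # 22 joints x 3
        "joint_velocities":   motion[:, 74:140],   # 22 joints x 3
        "joint_rotations_6d": motion[:, 140:272],  # 22 joints x 6
    }

motion = np.random.randn(120, 272).astype(np.float32)
parts = split_motion_features(motion)
print(parts["joint_positions"].shape)  # (120, 66)
```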
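The model is trained with a symmetric InfoNCE objective (see Training Details below). The following is a minimal NumPy sketch of that objective under the standard CLIP-style formulation — paired batch, matching pairs on the diagonal, shared temperature — and is an assumption for illustration, not the actual training code.

```python
import numpy as np

def symmetric_info_nce(motion_emb, text_emb, temperature=0.07):
    """Illustrative symmetric InfoNCE over a batch of paired embeddings.

    motion_emb, text_emb: (B, D) arrays where row i of each is a matching pair.
    """
    # L2-normalize both modalities so logits are scaled cosine similarities
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = m @ t.T / temperature  # (B, B); diagonal = matching pairs
    n = len(logits)

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row,
        # with the diagonal entries as the targets
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the motion->text and text->motion directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In the actual model the temperature is a learnable parameter initialized at 0.07, whereas this sketch keeps it fixed for simplicity.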
## Training Details

| Parameter | Value |
|-----------|-------|
| **Dataset** | MotionMillion (~884K training motions) |
| **Batch Size** | 256 |
| **Training Iterations** | 100,000 |
| **Learning Rate (Motion Encoder)** | 1e-4 |
| **Learning Rate (Text Encoder)** | 5e-5 |
| **Loss Function** | Symmetric InfoNCE |
| **Temperature** | Learnable (initialized at 0.07) |

## Performance

Retrieval performance (R@k) on random test subsets:

| Subset Size | Motion→Text R@1 | Motion→Text R@5 | Text→Motion R@1 | Text→Motion R@5 |
|-------------|-----------------|-----------------|-----------------|-----------------|
| 1,000 | 36.2% | 67.8% | 36.4% | 68.1% |
| 5,000 | 17.7% | 42.1% | 17.8% | 42.3% |
| 10,000 | 12.4% | 31.5% | 12.5% | 31.6% |

*Note: Lower R@k on larger subsets is expected as the retrieval task becomes harder.*

## Files in This Repository

| File | Size | Description |
|------|------|-------------|
| `config.json` | 239 B | Model configuration |
| `pytorch_model.bin` | 219 MB | Model weights |
| `mean.npy` | 1.2 KB | Motion normalization mean (272,) |
| `std.npy` | 1.2 KB | Motion normalization std (272,) |

## Limitations

- Trained on English text descriptions only
- Motion format is specific to the HumanML3D-style 272-dim representation
- Best performance on motions similar to the training distribution (daily activities, sports, etc.)

## Citation

```bibtex
@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}
```

## License

**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)

This model is released for **research and non-commercial use only**.

### Why Non-Commercial?

The MotionMillion training dataset aggregates motion data from multiple sources with varying licenses:

- Some datasets permit commercial use
- Some datasets restrict commercial use (e.g., AMASS, BABEL, certain MoCap databases)

To comply with the most restrictive terms, this model is released under CC BY-NC 4.0.
### What This Means

✅ **Allowed:**

- Academic research
- Personal projects
- Non-commercial applications
- Sharing and adapting with attribution

❌ **Not Allowed:**

- Commercial products or services
- Selling access to the model
- Using the model in revenue-generating applications

For commercial licensing inquiries, please contact the authors.