---
license: cc-by-nc-4.0
tags:
- motion
- clip
- text-to-motion
- motion-retrieval
- multimodal
- human-motion
- motion-generation
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
datasets:
- MotionMillion
---

# MotionCLIP

A Motion-Text CLIP model trained on the MotionMillion dataset for motion-text retrieval, zero-shot motion classification, and motion understanding.

> ⚠️ **License Notice**: This model is released under **CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. **This model is for research and non-commercial use only.**

> 📋 **Body Model**: This model was trained on motion data using the **SMPL body model** (22 joints). Input motions must be in SMPL skeleton format.

## Model Description

MotionCLIP learns a joint embedding space between human motion sequences and natural language descriptions. Given a motion sequence (272-dimensional features per frame) and text descriptions, the model can:

- **Retrieve** the most relevant text for a motion (and vice versa)
- **Classify** motions in a zero-shot manner using text labels
- **Compute similarity** between motions and text descriptions

## Usage

### Installation

```bash
pip install torch transformers huggingface_hub numpy
```

### Download the Model Code

Download `motion_clip_hf.py` from this repository and place it in your project.

### Quick Start

```python
from motion_clip_hf import MotionCLIP
import numpy as np

# Load model (auto-downloads weights from the Hugging Face Hub)
model = MotionCLIP.from_pretrained("khania/motion-clip")

# Encode text
text_emb = model.encode_text(["a person walks forward", "someone is running fast"])
print(f"Text embeddings: {text_emb.shape}")  # (2, 512)

# Encode motion (272-dim absolute root format, variable length)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion
motion_emb = model.encode_motion(motion)
print(f"Motion embedding: {motion_emb.shape}")  # (512,)

# Compute similarity against a set of text labels
labels = ["walking", "running", "jumping", "sitting"]
similarity = model.compute_similarity(motion, labels)
predicted = labels[similarity.argmax()]
print(f"Predicted action: {predicted}")
```

### Text-to-Motion Retrieval

```python
# Find most similar motions for a text query
results = model.retrieve_motion(
    text="a person waves their hand",
    candidate_motions=[motion1, motion2, motion3],  # List of (T, 272) arrays
    top_k=3
)
for r in results:
    print(f"#{r['rank']}: Motion {r['index']} (score: {r['score']:.4f})")
```

### Motion-to-Text Retrieval

```python
# Find most similar texts for a motion
results = model.retrieve_text(
    motion=my_motion,  # (T, 272) array
    candidate_texts=["walking", "running", "jumping", "waving", "sitting"],
    top_k=3
)
for r in results:
    print(f"#{r['rank']}: {r['text']} (score: {r['score']:.4f})")
```

### Zero-Shot Motion Classification

```python
# Define action categories
actions = ["walking", "running", "jumping", "sitting", "waving",
           "kicking", "punching", "dancing", "stretching", "bowing"]

# Classify a motion
similarity = model.compute_similarity(motion, actions)
predicted_action = actions[similarity.argmax()]
confidence = similarity.max()
print(f"Predicted: {predicted_action} (confidence: {confidence:.3f})")
```

## Model Architecture

| Component | Details |
|-----------|---------|
| **Motion Encoder** | 8-layer Transformer |
| **Hidden Dimension** | 768 |
| **Attention Heads** | 12 |
| **Text Encoder** | CLIP ViT-B/32 (fine-tuned) |
| **Embedding Dimension** | 512 |
| **Max Sequence Length** | 1024 frames |

## Motion Format

The model expects **272-dimensional motion features in absolute root format** based on the **SMPL body model** (22 joints).

### SMPL Body Model Requirement

This model was trained exclusively on motion data represented using the [SMPL body model](https://smpl.is.tue.mpg.de/). Your input motions must:

- Use the **SMPL skeleton** with 22 joints
- Follow the SMPL joint ordering
- Be converted to the 272-dimensional HumanML3D-style representation

If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.

### Feature Dimensions

| Dimensions | Description |
|------------|-------------|
| `[0:2]` | Root XZ velocities |
| `[2:8]` | Absolute heading rotation (6D representation) |
| `[8:74]` | Local joint positions (22 joints × 3) |
| `[74:140]` | Local joint velocities (22 joints × 3) |
| `[140:272]` | Joint rotations in 6D (22 joints × 6) |

The model automatically normalizes input motions using the bundled mean/std statistics.

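As a concrete illustration, the layout above can be unpacked with plain NumPy slicing. The `mean`/`std` arrays below are zero/one stand-ins for the repository's bundled `mean.npy`/`std.npy` (the model applies this normalization for you; the snippet only shows what it amounts to):

```python
import numpy as np

# Stand-ins for the bundled statistics; in practice, load mean.npy / std.npy.
mean = np.zeros(272, dtype=np.float32)
std = np.ones(272, dtype=np.float32)

motion = np.random.randn(120, 272).astype(np.float32)  # (T, 272) dummy motion
normalized = (motion - mean) / std  # the normalization the model applies internally

root_vel = motion[:, 0:2]                            # root XZ velocities
heading = motion[:, 2:8]                             # absolute heading rotation (6D)
joint_pos = motion[:, 8:74].reshape(-1, 22, 3)       # local joint positions
joint_vel = motion[:, 74:140].reshape(-1, 22, 3)     # local joint velocities
joint_rot = motion[:, 140:272].reshape(-1, 22, 6)    # joint rotations (6D)

print(joint_pos.shape, joint_rot.shape)  # (120, 22, 3) (120, 22, 6)
```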
## Training Details

| Parameter | Value |
|-----------|-------|
| **Dataset** | MotionMillion (~884K training motions) |
| **Batch Size** | 256 |
| **Training Iterations** | 100,000 |
| **Learning Rate (Motion Encoder)** | 1e-4 |
| **Learning Rate (Text Encoder)** | 5e-5 |
| **Loss Function** | Symmetric InfoNCE |
| **Temperature** | Learnable (initialized at 0.07) |

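For reference, the symmetric InfoNCE objective can be sketched in a few lines of NumPy. This is an illustrative sketch, not the actual training code: a fixed temperature stands in for the learnable parameter, and the function name `symmetric_infonce` is our own:

```python
import numpy as np

def symmetric_infonce(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (motion, text) embedding pairs."""
    # L2-normalize so the dot product is cosine similarity
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (m @ t.T) / temperature  # (B, B); matched pairs lie on the diagonal

    def xent_diag(l):
        # cross-entropy with the diagonal as the target class, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the motion→text and text→motion directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Correctly matched batches should score a much lower loss than mismatched ones, which is what drives the two encoders into a shared embedding space.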
## Performance

Retrieval performance (R@k) on random test subsets:

| Subset Size | Motion→Text R@1 | Motion→Text R@5 | Text→Motion R@1 | Text→Motion R@5 |
|-------------|-----------------|-----------------|-----------------|-----------------|
| 1,000 | 36.2% | 67.8% | 36.4% | 68.1% |
| 5,000 | 17.7% | 42.1% | 17.8% | 42.3% |
| 10,000 | 12.4% | 31.5% | 12.5% | 31.6% |

*Note: Lower R@k on larger subsets is expected as the retrieval task becomes harder.*

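R@k is the fraction of queries whose ground-truth match appears among the top-k retrieved candidates. A minimal sketch of how it can be computed from a query-by-candidate similarity matrix, assuming matched pairs lie on the diagonal (`recall_at_k` is a hypothetical helper, not part of this repository):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose true match (the diagonal) is in the top-k."""
    order = np.argsort(-sim, axis=1)          # candidates by descending similarity
    truth = np.arange(sim.shape[0])[:, None]  # ground-truth index per query
    return float((order[:, :k] == truth).any(axis=1).mean())

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.5, 0.4, 0.3]])
print(recall_at_k(sim, 1))  # 2/3: query 2 ranks candidate 0 above its true match
```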
## Files in This Repository

| File | Size | Description |
|------|------|-------------|
| `config.json` | 239 B | Model configuration |
| `pytorch_model.bin` | 219 MB | Model weights |
| `mean.npy` | 1.2 KB | Motion normalization mean (272,) |
| `std.npy` | 1.2 KB | Motion normalization std (272,) |

## Limitations

- Trained on English text descriptions only
- Motion format is specific to the HumanML3D-style 272-dim representation
- Best performance on motions similar to the training distribution (daily activities, sports, etc.)

## Citation

```bibtex
@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}
```

## License

**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)

This model is released for **research and non-commercial use only**.

### Why Non-Commercial?

The MotionMillion training dataset aggregates motion data from multiple sources with varying licenses:

- Some datasets permit commercial use
- Some datasets restrict commercial use (e.g., AMASS, BABEL, certain MoCap databases)

To comply with the most restrictive terms, this model is released under CC BY-NC 4.0.

### What This Means

✅ **Allowed:**
- Academic research
- Personal projects
- Non-commercial applications
- Sharing and adapting with attribution

❌ **Not Allowed:**
- Commercial products or services
- Selling access to the model
- Using the model in revenue-generating applications

For commercial licensing inquiries, please contact the authors.