---
license: cc-by-nc-4.0
tags:
- motion
- clip
- text-to-motion
- motion-retrieval
- multimodal
- human-motion
- motion-generation
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
datasets:
- MotionMillion
---
# MotionCLIP
A Motion-Text CLIP model trained on the MotionMillion dataset for motion-text retrieval, zero-shot motion classification, and motion understanding.
> ⚠️ **License Notice**: This model is released under **CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. **This model is for research and non-commercial use only.**
> 📋 **Body Model**: This model was trained on motion data using the **SMPL body model** (22 joints). Input motions must be in SMPL skeleton format.
## Model Description
MotionCLIP learns a joint embedding space between human motion sequences and natural language descriptions. Given a motion sequence (272-dimensional features per frame) and text descriptions, the model can:
- **Retrieve** the most relevant text for a motion (and vice versa)
- **Classify** motions in a zero-shot manner using text labels
- **Compute similarity** between motions and text descriptions
## Usage
### Installation
```bash
pip install torch transformers huggingface_hub numpy
```
### Download the Model Code
Download `motion_clip_hf.py` from this repository and place it in your project directory.
### Quick Start
```python
from motion_clip_hf import MotionCLIP
import numpy as np
# Load model (auto-downloads from HuggingFace)
model = MotionCLIP.from_pretrained("khania/motion-clip")
# Encode text
text_emb = model.encode_text(["a person walks forward", "someone is running fast"])
print(f"Text embeddings: {text_emb.shape}") # (2, 512)
# Encode motion (272-dim absolute root format, variable length)
motion = np.random.randn(120, 272).astype(np.float32) # Replace with real motion
motion_emb = model.encode_motion(motion)
print(f"Motion embedding: {motion_emb.shape}") # (512,)
# Compute similarity
actions = ["walking", "running", "jumping", "sitting"]
similarity = model.compute_similarity(motion, actions)
predicted = actions[similarity.argmax()]
print(f"Predicted action: {predicted}")
```
### Text-to-Motion Retrieval
```python
# Find most similar motions for a text query
results = model.retrieve_motion(
    text="a person waves their hand",
    candidate_motions=[motion1, motion2, motion3],  # list of (T, 272) arrays
    top_k=3,
)
for r in results:
    print(f"#{r['rank']}: Motion {r['index']} (score: {r['score']:.4f})")
```
### Motion-to-Text Retrieval
```python
# Find most similar texts for a motion
results = model.retrieve_text(
    motion=my_motion,  # (T, 272) array
    candidate_texts=["walking", "running", "jumping", "waving", "sitting"],
    top_k=3,
)
for r in results:
    print(f"#{r['rank']}: {r['text']} (score: {r['score']:.4f})")
```
### Zero-Shot Motion Classification
```python
# Define action categories
actions = ["walking", "running", "jumping", "sitting", "waving",
           "kicking", "punching", "dancing", "stretching", "bowing"]
# Classify a motion
similarity = model.compute_similarity(motion, actions)
predicted_action = actions[similarity.argmax()]
confidence = similarity.max()  # raw similarity score, not a calibrated probability
print(f"Predicted: {predicted_action} (confidence: {confidence:.3f})")
```
## Model Architecture
| Component | Details |
|-----------|---------|
| **Motion Encoder** | 8-layer Transformer |
| **Hidden Dimension** | 768 |
| **Attention Heads** | 12 |
| **Text Encoder** | CLIP ViT-B/32 (fine-tuned) |
| **Embedding Dimension** | 512 |
| **Max Sequence Length** | 1024 frames |
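For intuition, the table above can be sketched as a PyTorch module: project 272-dim frames to a 768-dim hidden space, run 8 Transformer layers with 12 heads, pool over time, and project to the 512-dim joint embedding. This is an illustrative reconstruction, not the released implementation; all class and variable names here are invented.

```python
import torch
import torch.nn as nn

class MotionEncoderSketch(nn.Module):
    """Illustrative motion encoder matching the architecture table:
    8 Transformer layers, hidden dim 768, 12 heads, 512-dim output."""

    def __init__(self, motion_dim=272, hidden=768, heads=12,
                 layers=8, embed_dim=512, max_len=1024):
        super().__init__()
        self.input_proj = nn.Linear(motion_dim, hidden)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, hidden))
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.output_proj = nn.Linear(hidden, embed_dim)

    def forward(self, motion):                  # motion: (B, T, 272)
        x = self.input_proj(motion) + self.pos_emb[: motion.shape[1]]
        x = self.encoder(x)
        return self.output_proj(x.mean(dim=1))  # mean-pool over time -> (B, 512)

emb = MotionEncoderSketch()(torch.randn(2, 120, 272))
print(emb.shape)  # torch.Size([2, 512])
```

Details such as the pooling strategy and positional encoding are assumptions; only the layer count, widths, and embedding dimension come from the table.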
## Motion Format
The model expects **272-dimensional motion features in absolute root format** based on the **SMPL body model** (22 joints).
### SMPL Body Model Requirement
This model was trained exclusively on motion data represented using the [SMPL body model](https://smpl.is.tue.mpg.de/). Your input motions must:
- Use the **SMPL skeleton** with 22 joints
- Follow the SMPL joint ordering
- Be converted to the 272-dimensional HumanML3D-style representation
If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.
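Before encoding, it can help to sanity-check that a motion array matches the expected layout. The helper below is illustrative and not part of the released code:

```python
import numpy as np

def validate_motion(motion) -> np.ndarray:
    """Check that a motion array has shape (T, 272), casting to float32."""
    motion = np.asarray(motion)
    if motion.ndim != 2 or motion.shape[1] != 272:
        raise ValueError(
            f"Expected shape (T, 272) in SMPL 272-dim format, got {motion.shape}"
        )
    return motion.astype(np.float32, copy=False)

motion = validate_motion(np.random.randn(120, 272))
print(motion.shape, motion.dtype)  # (120, 272) float32
```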
### Feature Dimensions
| Dimensions | Description |
|------------|-------------|
| `[0:2]` | Root XZ velocities |
| `[2:8]` | Absolute heading rotation (6D representation) |
| `[8:74]` | Local joint positions (22 joints × 3) |
| `[74:140]` | Local joint velocities (22 joints × 3) |
| `[140:272]` | Joint rotations in 6D (22 joints × 6) |
The model automatically normalizes input motions using the bundled mean/std statistics.
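The slice boundaries in the table can be indexed directly. A short sketch (boundaries taken from the table; the variable names are illustrative):

```python
import numpy as np

motion = np.random.randn(120, 272).astype(np.float32)  # placeholder motion

root_vel_xz  = motion[:, 0:2]                           # root XZ velocities
heading_6d   = motion[:, 2:8]                           # absolute heading (6D)
joint_pos    = motion[:, 8:74].reshape(-1, 22, 3)       # local joint positions
joint_vel    = motion[:, 74:140].reshape(-1, 22, 3)     # local joint velocities
joint_rot_6d = motion[:, 140:272].reshape(-1, 22, 6)    # joint rotations (6D)

print(joint_pos.shape, joint_rot_6d.shape)  # (120, 22, 3) (120, 22, 6)
```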
## Training Details
| Parameter | Value |
|-----------|-------|
| **Dataset** | MotionMillion (~884K training motions) |
| **Batch Size** | 256 |
| **Training Iterations** | 100,000 |
| **Learning Rate (Motion Encoder)** | 1e-4 |
| **Learning Rate (Text Encoder)** | 5e-5 |
| **Loss Function** | Symmetric InfoNCE |
| **Temperature** | Learnable (initialized at 0.07) |
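Symmetric InfoNCE treats each matched motion-text pair in a batch as a positive and every other pairing as a negative, averaging the contrastive cross-entropy in both directions. A minimal NumPy sketch of the idea (not the actual training code; function and variable names are illustrative):

```python
import numpy as np

def symmetric_info_nce(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (motion, text) embeddings."""
    # L2-normalize so dot products are cosine similarities
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = m @ t.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(logits))       # positives lie on the diagonal

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the motion->text and text->motion directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
loss = symmetric_info_nce(rng.normal(size=(8, 512)), rng.normal(size=(8, 512)))
print(f"{loss:.4f}")
```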
## Performance
Retrieval performance (R@k) on random test subsets:
| Subset Size | Motion→Text R@1 | Motion→Text R@5 | Text→Motion R@1 | Text→Motion R@5 |
|-------------|-----------------|-----------------|-----------------|-----------------|
| 1,000 | 36.2% | 67.8% | 36.4% | 68.1% |
| 5,000 | 17.7% | 42.1% | 17.8% | 42.3% |
| 10,000 | 12.4% | 31.5% | 12.5% | 31.6% |
*Note: Lower R@k on larger subsets is expected as the retrieval task becomes harder.*
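R@k measures the fraction of queries whose correct match appears among the top-k retrieved candidates. A minimal sketch of how such a metric is computed from a similarity matrix (illustrative, not the evaluation script):

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose true match (the diagonal entry)
    ranks within the top-k most similar candidates of its row."""
    ranks = (similarity > np.diag(similarity)[:, None]).sum(axis=1)
    return float((ranks < k).mean())

# Toy 3x3 similarity matrix with correct matches on the diagonal
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.4],
                [0.5, 0.7, 0.6]])
print(recall_at_k(sim, 1), recall_at_k(sim, 2))  # 0.6666666666666666 1.0
```

Note that with a fixed model, R@k necessarily drops as the candidate pool grows, which is why the table reports larger subsets with lower numbers.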
## Files in This Repository
| File | Size | Description |
|------|------|-------------|
| `config.json` | 239 B | Model configuration |
| `pytorch_model.bin` | 219 MB | Model weights |
| `mean.npy` | 1.2 KB | Motion normalization mean (272,) |
| `std.npy` | 1.2 KB | Motion normalization std (272,) |
## Limitations
- Trained on English text descriptions only
- Motion format is specific to HumanML3D-style 272-dim representation
- Best performance on motions similar to training distribution (daily activities, sports, etc.)
## Citation
```bibtex
@article{motionmillion2026,
title={MotionMillion: A Large-Scale Motion-Language Dataset},
author={...},
year={2026}
}
```
## License
**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)
This model is released for **research and non-commercial use only**.
### Why Non-Commercial?
The MotionMillion training dataset aggregates motion data from multiple sources with varying licenses:
- Some datasets permit commercial use
- Some datasets restrict commercial use (e.g., AMASS, BABEL, certain MoCap databases)
To comply with the most restrictive terms, this model is released under CC BY-NC 4.0.
### What This Means
**Allowed:**
- Academic research
- Personal projects
- Non-commercial applications
- Sharing and adapting with attribution
**Not Allowed:**
- Commercial products or services
- Selling access to the model
- Using in revenue-generating applications
For commercial licensing inquiries, please contact the authors.