---
license: cc-by-nc-4.0
tags:
- motion
- clip
- text-to-motion
- motion-retrieval
- multimodal
- human-motion
- motion-generation
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
datasets:
- MotionMillion
---

# MotionCLIP

A Motion-Text CLIP model trained on the MotionMillion dataset for motion-text retrieval, zero-shot motion classification, and motion understanding.

> ⚠️ **License Notice**: This model is released under **CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. **This model is for research and non-commercial use only.**

> 📋 **Body Model**: This model was trained on motion data using the **SMPL body model** (22 joints). Input motions must be in SMPL skeleton format.

## Model Description

MotionCLIP learns a joint embedding space between human motion sequences and natural language descriptions. Given a motion sequence (272-dimensional features per frame) and text descriptions, the model can:

- **Retrieve** the most relevant text for a motion (and vice versa)
- **Classify** motions in a zero-shot manner using text labels
- **Compute similarity** between motions and text descriptions

## Usage

### Installation

```bash
pip install torch transformers huggingface_hub numpy
```

### Download the Model Code

Download `motion_clip_hf.py` from this repository or copy it to your project.
### Quick Start

```python
from motion_clip_hf import MotionCLIP
import numpy as np

# Load model (auto-downloads from HuggingFace)
model = MotionCLIP.from_pretrained("khania/motion-clip")

# Encode text
text_emb = model.encode_text(["a person walks forward", "someone is running fast"])
print(f"Text embeddings: {text_emb.shape}")  # (2, 512)

# Encode motion (272-dim absolute root format, variable length)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion
motion_emb = model.encode_motion(motion)
print(f"Motion embedding: {motion_emb.shape}")  # (512,)

# Compute similarity
similarity = model.compute_similarity(motion, ["walking", "running", "jumping", "sitting"])
predicted = ["walking", "running", "jumping", "sitting"][similarity.argmax()]
print(f"Predicted action: {predicted}")
```

### Text-to-Motion Retrieval

```python
# Find the most similar motions for a text query
results = model.retrieve_motion(
    text="a person waves their hand",
    candidate_motions=[motion1, motion2, motion3],  # List of (T, 272) arrays
    top_k=3,
)
for r in results:
    print(f"#{r['rank']}: Motion {r['index']} (score: {r['score']:.4f})")
```

### Motion-to-Text Retrieval

```python
# Find the most similar texts for a motion
results = model.retrieve_text(
    motion=my_motion,  # (T, 272) array
    candidate_texts=["walking", "running", "jumping", "waving", "sitting"],
    top_k=3,
)
for r in results:
    print(f"#{r['rank']}: {r['text']} (score: {r['score']:.4f})")
```

### Zero-Shot Motion Classification

```python
# Define action categories
actions = ["walking", "running", "jumping", "sitting", "waving",
           "kicking", "punching", "dancing", "stretching", "bowing"]

# Classify a motion
similarity = model.compute_similarity(motion, actions)
predicted_action = actions[similarity.argmax()]
confidence = similarity.max()
print(f"Predicted: {predicted_action} (confidence: {confidence:.3f})")
```

## Model Architecture

| Component | Details |
|-----------|---------|
| **Motion Encoder** | 8-layer Transformer |
| **Hidden Dimension** | 768 |
| **Attention Heads** | 12 |
| **Text Encoder** | CLIP ViT-B/32 (fine-tuned) |
| **Embedding Dimension** | 512 |
| **Max Sequence Length** | 1024 frames |

## Motion Format

The model expects **272-dimensional motion features in absolute root format** based on the **SMPL body model** (22 joints).

### SMPL Body Model Requirement

This model was trained exclusively on motion data represented using the [SMPL body model](https://smpl.is.tue.mpg.de/). Your input motions must:

- Use the **SMPL skeleton** with 22 joints
- Follow the SMPL joint ordering
- Be converted to the 272-dimensional HumanML3D-style representation

If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.

### Feature Dimensions

| Dimensions | Description |
|------------|-------------|
| `[0:2]` | Root XZ velocities |
| `[2:8]` | Absolute heading rotation (6D representation) |
| `[8:74]` | Local joint positions (22 joints × 3) |
| `[74:140]` | Local joint velocities (22 joints × 3) |
| `[140:272]` | Joint rotations in 6D (22 joints × 6) |

The model automatically normalizes input motions using the bundled mean/std statistics.
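The feature layout in the table above can be expressed as a slicing helper. This is an illustrative sketch, not part of the released code; `split_motion_features` is a hypothetical name, and only the index boundaries come from the table.

```python
import numpy as np

# Hypothetical helper (not part of the released code) that splits a
# (T, 272) motion array into the named components from the table above.
def split_motion_features(motion: np.ndarray) -> dict:
    assert motion.ndim == 2 and motion.shape[1] == 272
    return {
        "root_xz_velocity":   motion[:, 0:2],      # root XZ velocities
        "heading_6d":         motion[:, 2:8],      # absolute heading rotation (6D)
        "joint_positions":    motion[:, 8:74],     # 22 joints x 3
        "joint_velocities":   motion[:, 74:140],   # 22 joints x 3
        "joint_rotations_6d": motion[:, 140:272],  # 22 joints x 6
    }

motion = np.random.randn(120, 272).astype(np.float32)
parts = split_motion_features(motion)
print(parts["joint_positions"].shape)  # (120, 66)
```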
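The model is trained with a symmetric InfoNCE objective (see Training Details below). The following is a minimal NumPy sketch of that objective under the standard CLIP-style formulation — paired batch, matching pairs on the diagonal, shared temperature — and is an assumption for illustration, not the actual training code.

```python
import numpy as np

def symmetric_info_nce(motion_emb, text_emb, temperature=0.07):
    """Illustrative symmetric InfoNCE over a batch of paired embeddings.

    motion_emb, text_emb: (B, D) arrays where row i of each is a matching pair.
    """
    # L2-normalize both modalities so logits are scaled cosine similarities
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = m @ t.T / temperature  # (B, B); diagonal = matching pairs
    n = len(logits)

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row,
        # with the diagonal entries as the targets
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the motion->text and text->motion directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In the actual model the temperature is a learnable parameter initialized at 0.07, whereas this sketch keeps it fixed for simplicity.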
## Training Details

| Parameter | Value |
|-----------|-------|
| **Dataset** | MotionMillion (~884K training motions) |
| **Batch Size** | 256 |
| **Training Iterations** | 100,000 |
| **Learning Rate (Motion Encoder)** | 1e-4 |
| **Learning Rate (Text Encoder)** | 5e-5 |
| **Loss Function** | Symmetric InfoNCE |
| **Temperature** | Learnable (initialized at 0.07) |

## Performance

Retrieval performance (R@k) on random test subsets:

| Subset Size | Motion→Text R@1 | Motion→Text R@5 | Text→Motion R@1 | Text→Motion R@5 |
|-------------|-----------------|-----------------|-----------------|-----------------|
| 1,000 | 36.2% | 67.8% | 36.4% | 68.1% |
| 5,000 | 17.7% | 42.1% | 17.8% | 42.3% |
| 10,000 | 12.4% | 31.5% | 12.5% | 31.6% |

*Note: Lower R@k on larger subsets is expected as the retrieval task becomes harder.*

## Files in This Repository

| File | Size | Description |
|------|------|-------------|
| `config.json` | 239 B | Model configuration |
| `pytorch_model.bin` | 219 MB | Model weights |
| `mean.npy` | 1.2 KB | Motion normalization mean (272,) |
| `std.npy` | 1.2 KB | Motion normalization std (272,) |

## Limitations

- Trained on English text descriptions only
- Motion format is specific to the HumanML3D-style 272-dim representation
- Best performance on motions similar to the training distribution (daily activities, sports, etc.)

## Citation

```bibtex
@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}
```

## License

**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)

This model is released for **research and non-commercial use only**.

### Why Non-Commercial?

The MotionMillion training dataset aggregates motion data from multiple sources with varying licenses:

- Some datasets permit commercial use
- Some datasets restrict commercial use (e.g., AMASS, BABEL, certain MoCap databases)

To comply with the most restrictive terms, this model is released under CC BY-NC 4.0.
### What This Means

✅ **Allowed:**

- Academic research
- Personal projects
- Non-commercial applications
- Sharing and adapting with attribution

❌ **Not Allowed:**

- Commercial products or services
- Selling access to the model
- Using the model in revenue-generating applications

For commercial licensing inquiries, please contact the authors.