---
license: cc-by-nc-4.0
tags:
- motion
- clip
- text-to-motion
- motion-retrieval
- multimodal
- human-motion
- motion-generation
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
datasets:
- MotionMillion
---

# MotionCLIP

A Motion-Text CLIP model trained on the MotionMillion dataset for motion-text retrieval, zero-shot motion classification, and motion understanding.

> ⚠️ **License Notice**: This model is released under **CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. **This model is for research and non-commercial use only.**

> 📋 **Body Model**: This model was trained on motion data using the **SMPL body model** (22 joints). Input motions must be in SMPL skeleton format.

## Model Description

MotionCLIP learns a joint embedding space between human motion sequences and natural language descriptions. Given a motion sequence (272-dimensional features per frame) and text descriptions, the model can:

- **Retrieve** the most relevant text for a motion (and vice versa)
- **Classify** motions in a zero-shot manner using text labels
- **Compute similarity** between motions and text descriptions

## Usage

### Installation

```bash
pip install torch transformers huggingface_hub numpy
```

### Download the Model Code

Download `motion_clip_hf.py` from this repository and place it in your project.

### Quick Start

```python
from motion_clip_hf import MotionCLIP
import numpy as np

# Load model (auto-downloads weights from the Hugging Face Hub)
model = MotionCLIP.from_pretrained("khania/motion-clip")

# Encode text
text_emb = model.encode_text(["a person walks forward", "someone is running fast"])
print(f"Text embeddings: {text_emb.shape}")  # (2, 512)

# Encode motion (272-dim absolute root format, variable length)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion
motion_emb = model.encode_motion(motion)
print(f"Motion embedding: {motion_emb.shape}")  # (512,)

# Compute similarity against a set of text labels
labels = ["walking", "running", "jumping", "sitting"]
similarity = model.compute_similarity(motion, labels)
predicted = labels[similarity.argmax()]
print(f"Predicted action: {predicted}")
```

### Text-to-Motion Retrieval

```python
# Find most similar motions for a text query
results = model.retrieve_motion(
    text="a person waves their hand",
    candidate_motions=[motion1, motion2, motion3],  # List of (T, 272) arrays
    top_k=3
)
for r in results:
    print(f"#{r['rank']}: Motion {r['index']} (score: {r['score']:.4f})")
```

### Motion-to-Text Retrieval

```python
# Find most similar texts for a motion
results = model.retrieve_text(
    motion=my_motion,  # (T, 272) array
    candidate_texts=["walking", "running", "jumping", "waving", "sitting"],
    top_k=3
)
for r in results:
    print(f"#{r['rank']}: {r['text']} (score: {r['score']:.4f})")
```

### Zero-Shot Motion Classification

```python
# Define action categories
actions = ["walking", "running", "jumping", "sitting", "waving",
           "kicking", "punching", "dancing", "stretching", "bowing"]

# Classify a motion
similarity = model.compute_similarity(motion, actions)
predicted_action = actions[similarity.argmax()]
confidence = similarity.max()
print(f"Predicted: {predicted_action} (confidence: {confidence:.3f})")
```

## Model Architecture

| Component | Details |
|-----------|---------|
| **Motion Encoder** | 8-layer Transformer |
| **Hidden Dimension** | 768 |
| **Attention Heads** | 12 |
| **Text Encoder** | CLIP ViT-B/32 (fine-tuned) |
| **Embedding Dimension** | 512 |
| **Max Sequence Length** | 1024 frames |

## Motion Format

The model expects **272-dimensional motion features in absolute root format** based on the **SMPL body model** (22 joints).

### SMPL Body Model Requirement

This model was trained exclusively on motion data represented using the [SMPL body model](https://smpl.is.tue.mpg.de/). Your input motions must:

- Use the **SMPL skeleton** with 22 joints
- Follow the SMPL joint ordering
- Be converted to the 272-dimensional HumanML3D-style representation

If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.

### Feature Dimensions

| Dimensions | Description |
|------------|-------------|
| `[0:2]` | Root XZ velocities |
| `[2:8]` | Absolute heading rotation (6D representation) |
| `[8:74]` | Local joint positions (22 joints × 3) |
| `[74:140]` | Local joint velocities (22 joints × 3) |
| `[140:272]` | Joint rotations in 6D (22 joints × 6) |

The model automatically normalizes input motions using the bundled mean/std statistics.

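As a concrete illustration, the layout above can be unpacked with plain NumPy slicing. The `mean`/`std` arrays below are zero/one stand-ins for the repository's bundled `mean.npy`/`std.npy` (the model applies this normalization for you; the snippet only shows what it amounts to):

```python
import numpy as np

# Stand-ins for the bundled statistics; in practice, load mean.npy / std.npy.
mean = np.zeros(272, dtype=np.float32)
std = np.ones(272, dtype=np.float32)

motion = np.random.randn(120, 272).astype(np.float32)  # (T, 272) dummy motion
normalized = (motion - mean) / std  # the normalization the model applies internally

root_vel = motion[:, 0:2]                            # root XZ velocities
heading = motion[:, 2:8]                             # absolute heading rotation (6D)
joint_pos = motion[:, 8:74].reshape(-1, 22, 3)       # local joint positions
joint_vel = motion[:, 74:140].reshape(-1, 22, 3)     # local joint velocities
joint_rot = motion[:, 140:272].reshape(-1, 22, 6)    # joint rotations (6D)

print(joint_pos.shape, joint_rot.shape)  # (120, 22, 3) (120, 22, 6)
```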
## Training Details

| Parameter | Value |
|-----------|-------|
| **Dataset** | MotionMillion (~884K training motions) |
| **Batch Size** | 256 |
| **Training Iterations** | 100,000 |
| **Learning Rate (Motion Encoder)** | 1e-4 |
| **Learning Rate (Text Encoder)** | 5e-5 |
| **Loss Function** | Symmetric InfoNCE |
| **Temperature** | Learnable (initialized at 0.07) |

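For reference, the symmetric InfoNCE objective can be sketched in a few lines of NumPy. This is an illustrative sketch, not the actual training code: a fixed temperature stands in for the learnable parameter, and the function name `symmetric_infonce` is our own:

```python
import numpy as np

def symmetric_infonce(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (motion, text) embedding pairs."""
    # L2-normalize so the dot product is cosine similarity
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (m @ t.T) / temperature  # (B, B); matched pairs lie on the diagonal

    def xent_diag(l):
        # cross-entropy with the diagonal as the target class, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the motion→text and text→motion directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Correctly matched batches should score a much lower loss than mismatched ones, which is what drives the two encoders into a shared embedding space.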
## Performance

Retrieval performance (R@k) on random test subsets:

| Subset Size | Motion→Text R@1 | Motion→Text R@5 | Text→Motion R@1 | Text→Motion R@5 |
|-------------|-----------------|-----------------|-----------------|-----------------|
| 1,000 | 36.2% | 67.8% | 36.4% | 68.1% |
| 5,000 | 17.7% | 42.1% | 17.8% | 42.3% |
| 10,000 | 12.4% | 31.5% | 12.5% | 31.6% |

*Note: Lower R@k on larger subsets is expected as the retrieval task becomes harder.*

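R@k is the fraction of queries whose ground-truth match appears among the top-k retrieved candidates. A minimal sketch of how it can be computed from a query-by-candidate similarity matrix, assuming matched pairs lie on the diagonal (`recall_at_k` is a hypothetical helper, not part of this repository):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose true match (the diagonal) is in the top-k."""
    order = np.argsort(-sim, axis=1)          # candidates by descending similarity
    truth = np.arange(sim.shape[0])[:, None]  # ground-truth index per query
    return float((order[:, :k] == truth).any(axis=1).mean())

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.5, 0.4, 0.3]])
print(recall_at_k(sim, 1))  # 2/3: query 2 ranks candidate 0 above its true match
```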
## Files in This Repository

| File | Size | Description |
|------|------|-------------|
| `config.json` | 239 B | Model configuration |
| `pytorch_model.bin` | 219 MB | Model weights |
| `mean.npy` | 1.2 KB | Motion normalization mean (272,) |
| `std.npy` | 1.2 KB | Motion normalization std (272,) |

## Limitations

- Trained on English text descriptions only
- Motion format is specific to the HumanML3D-style 272-dim representation
- Best performance on motions similar to the training distribution (daily activities, sports, etc.)

## Citation

```bibtex
@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}
```

## License

**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)

This model is released for **research and non-commercial use only**.

### Why Non-Commercial?

The MotionMillion training dataset aggregates motion data from multiple sources with varying licenses:

- Some datasets permit commercial use
- Some datasets restrict commercial use (e.g., AMASS, BABEL, certain MoCap databases)

To comply with the most restrictive terms, this model is released under CC BY-NC 4.0.

### What This Means

✅ **Allowed:**
- Academic research
- Personal projects
- Non-commercial applications
- Sharing and adapting with attribution

❌ **Not Allowed:**
- Commercial products or services
- Selling access to the model
- Using the model in revenue-generating applications

For commercial licensing inquiries, please contact the authors.