---
license: apache-2.0
tags:
  - robotics
  - humanoid
  - vision-language-action
  - vlam
  - diffusion-transformer
  - pose-estimation
datasets:
  - maxsegan/movenet-332
language:
  - en
---

# MIMIC: Motion Imitation from Massive Internet Clips

A 4.0B-parameter vision-language-action model for full-body humanoid control,
trained entirely from internet-scale human video.

## Model Details

- **Architecture**: Qwen3-VL-4B (early exit at layer 18) + 24L/1536D DiT action head
- **Parameters**: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
- **Action space**: 22-DoF joint angles at 10Hz
- **Action horizon**: 16 steps (1.6s)
- **Training data**: [MoveNet-332](https://huggingface.co/datasets/maxsegan/movenet-332) (~332K clips, ~4.7M samples from Kinetics-700)
- **Training compute**: 4x RTX Pro Blackwell GPUs (~576 GPU-hours)
- **Checkpoint step**: 107250
- **Best validation loss**: 0.1084
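
The per-component sizes above can be checked against the ~4.0B total with a quick sum (component sizes taken from this card; figures are approximate):

```python
# Sanity check of the parameter budget listed above.
components = {
    "truncated_llm": 2.20e9,     # Qwen3-VL-4B, early exit at layer 18
    "vision_encoder": 0.415e9,   # SigLIP (frozen)
    "dit_action_head": 1.28e9,   # 24L/1536D DiT
    "lora_adapters": 0.132e9,    # rank-128 LoRA on the LLM backbone
}
total = sum(components.values())
print(f"total ~ {total / 1e9:.2f}B")  # ~4.03B, matching the ~4.0B figure
```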

## Usage

```python
import torch
import yaml

from training.vla_model import QwenVLAModel

# Build the model from the training config, then load the checkpoint weights.
config = yaml.safe_load(open("training/config_kinetics.yaml"))
model = QwenVLAModel(**config["model_config"])

ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval().cuda()
```
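
At inference time the model emits an action chunk of shape `(16, 22)`: 16 future steps of 22-DoF joint angles, replayed at 10 Hz (1.6 s). A minimal control-loop sketch, with `send_to_robot` as a hypothetical placeholder for your robot's joint-command interface (the real inference entry point lives in the GitHub repo):

```python
import torch

HORIZON, DOF, HZ = 16, 22, 10  # action horizon, joint count, control rate


def send_to_robot(joint_angles: torch.Tensor) -> None:
    """Stub: replace with your robot's joint-command API."""
    pass


def run_chunk(actions: torch.Tensor) -> None:
    """Replay one predicted action chunk, one (22,) target per control tick."""
    assert actions.shape == (HORIZON, DOF)
    for joint_angles in actions:
        send_to_robot(joint_angles)
        # time.sleep(1 / HZ) on real hardware to hold the 10 Hz rate
```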

See the [GitHub repo](https://github.com/maxsegan/movement) for full inference and training code.

## Training

Trained with a flow-matching loss on the MoveNet-332 dataset. The SigLIP vision
encoder is frozen throughout; the LLM backbone is adapted with LoRA (rank 128);
the DiT action head is trained from scratch.
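
A minimal sketch of the standard flow-matching objective, assuming a linear noise-to-data interpolation path and a velocity-predicting DiT (the exact schedule and weighting in the actual training code may differ):

```python
import torch
import torch.nn.functional as F


def flow_matching_loss(dit, actions, cond):
    """Flow-matching loss for action-chunk generation (illustrative).

    actions: (B, 16, 22) ground-truth action chunks
    cond:    conditioning features from the VLM backbone
    dit:     model that predicts velocity given (x_t, t, cond)
    """
    noise = torch.randn_like(actions)            # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)       # t ~ U(0, 1), one per sample
    x_t = (1 - t) * noise + t * actions          # linear interpolation path
    target_velocity = actions - noise            # d x_t / d t along the path
    pred_velocity = dit(x_t, t.flatten(), cond)  # DiT predicts the velocity
    return F.mse_loss(pred_velocity, target_velocity)
```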

## Citation

Paper forthcoming.

## License

Apache 2.0