# MIMIC: Motion Imitation from Massive Internet Clips
A 4.0B-parameter vision-language-action model for full-body humanoid control, trained entirely from internet-scale human video.
## Model Details
- Architecture: Qwen3-VL-4B (early exit at layer 18) + 24L/1536D DiT action head
- Parameters: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
- Action space: 22-DoF joint angles at 10 Hz
- Action horizon: 16 steps (1.6 s)
- Training data: MoveNet-332 (~332K clips, ~4.7M samples from Kinetics-700)
- Training compute: 4x RTX Pro Blackwell GPUs (~576 GPU-hours)
- Checkpoint step: 63000
- Best validation loss: 0.1118
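The headline parameter count follows from the per-component figures listed above; a quick sanity check:

```python
# Sum the per-component parameter counts from the list above.
# (Figures as stated on this card; the small gap vs. the ~4.0B
# headline is rounding in the individual entries.)
components = {
    "truncated LLM backbone": 2.2e9,
    "vision encoder": 415e6,
    "DiT action head": 1.28e9,
    "LoRA adapters": 132e6,
}
total = sum(components.values())
print(f"total ≈ {total / 1e9:.2f}B")  # ≈ 4.03B
```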
## Usage
```python
from training.vla_model import QwenVLAModel
import torch, yaml

# Build the model from the training config, then restore the checkpoint.
config = yaml.safe_load(open("training/config_kinetics.yaml"))
model = QwenVLAModel(**config["model_config"])

ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval().cuda()
```
See the GitHub repo for full inference and training code.
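Given the 16-step action horizon at 10 Hz, deployment typically takes the form of a receding-horizon loop: predict a chunk, execute part of it, re-plan. A minimal sketch, where `predict_chunk` and the half-chunk re-plan interval are placeholders for the real model call and deployment policy (neither is specified on this card):

```python
# Hypothetical receding-horizon control loop for a chunked action policy.
# predict_chunk stands in for the actual VLA forward pass; the real
# inference API lives in the GitHub repo.
import time

ACTION_DIM = 22    # joint angles
CHUNK_LEN = 16     # action horizon: 16 steps = 1.6 s at 10 Hz
CONTROL_HZ = 10
REPLAN_EVERY = 8   # execute half the chunk before re-planning (assumption)

def predict_chunk(observation):
    """Stand-in for the model call; returns a CHUNK_LEN x ACTION_DIM chunk."""
    return [[0.0] * ACTION_DIM for _ in range(CHUNK_LEN)]

def control_loop(get_observation, send_action, n_steps):
    chunk, cursor = [], 0
    for _ in range(n_steps):
        if not chunk or cursor >= REPLAN_EVERY:
            chunk, cursor = predict_chunk(get_observation()), 0
        send_action(chunk[cursor])
        cursor += 1
        time.sleep(1.0 / CONTROL_HZ)  # hold the 10 Hz control rate
```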
## Training
Trained with a flow matching loss on the MoveNet-332 dataset. The vision encoder (SigLIP) is frozen throughout; the LLM backbone is fine-tuned with LoRA (rank 128); the DiT action head is trained from scratch.
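For reference, a minimal sketch of a flow matching training step for the action head, assuming the standard linear-interpolation (rectified-flow) formulation; the repo's exact parameterization and conditioning interface may differ:

```python
# Sketch of a flow matching loss over action chunks of shape (B, 16, 22).
# `action_head` is a hypothetical callable (x_t, t, cond) -> velocity.
import torch

def flow_matching_loss(action_head, actions, cond):
    noise = torch.randn_like(actions)        # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)   # per-sample time in [0, 1]
    x_t = (1 - t) * noise + t * actions      # linear interpolation path
    target_velocity = actions - noise        # d x_t / d t along the path
    pred_velocity = action_head(x_t, t.squeeze(), cond)
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)
```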
## Citation
Paper forthcoming.
## License
Apache 2.0