MIMIC: Motion Imitation from Massive Internet Clips

A 4.0B-parameter vision-language-action model for full-body humanoid control, trained entirely from internet-scale human video.

Model Details

  • Architecture: Qwen3-VL-4B (early exit at layer 18) + 24L/1536D DiT action head
  • Parameters: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
  • Action space: 22-DoF joint angles at 10Hz
  • Action horizon: 16 steps (1.6s)
  • Training data: MoveNet-332 (~332K clips, ~4.7M samples from Kinetics-700)
  • Training compute: 4x RTX Pro Blackwell GPUs (~576 GPU-hours)
  • Checkpoint step: 63000
  • Best validation loss: 0.1118
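
The action-space numbers above fit together as a quick sanity check: a 16-step chunk at 10 Hz spans 1.6 seconds, and each chunk is a (16, 22) array of joint angles. A minimal sketch (constant names are illustrative, not from the repo):

```python
# Sanity-check the action-chunk arithmetic from the model details above.
CONTROL_HZ = 10       # action rate (actions per second)
HORIZON_STEPS = 16    # actions per predicted chunk
NUM_DOFS = 22         # joint angles per action

chunk_duration_s = HORIZON_STEPS / CONTROL_HZ   # 16 / 10 = 1.6 seconds
chunk_shape = (HORIZON_STEPS, NUM_DOFS)         # one predicted chunk

print(chunk_duration_s)  # 1.6
print(chunk_shape)       # (16, 22)
```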

Usage

import torch
import yaml

from training.vla_model import QwenVLAModel

# Build the model from the training config.
with open("training/config_kinetics.yaml") as f:
    config = yaml.safe_load(f)
model = QwenVLAModel(**config["model_config"])

# Load checkpoint weights (weights_only=False: the checkpoint stores
# optimizer and training state alongside the weights).
ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval().cuda()

See the GitHub repo for full inference and training code.
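
The card does not show the forward-pass API, but a model with a 16-step action horizon is typically run in a receding-horizon loop: predict a chunk, execute a prefix of it, then replan. The sketch below assumes a hypothetical `predict_chunk` returning a (16, 22) array; the replan interval is an assumed choice, not from the repo:

```python
import numpy as np

HORIZON, DOFS, CONTROL_HZ = 16, 22, 10
REPLAN_EVERY = 8  # execute half the chunk, then replan (assumed choice)

def predict_chunk(observation):
    """Stand-in for the model's forward pass (hypothetical API): real
    inference would run the VLM on the camera frame and sample the DiT
    action head. Returns a (16, 22) chunk of joint angles at 10 Hz."""
    return np.zeros((HORIZON, DOFS), dtype=np.float32)

def control_loop(get_observation, send_action, num_replans=3):
    for _ in range(num_replans):
        chunk = predict_chunk(get_observation())
        # Receding horizon: execute only the first REPLAN_EVERY actions,
        # then re-predict so the policy can react to new observations.
        for action in chunk[:REPLAN_EVERY]:
            send_action(action)  # a real loop would sleep 1 / CONTROL_HZ here

executed = []
control_loop(get_observation=lambda: None, send_action=executed.append)
print(len(executed))  # 24 actions = 3 replans x 8 steps each
```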

Training

Trained with a flow-matching loss on the MoveNet-332 dataset. The SigLIP vision encoder is frozen throughout; the LLM backbone is adapted with LoRA (rank 128); the DiT action head is trained from scratch.
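
For reference, a flow-matching objective in its common rectified-flow form interpolates linearly between Gaussian noise and the target action chunk and regresses the DiT's output onto the constant velocity between them. The exact formulation in the repo may differ; this is a minimal sketch with a dummy action head:

```python
import torch

def flow_matching_loss(dit, actions, cond):
    """One flow-matching training step (rectified-flow style sketch).
    `dit` stands in for the action head: it takes a noisy action chunk,
    a timestep t in [0, 1], and conditioning features, and predicts a
    velocity field with the same shape as the chunk."""
    noise = torch.randn_like(actions)             # x0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)        # per-sample time in [0, 1]
    x_t = (1 - t) * noise + t * actions           # linear interpolant
    target_v = actions - noise                    # constant target velocity
    pred_v = dit(x_t, t, cond)
    return torch.mean((pred_v - target_v) ** 2)   # MSE on the velocity

# Toy check with a dummy head that predicts zero velocity everywhere.
dummy_dit = lambda x_t, t, cond: torch.zeros_like(x_t)
actions = torch.randn(4, 16, 22)  # batch of 16-step, 22-DoF action chunks
loss = flow_matching_loss(dummy_dit, actions, cond=None)
print(loss.item() >= 0)  # True
```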

Citation

Paper forthcoming.

License

Apache 2.0
