---
license: apache-2.0
tags:
- robotics
- humanoid
- vision-language-action
- vlam
- diffusion-transformer
- pose-estimation
datasets:
- maxsegan/movenet-332
language:
- en
---

# MIMIC: Motion Imitation from Massive Internet Clips

A 4.0B-parameter vision-language-action model for full-body humanoid control, trained entirely from internet-scale human video.

## Model Details

- **Architecture**: Qwen3-VL-4B (early exit at layer 18) + 24-layer / 1536-dim DiT action head
- **Parameters**: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
- **Action space**: 22-DoF joint angles at 10 Hz
- **Action horizon**: 16 steps (1.6 s)
- **Training data**: [MoveNet-332](https://huggingface.co/datasets/maxsegan/movenet-332) (~332K clips, ~4.7M samples from Kinetics-700)
- **Training compute**: 4x RTX Pro Blackwell GPUs (~576 GPU-hours)
- **Checkpoint step**: 107,250
- **Best validation loss**: 0.1084

## Usage

```python
import torch
import yaml

from training.vla_model import QwenVLAModel

# Load the model config and build the architecture.
config = yaml.safe_load(open("training/config_kinetics.yaml"))
model = QwenVLAModel(**config["model_config"])

# Load the checkpoint weights (trusted file, hence weights_only=False).
ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval().cuda()
```

See the [GitHub repo](https://github.com/maxsegan/movement) for full inference and training code.

## Training

Trained with a flow-matching loss on the MoveNet-332 dataset. The vision encoder (SigLIP) is frozen throughout; the LLM backbone is adapted with LoRA (rank 128). The DiT action head is trained from scratch.

## Citation

Paper forthcoming.

## License

Apache 2.0
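
## Appendix: sampling from a flow-matching action head

For readers unfamiliar with the flow-matching objective mentioned above: at inference time, an action chunk is typically produced by drawing Gaussian noise and Euler-integrating a learned velocity field `v(x, t)` from `t = 0` to `t = 1`. The sketch below is a toy illustration only, not the repo's inference code; the `velocity` field here is an analytic stand-in for what the DiT head would predict from the VLM's features, and all names are illustrative.

```python
import numpy as np

ACTION_HORIZON = 16  # steps per chunk (1.6 s at 10 Hz, per the model card)
ACTION_DIM = 22      # DoF, per the model card

def sample_actions(velocity, num_steps=10, rng=None):
    """Euler-integrate the flow from Gaussian noise to an action chunk."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal((ACTION_HORIZON, ACTION_DIM))
    for k in range(num_steps):
        t = k / num_steps
        x = x + (1.0 / num_steps) * velocity(x, t)
    return x

# Toy analytic velocity field that transports any x toward a fixed
# target chunk; a trained model would condition this on vision/language.
target = np.zeros((ACTION_HORIZON, ACTION_DIM))
velocity = lambda x, t: (target - x) / (1.0 - t)

actions = sample_actions(velocity)
print(np.abs(actions - target).max())  # ~0 up to float error
```

With the linear-interpolation probability path used by flow matching, this toy field drives every sample exactly onto the target by the final Euler step, which is why the integration loop recovers it regardless of the noise draw.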