---
license: apache-2.0
tags:
- robotics
- humanoid
- vision-language-action
- vlam
- diffusion-transformer
- pose-estimation
datasets:
- maxsegan/movenet-332
language:
- en
---

# MIMIC: Motion Imitation from Massive Internet Clips

A 4.0B-parameter vision-language-action model for full-body humanoid control, trained entirely from internet-scale human video.

## Model Details

- **Architecture**: Qwen3-VL-4B (early exit at layer 18) + 24-layer/1536-dim DiT action head
- **Parameters**: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
- **Action space**: 22-DoF joint angles at 10 Hz
- **Action horizon**: 16 steps (1.6 s)
- **Training data**: [MoveNet-332](https://huggingface.co/datasets/maxsegan/movenet-332) (~332K clips, ~4.7M samples from Kinetics-700)
- **Training compute**: 4x RTX Pro Blackwell GPUs (~576 GPU-hours)
- **Checkpoint step**: 107,250
- **Best validation loss**: 0.1084

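To make the data flow concrete, here is a toy sketch of the conditioning path: prefix features from the truncated VLM condition a transformer that predicts a velocity for a noised action chunk. The module, the prefix-concatenation conditioning, and the 2-layer depth are illustrative assumptions for brevity, not the repo's actual code (the real head is a 24-layer, 1536-dim DiT).

```python
import torch
import torch.nn as nn

D_MODEL = 1536      # DiT width from the model card
ACTION_DIM = 22     # 22-DoF joint angles
HORIZON = 16        # 1.6 s at 10 Hz

class ToyDiTHead(nn.Module):
    """Stand-in for the 24L/1536D DiT action head (hypothetical module)."""
    def __init__(self, n_layers=2):  # 2 layers here; the real head has 24
        super().__init__()
        self.in_proj = nn.Linear(ACTION_DIM, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(D_MODEL, ACTION_DIM)

    def forward(self, noisy_actions, vlm_features):
        # Condition on VLM features by prepending them to the action tokens.
        x = torch.cat([vlm_features, self.in_proj(noisy_actions)], dim=1)
        x = self.blocks(x)
        # Keep only the action positions; interpret output as a velocity field.
        return self.out_proj(x[:, -HORIZON:])

head = ToyDiTHead()
vlm_feats = torch.randn(1, 8, D_MODEL)       # 8 prefix tokens from the truncated LLM
noisy = torch.randn(1, HORIZON, ACTION_DIM)  # noised action chunk
v = head(noisy, vlm_feats)
print(v.shape)  # torch.Size([1, 16, 22])
```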
## Usage

```python
from training.vla_model import QwenVLAModel

import torch
import yaml

# Build the model from the training config.
config = yaml.safe_load(open("training/config_kinetics.yaml"))
model = QwenVLAModel(**config["model_config"])

# Load checkpoint weights. weights_only=False is required because the
# checkpoint contains pickled objects, so only load trusted files.
ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval().cuda()
```

See the [GitHub repo](https://github.com/maxsegan/movement) for full inference and training code.

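Since the model is trained with flow matching, inference amounts to integrating a learned velocity field from noise to an action chunk. The sketch below shows a generic Euler sampler under that assumption; the function name, signature, and step count are hypothetical, and the repo's actual sampler may differ.

```python
import torch

@torch.no_grad()
def sample_actions(velocity_fn, obs, horizon=16, action_dim=22, n_steps=10):
    """Euler-integrate dx/dt = v(x, t, obs) from t=0 (noise) to t=1 (actions)."""
    x = torch.randn(1, horizon, action_dim)  # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1,), i * dt)
        x = x + dt * velocity_fn(x, t, obs)  # one Euler step along the flow
    return x

# Toy velocity field standing in for the DiT head.
actions = sample_actions(lambda x, t, obs: -x, obs=None)
print(actions.shape)  # torch.Size([1, 16, 22])
```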
## Training

Trained with a flow matching loss on the MoveNet-332 dataset. The vision encoder (SigLIP) is frozen throughout; the LLM backbone uses LoRA (rank 128). The DiT action head is trained from scratch.

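For reference, the standard flow matching objective regresses the model's predicted velocity onto the straight-line target between noise and data. This is a minimal illustrative sketch, not the repo's training code; `model` stands for any velocity predictor over action chunks.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, actions, cond):
    """MSE between predicted velocity and the linear-path target x1 - x0."""
    x1 = actions                       # ground-truth action chunk (B, T, D)
    x0 = torch.randn_like(x1)          # Gaussian noise sample
    t = torch.rand(x1.shape[0], 1, 1)  # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1         # point on the straight-line path
    target = x1 - x0                   # constant velocity along that path
    pred = model(xt, t.squeeze(), cond)
    return F.mse_loss(pred, target)

# Stand-in velocity model for demonstration.
toy = lambda xt, t, cond: xt
loss = flow_matching_loss(toy, torch.randn(4, 16, 22), cond=None)
print(loss.item() >= 0)  # True
```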
## Citation

Paper forthcoming.

## License

Apache 2.0