---
license: apache-2.0
tags:
  - robotics
  - humanoid
  - vision-language-action
  - vlam
  - diffusion-transformer
  - pose-estimation
datasets:
  - maxsegan/movenet-332
language:
  - en
---

# MIMIC: Motion Imitation from Massive Internet Clips

A 4.0B-parameter vision-language-action model for full-body humanoid control, trained entirely from internet-scale human video.

## Model Details

- **Architecture:** Qwen3-VL-4B (early exit at layer 18) + 24L/1536D DiT action head
- **Parameters:** ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
- **Action space:** 22-DoF joint angles at 10 Hz
- **Action horizon:** 16 steps (1.6 s)
- **Training data:** MoveNet-332 (~332K clips, ~4.7M samples from Kinetics-700)
- **Training compute:** 4x RTX Pro Blackwell GPUs (~576 GPU-hours)
- **Checkpoint step:** 63000
- **Best validation loss:** ~0.1118
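The action space and horizon above imply that each predicted action chunk is a 16×22 array covering 1.6 s of motion at the 10 Hz control rate. A minimal sketch of that geometry (the constants come from this card; the names are illustrative, not from the repo):

```python
import numpy as np

# Action-chunk geometry implied by the model card.
DOF = 22          # joint angles per timestep
HORIZON = 16      # action steps per predicted chunk
CONTROL_HZ = 10   # control frequency in Hz

chunk = np.zeros((HORIZON, DOF), dtype=np.float32)  # one predicted action chunk
duration_s = HORIZON / CONTROL_HZ                   # seconds covered by a chunk

print(chunk.shape)   # (16, 22)
print(duration_s)    # 1.6
```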

## Usage

```python
import torch
import yaml

from training.vla_model import QwenVLAModel

# Build the model from the training config.
with open("training/config_kinetics.yaml") as f:
    config = yaml.safe_load(f)
model = QwenVLAModel(**config["model_config"])

# Load the checkpoint weights and move to GPU for inference.
ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval().cuda()
```

See the GitHub repo for full inference and training code.
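Since the DiT head is trained with flow matching (see Training below), inference amounts to integrating a learned velocity field from noise toward an action chunk. A self-contained sketch of that Euler integration, with a dummy velocity function standing in for the real DiT head (whose actual call signature lives in the repo):

```python
import numpy as np

def dummy_velocity(x_t, t):
    """Stand-in for the DiT action head: predicts the velocity carrying
    x_t toward the data distribution. Here it just pulls x_t toward zero,
    purely for illustration."""
    return -x_t

def sample_actions(velocity_fn, steps=10, horizon=16, dof=22, seed=0):
    """Euler-integrate the velocity field from t=0 (noise) to t=1 (actions)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, dof)).astype(np.float32)  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x

actions = sample_actions(dummy_velocity)
print(actions.shape)  # (16, 22)
```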

## Training

Trained with a flow-matching loss on the MoveNet-332 dataset. The vision encoder (SigLIP) is frozen throughout; the LLM backbone is adapted with LoRA (rank 128); the DiT action head is trained from scratch.
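A generic rectified-flow-style sketch of that objective (not the repo's exact code): sample a time t, linearly interpolate between noise and the ground-truth action chunk, and regress the model's predicted velocity onto the constant target velocity along the path.

```python
import numpy as np

def flow_matching_loss(predict_velocity, x1, rng):
    """Flow-matching loss for one sample.

    x1: ground-truth action chunk, shape (horizon, dof).
    predict_velocity(x_t, t): the model (here a stand-in for the DiT head).
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    t = rng.uniform()                    # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1        # linear interpolation between endpoints
    target = x1 - x0                     # constant velocity along the path
    pred = predict_velocity(x_t, t)
    return np.mean((pred - target) ** 2)  # MSE regression on the velocity

rng = np.random.default_rng(0)
x1 = rng.standard_normal((16, 22))       # dummy ground-truth action chunk
# A zero predictor scores the mean squared norm of the target velocity;
# a perfect oracle that knew x0 would reach zero loss.
loss = flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), x1, rng)
print(loss > 0.0)  # True
```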

## Citation

Paper forthcoming.

## License

Apache 2.0