---
license: apache-2.0
tags:
- robotics
- humanoid
- vision-language-action
- vlam
- diffusion-transformer
- pose-estimation
datasets:
- maxsegan/movenet-332
language:
- en
---
# MIMIC: Motion Imitation from Massive Internet Clips
A 4.0B-parameter vision-language-action model for full-body humanoid control,
trained entirely from internet-scale human video.
## Model Details
- **Architecture**: Qwen3-VL-4B (early exit at layer 18) + 24L/1536D DiT action head
- **Parameters**: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
- **Action space**: 22-DoF joint angles at 10Hz
- **Action horizon**: 16 steps (1.6s)
- **Training data**: [MoveNet-332](https://huggingface.co/datasets/maxsegan/movenet-332) (~332K clips, ~4.7M samples from Kinetics-700)
- **Training compute**: 4x RTX Pro Blackwell GPUs (~576 GPU-hours)
- **Checkpoint step**: 107250
- **Best validation loss**: ≈0.1084
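For reference, an action chunk under the configuration above is a `(horizon, dof)` array: 16 steps of 22 joint angles, covering 1.6 s at 10 Hz. A minimal sketch of the shape conventions (illustrative only, not taken from the repo):

```python
import numpy as np

ACTION_DOF = 22   # joint angles per step
HORIZON = 16      # action steps per predicted chunk
CONTROL_HZ = 10   # control frequency

# One predicted action chunk: 16 timesteps x 22 joint angles.
chunk = np.zeros((HORIZON, ACTION_DOF), dtype=np.float32)

# The chunk spans HORIZON / CONTROL_HZ seconds of motion.
duration_s = HORIZON / CONTROL_HZ  # 1.6 s
```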
## Usage
```python
import torch
import yaml

from training.vla_model import QwenVLAModel

# Build the model from the training config, then load checkpoint weights.
config = yaml.safe_load(open("training/config_kinetics.yaml"))
model = QwenVLAModel(**config["model_config"])

ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval().cuda()
```
See the [GitHub repo](https://github.com/maxsegan/movement) for full inference and training code.
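At inference time, a model with a 1.6 s action horizon is typically run in a receding-horizon loop: predict a chunk, execute a prefix, re-plan. The sketch below illustrates that pattern only; `model.sample_actions` and the `env` interface are hypothetical stand-ins, not the repo's actual API.

```python
import torch

@torch.no_grad()
def rollout(model, env, instruction, n_chunks=10, exec_steps=8):
    """Receding-horizon control loop (illustrative; see the GitHub repo
    for the real inference code)."""
    obs = env.reset()
    for _ in range(n_chunks):
        # Predict a (16, 22) chunk of joint angles from image + language.
        actions = model.sample_actions(obs["image"], instruction)
        # Execute only the first `exec_steps` actions at 10 Hz, then re-plan.
        for action in actions[:exec_steps]:
            obs = env.step(action)
    return obs
```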
## Training
Trained with a flow-matching loss on the MoveNet-332 dataset. The SigLIP vision
encoder is frozen throughout; the LLM backbone is adapted with LoRA (rank 128),
and the DiT action head is trained from scratch.
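The objective can be sketched with the standard linear-interpolant formulation of flow matching: sample noise `x0`, a clean action chunk `x1`, and a time `t`, then regress the network's predicted velocity toward the constant target `x1 - x0`. This is a generic sketch, not code from the repo; `velocity_net` stands in for the DiT action head.

```python
import torch

def flow_matching_loss(velocity_net, actions, cond):
    """Linear-path flow matching: regress the velocity (x1 - x0)
    along the interpolant x_t = (1 - t) * x0 + t * x1."""
    x1 = actions                          # clean action chunk, e.g. (B, 16, 22)
    x0 = torch.randn_like(x1)             # noise sample
    t = torch.rand(x1.shape[0], 1, 1)     # per-sample time in [0, 1)
    x_t = (1 - t) * x0 + t * x1           # point on the linear path
    target_v = x1 - x0                    # constant target velocity
    pred_v = velocity_net(x_t, t, cond)   # network predicts velocity
    return torch.nn.functional.mse_loss(pred_v, target_v)
```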
## Citation
Paper forthcoming.
## License
Apache 2.0