---
license: apache-2.0
tags:
- robotics
- humanoid
- vision-language-action
- vlam
- diffusion-transformer
- pose-estimation
datasets:
- maxsegan/movenet-332
language:
- en
---

# MIMIC: Motion Imitation from Massive Internet Clips

A 4.0B-parameter vision-language-action model for full-body humanoid control, trained entirely from internet-scale human video.

## Model Details

- **Architecture**: Qwen3-VL-4B (early exit at layer 18) + 24-layer/1536-dim DiT action head
- **Parameters**: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
- **Action space**: 22-DoF joint angles at 10 Hz
- **Action horizon**: 16 steps (1.6 s)
- **Training data**: [MoveNet-332](https://huggingface.co/datasets/maxsegan/movenet-332) (~332K clips, ~4.7M samples from Kinetics-700)
- **Training compute**: 4x RTX Pro Blackwell GPUs (~576 GPU-hours)
- **Checkpoint step**: 107,250
- **Best validation loss**: 0.1084

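To make the data flow concrete, here is a toy sketch of the conditioning path: prefix features from the truncated VLM condition a transformer that predicts a velocity for a noised action chunk. The module, the prefix-concatenation conditioning, and the 2-layer depth are illustrative assumptions for brevity, not the repo's actual code (the real head is a 24-layer, 1536-dim DiT).

```python
import torch
import torch.nn as nn

D_MODEL = 1536      # DiT width from the model card
ACTION_DIM = 22     # 22-DoF joint angles
HORIZON = 16        # 1.6 s at 10 Hz

class ToyDiTHead(nn.Module):
    """Stand-in for the 24L/1536D DiT action head (hypothetical module)."""
    def __init__(self, n_layers=2):  # 2 layers here; the real head has 24
        super().__init__()
        self.in_proj = nn.Linear(ACTION_DIM, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(D_MODEL, ACTION_DIM)

    def forward(self, noisy_actions, vlm_features):
        # Condition on VLM features by prepending them to the action tokens.
        x = torch.cat([vlm_features, self.in_proj(noisy_actions)], dim=1)
        x = self.blocks(x)
        # Keep only the action positions; interpret output as a velocity field.
        return self.out_proj(x[:, -HORIZON:])

head = ToyDiTHead()
vlm_feats = torch.randn(1, 8, D_MODEL)       # 8 prefix tokens from the truncated LLM
noisy = torch.randn(1, HORIZON, ACTION_DIM)  # noised action chunk
v = head(noisy, vlm_feats)
print(v.shape)  # torch.Size([1, 16, 22])
```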
## Usage

```python
from training.vla_model import QwenVLAModel

import torch
import yaml

# Build the model from the training config.
config = yaml.safe_load(open("training/config_kinetics.yaml"))
model = QwenVLAModel(**config["model_config"])

# Load checkpoint weights. weights_only=False is required because the
# checkpoint contains pickled objects, so only load trusted files.
ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval().cuda()
```

See the [GitHub repo](https://github.com/maxsegan/movement) for full inference and training code.

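Since the model is trained with flow matching, inference amounts to integrating a learned velocity field from noise to an action chunk. The sketch below shows a generic Euler sampler under that assumption; the function name, signature, and step count are hypothetical, and the repo's actual sampler may differ.

```python
import torch

@torch.no_grad()
def sample_actions(velocity_fn, obs, horizon=16, action_dim=22, n_steps=10):
    """Euler-integrate dx/dt = v(x, t, obs) from t=0 (noise) to t=1 (actions)."""
    x = torch.randn(1, horizon, action_dim)  # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1,), i * dt)
        x = x + dt * velocity_fn(x, t, obs)  # one Euler step along the flow
    return x

# Toy velocity field standing in for the DiT head.
actions = sample_actions(lambda x, t, obs: -x, obs=None)
print(actions.shape)  # torch.Size([1, 16, 22])
```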
## Training

Trained with a flow matching loss on the MoveNet-332 dataset. The vision encoder (SigLIP) is frozen throughout; the LLM backbone uses LoRA (rank 128). The DiT action head is trained from scratch.

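For reference, the standard flow matching objective regresses the model's predicted velocity onto the straight-line target between noise and data. This is a minimal illustrative sketch, not the repo's training code; `model` stands for any velocity predictor over action chunks.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, actions, cond):
    """MSE between predicted velocity and the linear-path target x1 - x0."""
    x1 = actions                       # ground-truth action chunk (B, T, D)
    x0 = torch.randn_like(x1)          # Gaussian noise sample
    t = torch.rand(x1.shape[0], 1, 1)  # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1         # point on the straight-line path
    target = x1 - x0                   # constant velocity along that path
    pred = model(xt, t.squeeze(), cond)
    return F.mse_loss(pred, target)

# Stand-in velocity model for demonstration.
toy = lambda xt, t, cond: xt
loss = flow_matching_loss(toy, torch.randn(4, 16, 22), cond=None)
print(loss.item() >= 0)  # True
```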
## Citation

Paper forthcoming.

## License

Apache 2.0