	---
	license: mit
	library_name: pytorch
	pipeline_tag: robotics
	tags:
	- robotics
	- vision-language-action
	- vla
	- latent-action-model
	- tokenizer
	- manipulation
	---

	# SemanticVLA · Latent Action Model (LAM)

	> 🎉 Accepted to [CVPR 2026](https://cvpr.thecvf.com/virtual/2026/poster/39352).<br>
	> ✍️ Fei Ni¹, Zhuo Chen², Yifu Yuan³, Zibin Dong³, Xianze Yao³, Shan Luo², Jianye Hao³, Jiankang Deng¹†, Stefanos Zafeiriou¹†<br>
	> 🏫 ¹Imperial College London    ²King's College London    ³Tianjin University<br>
	> ✉️ Primary contact: [f.ni@imperial.ac.uk](mailto:f.ni@imperial.ac.uk)

	Trace-conditioned Latent Action Model (LAM) checkpoint for [SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial). The LAM is a small VQ codebook trained jointly with a trace encoder and a frozen DINOv2 visual encoder. During downstream VLA training, the VLM predicts the LAM's discrete tokens through an auxiliary semantic head.
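
For readers new to vector quantization, the lookup at the core of a VQ codebook can be sketched as below. This is a generic nearest-neighbour quantizer in plain Python, not the released `TraceLatentActionModel` code. The function name `quantize`, the toy 2-D codebook, and the Euclidean distance metric are assumptions made for illustration.

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        # Squared Euclidean distance from z to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        # The discrete token is the index of the closest entry.
        tokens.append(min(range(len(codebook)), key=dists.__getitem__))
    return tokens


# Toy example: a 4-entry codebook in 2-D and four latent vectors for one sample.
toy_codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
toy_latents = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9), (0.8, 0.1)]
print(quantize(toy_latents, toy_codebook))  # [0, 3, 2, 1]
```

Each latent vector collapses to one integer token, which is what makes the output usable as a classification target for the VLM.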

	## Released checkpoint

	A single unified Open X-Embodiment (OXE) LAM trained jointly on three large-scale manipulation datasets:

	| Field | Value |
	|---|---|
	| Training data | BridgeData V2 + Fractal (RT-1) + BC-Z |
	| Variant | `paper_strict` |
	| Image resolution | 224 × 224 |
	| Trace window | 12 |
	| Action horizon | 8 |
	| Latent vocabulary size | 32 |
	| Latent tokens per sample | 4 |
	| DINOv2 visual encoder | ViT-B/14, frozen |
	| Model dim | 768 |

	This is the same LAM consumed by the released LIBERO and Bridge SemanticVLA policies.
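
The values in the table imply a compact discrete code space. The arithmetic below is a sketch derived only from the listed values and the standard ViT patch-grid formula. The variable names are illustrative, and any extra class or register tokens the encoder may add are not counted.

```python
import math

vocab_size = 32          # latent vocabulary size (from the table)
tokens_per_sample = 4    # latent tokens per sample (from the table)
image_size = 224         # input resolution in pixels
patch_size = 14          # the "/14" in ViT-B/14

# Each token carries log2(32) = 5 bits, so one sample carries 4 * 5 = 20 bits.
bits_per_token = math.log2(vocab_size)
bits_per_sample = tokens_per_sample * bits_per_token

# Number of distinct token sequences the LAM can emit for one sample.
num_codes = vocab_size ** tokens_per_sample

# A 224 x 224 image split into 14 x 14 patches gives a 16 x 16 grid.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

print(bits_per_sample)   # 20.0
print(num_codes)         # 1048576
print(num_patches)       # 256
```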

	## Files

	```
	SemanticVLA-LAM/
	├── README.md
	├── config.yaml # real, loadable model + data config
	├── release_metadata.yaml # human-readable summary
	└── pytorch_model.pt # LAM state_dict
	```
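
To fetch the files programmatically, one option is `hf_hub_download` from the `huggingface_hub` package, sketched below. The repo id `spikefly/SemanticVLA-LAM` is an assumption inferred from the directory name above and the namespace of the sibling repos, so check it against the page you are viewing. The helper name `download_lam` is hypothetical.

```python
from pathlib import Path
from typing import Dict

# Assumed repo id (inferred, not stated in this README); adjust if it differs.
ASSUMED_REPO_ID = "spikefly/SemanticVLA-LAM"

# File names taken from the listing above.
LAM_FILES = ("config.yaml", "release_metadata.yaml", "pytorch_model.pt")


def download_lam(repo_id: str = ASSUMED_REPO_ID) -> Dict[str, Path]:
    """Download each LAM file into the local Hugging Face cache and return its path."""
    # Imported lazily so this module can be imported without the package installed.
    from huggingface_hub import hf_hub_download

    return {name: Path(hf_hub_download(repo_id=repo_id, filename=name))
            for name in LAM_FILES}
```

Calling `download_lam()` needs network access. The returned paths point into the local cache and can be passed to `open` and `torch.load` in place of the bare file names.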

	## How to load

	```python
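	import torch
	import yaml
	from semanticvla.model.modules.latent_action_model import TraceLatentActionModel

	# Read the bundled model + data config; the context manager closes the file.
	with open("config.yaml") as f:
	    cfg = yaml.safe_load(f)

	# Build the model from the config, then load the released weights on CPU.
	model = TraceLatentActionModel.from_config(cfg["model"], variant=cfg["variant"])
	state = torch.load("pytorch_model.pt", map_location="cpu")
	model.load_state_dict(state)
	model.eval()  # switch to inference mode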
	```

	The DINOv2 visual encoder is located through two config entries, `cfg["model"]["dino_repo_root"]` and `cfg["model"]["dino_weights"]`. Point both at your local DINOv2 ViT-B/14 setup (for the weights, e.g. the `dinov2_vitb14_pretrain.pth` file from the official DINOv2 release). The bundled `config.yaml` ships them as the placeholders `${THIRD_PARTY_ROOT}` and `${DINO_WEIGHTS_PATH}`, so both must resolve to real paths before the model is built.
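
One way to fill the two placeholders after parsing the config is sketched below. It assumes they are plain `${NAME}` strings and that the repo's own loader does not already expand them. The helper `resolve_placeholders` and the example paths are hypothetical, and the dictionary stands in for the parsed `config.yaml`.

```python
from string import Template
from typing import Any, Mapping


def resolve_placeholders(node: Any, values: Mapping[str, str]) -> Any:
    """Recursively replace ${NAME} placeholders in every string of a nested config."""
    if isinstance(node, str):
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        return Template(node).safe_substitute(values)
    if isinstance(node, dict):
        return {key: resolve_placeholders(val, values) for key, val in node.items()}
    if isinstance(node, list):
        return [resolve_placeholders(item, values) for item in node]
    return node  # numbers, booleans and None pass through unchanged


# Stand-in for the parsed config.yaml; only the two DINOv2 keys are shown.
cfg = {
    "variant": "paper_strict",
    "model": {
        "dino_repo_root": "${THIRD_PARTY_ROOT}",
        "dino_weights": "${DINO_WEIGHTS_PATH}",
    },
}

# Hypothetical local paths; replace with your own.
paths = {
    "THIRD_PARTY_ROOT": "/opt/third_party",
    "DINO_WEIGHTS_PATH": "/opt/weights/dinov2_vitb14_pretrain.pth",
}

cfg = resolve_placeholders(cfg, paths)
print(cfg["model"]["dino_repo_root"])  # /opt/third_party
print(cfg["model"]["dino_weights"])    # /opt/weights/dinov2_vitb14_pretrain.pth
```

Substituting on the parsed structure rather than on the raw YAML text avoids quoting problems if a path contains characters that YAML treats specially.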

	## Use in downstream VLA training

	Use this LAM to precompute latent action labels for a target dataset, then train a SemanticVLA policy with those labels as the auxiliary semantic supervision target. See [`examples/OXE/`](https://github.com/Fei-Ni/SemanticVLA_Offcial/tree/main/examples/OXE) in the code repo for the three-stage recipe (trace labels → LAM labels → VLA training).
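
This README does not spell out the auxiliary objective. A common choice for discrete targets is a per-token cross-entropy, sketched below in plain Python as an assumption rather than the repo's actual loss. The function name `latent_token_cross_entropy` and the example labels are hypothetical; the 32-way vocabulary and 4 tokens per sample come from the table above.

```python
import math
from typing import Sequence


def latent_token_cross_entropy(logits: Sequence[Sequence[float]],
                               labels: Sequence[int]) -> float:
    """Mean cross-entropy between per-position logits and LAM token labels."""
    total = 0.0
    for position_logits, label in zip(logits, labels):
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(position_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        total += log_norm - position_logits[label]
    return total / len(labels)


vocab_size, tokens_per_sample = 32, 4
labels = [3, 17, 0, 31]  # hypothetical LAM labels for one sample

# Uniform logits give the chance-level loss, log(32).
uniform = [[0.0] * vocab_size for _ in range(tokens_per_sample)]
print(latent_token_cross_entropy(uniform, labels))
```

In a real training loop this term would typically be added to the action loss with some weight; how SemanticVLA combines the two is documented in the code repo, not here.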

	## Sibling SemanticVLA checkpoint repos

	| Repo | Purpose |
	|---|---|
	| 🤗 [`SemanticVLA-LIBERO`](https://huggingface.co/spikefly/SemanticVLA-LIBERO) | LIBERO policy that consumes this LAM |
	| 🤗 [`SemanticVLA-SimplerEnv`](https://huggingface.co/spikefly/SemanticVLA-SimplerEnv) | SimplerEnv WidowX policy that consumes this LAM |

	## Related resources

	- Code: https://github.com/Fei-Ni/SemanticVLA_Offcial
	- Datasets collection: https://hf.co/collections/spikefly/semanticvla-datasets
	- Model Zoo collection: https://hf.co/collections/spikefly/semanticvla-model-zoo

	## Citation

	```bibtex
	@inproceedings{ni2026semanticvla,
	title = {SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning},
	author = {Ni, Fei and Chen, Zhuo and Yuan, Yifu and Dong, Zibin and Yao, Xianze and Luo, Shan and Hao, Jianye and Deng, Jiankang and Zafeiriou, Stefanos},
	booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
	year = {2026}
	}
	```

	## License

	Released under the [MIT License](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/LICENSE).