| --- |
| license: mit |
| library_name: pytorch |
| pipeline_tag: robotics |
| tags: |
| - robotics |
| - vision-language-action |
| - vla |
| - latent-action-model |
| - tokenizer |
| - manipulation |
| --- |
| |
| # SemanticVLA · Latent Action Model (LAM) |
|
|
| > 🎉 **Accepted to [CVPR 2026](https://cvpr.thecvf.com/virtual/2026/poster/39352).** |
| > ✍️ Fei Ni¹, Zhuo Chen², Yifu Yuan³, Zibin Dong³, Xianze Yao³, Shan Luo², Jianye Hao³, Jiankang Deng¹†, Stefanos Zafeiriou¹†<br> |
| > 🏫 ¹Imperial College London ²King's College London ³Tianjin University<br> |
| > ✉️ Primary contact: [f.ni@imperial.ac.uk](mailto:f.ni@imperial.ac.uk) |
|
|
| Trace-conditioned **Latent Action Model (LAM)** checkpoint for [SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial). The LAM is a small VQ codebook trained jointly with a frozen DINOv2 visual encoder and a trace encoder. During downstream VLA training, the VLM predicts the LAM's discrete tokens as an auxiliary semantic target. |
|
|
| ## Released checkpoint |
|
|
| A single **unified OXE LAM** trained jointly on three large-scale manipulation datasets: |
|
|
| | Field | Value | |
| |---|---| |
| | Training data | BridgeData V2 + Fractal (RT-1) + BC-Z | |
| | Variant | `paper_strict` | |
| | Image resolution | 224 × 224 | |
| | Trace window | 12 | |
| | Action horizon | 8 | |
| | Latent vocabulary size | 32 | |
| | Latent tokens per sample | 4 | |
| | DINOv2 visual encoder | ViT-B/14, frozen | |
| | Model dim | 768 | |
|
|
| This is the same LAM consumed by the released LIBERO and Bridge SemanticVLA policies. |
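To make the "Latent vocabulary size" and "Latent tokens per sample" rows concrete, here is a minimal sketch of a generic VQ nearest-neighbour lookup in PyTorch. It is not the repository's implementation: the function name `nearest_code_indices`, the random codebook, and the code dimension of 768 (borrowed from the "Model dim" row purely for illustration) are all assumptions.

```python
import torch


def nearest_code_indices(latents: torch.Tensor, codebook: torch.Tensor):
    """Generic VQ lookup (illustrative only, not the SemanticVLA code).

    latents:  (B, T, D) continuous latent vectors, T tokens per sample
    codebook: (K, D) learned code vectors, K = vocabulary size
    returns:  (B, T) integer token ids and (B, T, D) quantized vectors
    """
    b, t, d = latents.shape
    flat = latents.reshape(b * t, d)        # (B*T, D)
    dists = torch.cdist(flat, codebook)     # (B*T, K) Euclidean distances
    indices = dists.argmin(dim=-1).reshape(b, t)
    quantized = codebook[indices]           # gather the chosen code vectors
    return indices, quantized


# Hypothetical shapes mirroring the table: K = 32 codes, T = 4 tokens per sample.
codebook = torch.randn(32, 768)             # assumed code dimension
latents = torch.randn(2, 4, 768)            # batch of 2 samples
tokens, _ = nearest_code_indices(latents, codebook)
# tokens has shape (2, 4): four token ids per sample, each in [0, 32)
```

With 32 codes and 4 tokens per sample, each sample is labelled with one of 32⁴ = 1,048,576 possible token combinations.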
|
|
| ## Files |
|
|
| ``` |
| SemanticVLA-LAM/ |
| ├── README.md |
| ├── config.yaml # real, loadable model + data config |
| ├── release_metadata.yaml # human-readable summary |
| └── pytorch_model.pt # LAM state_dict |
| ``` |
|
|
| ## How to load |
|
|
| ```python |
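| import torch |
| import yaml |
| from semanticvla.model.modules.latent_action_model import TraceLatentActionModel |
| |
| # Read the bundled config; the DINOv2 path placeholders must point at real local paths. |
| with open("config.yaml") as f: |
|     cfg = yaml.safe_load(f) |
| |
| # Build the LAM from the config, then restore the released weights on CPU. |
| model = TraceLatentActionModel.from_config(cfg["model"], variant=cfg["variant"]) |
| state = torch.load("pytorch_model.pt", map_location="cpu") |
| model.load_state_dict(state) |
| model.eval()  # switch to inference mode |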
| ``` |
|
|
| The DINOv2 visual encoder is loaded from `cfg["model"]["dino_repo_root"]` + `cfg["model"]["dino_weights"]`. Set these two paths to your local DINOv2 ViT-B/14 (e.g. the `dinov2_vitb14_pretrain.pth` file from the official DINOv2 release). The bundled `config.yaml` ships them as `${THIRD_PARTY_ROOT}` / `${DINO_WEIGHTS_PATH}` placeholders. |
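One way to fill the placeholders without editing `config.yaml` by hand is to expand them from environment variables before parsing. The sketch below assumes the placeholders are plain `${NAME}` strings and that the SemanticVLA loader does not already resolve them itself. The helper names `expand_placeholders` and `load_config_with_env` are hypothetical and not part of the repository.

```python
import os
import re

import yaml  # PyYAML


def expand_placeholders(text: str) -> str:
    """Replace ${NAME} with os.environ[NAME]; fail loudly if any remain."""
    expanded = os.path.expandvars(text)  # leaves unset variables untouched
    missing = re.findall(r"\$\{(\w+)\}", expanded)
    if missing:
        raise KeyError(f"unset environment variables: {sorted(set(missing))}")
    return expanded


def load_config_with_env(path: str) -> dict:
    """Read a YAML file and resolve its ${NAME} placeholders from the environment."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(expand_placeholders(f.read()))
```

After exporting `THIRD_PARTY_ROOT` and `DINO_WEIGHTS_PATH` in the shell, `cfg = load_config_with_env("config.yaml")` can stand in for the plain `yaml.safe_load` call in the loading snippet. Note that `os.path.expandvars` also expands bare `$NAME` forms, so this approach only suits configs that contain no other dollar signs.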
|
|
| ## Use in downstream VLA training |
|
|
| Use this LAM to **precompute latent action labels** for a target dataset, then train a SemanticVLA policy with those labels as the auxiliary semantic supervision target. See [`examples/OXE/`](https://github.com/Fei-Ni/SemanticVLA_Offcial/tree/main/examples/OXE) in the code repo for the three-stage recipe (trace label → LAM labels → VLA training). |
|
|
| ## Sibling SemanticVLA checkpoint repos |
|
|
| | Repo | Purpose | |
| |---|---| |
| | 🤗 [`SemanticVLA-LIBERO`](https://huggingface.co/spikefly/SemanticVLA-LIBERO) | LIBERO policy that consumes this LAM | |
| | 🤗 [`SemanticVLA-SimplerEnv`](https://huggingface.co/spikefly/SemanticVLA-SimplerEnv) | SimplerEnv WidowX policy that consumes this LAM | |
|
|
| ## Related resources |
|
|
| - **Code**: https://github.com/Fei-Ni/SemanticVLA_Offcial |
| - **Datasets collection**: https://hf.co/collections/spikefly/semanticvla-datasets |
| - **Model Zoo collection**: https://hf.co/collections/spikefly/semanticvla-model-zoo |
| |
| ## Citation |
| |
| ```bibtex |
| @inproceedings{ni2026semanticvla, |
| title = {SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning}, |
| author = {Ni, Fei and Chen, Zhuo and Yuan, Yifu and Dong, Zibin and Yao, Xianze and Luo, Shan and Hao, Jianye and Deng, Jiankang and Zafeiriou, Stefanos}, |
| booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, |
| year = {2026} |
| } |
| ``` |
| |
| ## License |
| |
| Released under the [MIT License](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/LICENSE). |
| |