---
license: mit
library_name: pytorch
pipeline_tag: robotics
tags:
  - robotics
  - vision-language-action
  - vla
  - latent-action-model
  - tokenizer
  - manipulation
---

# SemanticVLA · Latent Action Model (LAM)

> 🎉 **Accepted to [CVPR 2026](https://cvpr.thecvf.com/virtual/2026/poster/39352).**<br>
> ✍️ Fei Ni¹, Zhuo Chen², Yifu Yuan³, Zibin Dong³, Xianze Yao³, Shan Luo², Jianye Hao³, Jiankang Deng¹†, Stefanos Zafeiriou¹†<br>
> 🏫 ¹Imperial College London &nbsp;&nbsp; ²King's College London &nbsp;&nbsp; ³Tianjin University<br>
> ✉️ Primary contact: [f.ni@imperial.ac.uk](mailto:f.ni@imperial.ac.uk)

Trace-conditioned **Latent Action Model (LAM)** checkpoint for [SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial). The LAM is a small VQ codebook trained together with a trace encoder on top of a frozen DINOv2 visual encoder. During downstream VLA training, the VLM predicts the LAM's discrete tokens through an auxiliary semantic head.
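To make the "VQ codebook" idea concrete, here is a minimal, generic sketch of vector quantization: each continuous latent is replaced by the index of its nearest codebook entry. It is not the SemanticVLA implementation. The function name `quantize`, the Euclidean distance, and the use of 768 as the codebook dimension are assumptions made for illustration.

```python
import torch

def quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """Generic nearest-neighbour vector quantization (illustrative only).

    latents:  (batch, n_tokens, dim) continuous vectors
    codebook: (vocab, dim) learned code vectors
    Returns token ids of shape (batch, n_tokens) and the matching code vectors.
    """
    flat = latents.reshape(-1, latents.size(-1))       # (batch * n_tokens, dim)
    dists = torch.cdist(flat, codebook)                # Euclidean distances, (N, vocab)
    token_ids = dists.argmin(dim=1).reshape(latents.shape[:-1])
    quantized = codebook[token_ids]                    # look up the chosen codes
    return token_ids, quantized

# Shapes mirroring this card: 32 codes, 4 tokens per sample.
# The 768-dim codebook is an assumption, not a documented fact.
codebook = torch.randn(32, 768)
latents = torch.randn(2, 4, 768)
token_ids, quantized = quantize(latents, codebook)
print(token_ids.shape)  # torch.Size([2, 4])
```

The integer `token_ids` are the kind of discrete labels a downstream model can be trained to predict.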

## Released checkpoint

A single **unified OXE LAM** trained jointly on three large-scale manipulation datasets:

| Field | Value |
|---|---|
| Training data | BridgeData V2 + Fractal (RT-1) + BC-Z |
| Variant | `paper_strict` |
| Image resolution | 224 × 224 |
| Trace window | 12 |
| Action horizon | 8 |
| Latent vocabulary size | 32 |
| Latent tokens per sample | 4 |
| DINOv2 visual encoder | ViT-B/14, frozen |
| Model dim | 768 |

This is the same LAM consumed by the released LIBERO and Bridge SemanticVLA policies.
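As a quick worked example of what the table implies, 4 tokens drawn from a 32-entry vocabulary allow 32^4 = 1,048,576 distinct token sequences per sample. The sketch below computes that count and packs a token tuple into a single integer. The packing scheme and the helper names are illustrative assumptions, not part of the released code.

```python
VOCAB_SIZE = 32        # "Latent vocabulary size" from the table
TOKENS_PER_SAMPLE = 4  # "Latent tokens per sample" from the table

def num_latent_sequences(vocab_size=VOCAB_SIZE, n_tokens=TOKENS_PER_SAMPLE):
    """Number of distinct token sequences a sample can take."""
    return vocab_size ** n_tokens

def pack_tokens(tokens, vocab_size=VOCAB_SIZE):
    """Pack token ids into one integer (first token is most significant)."""
    index = 0
    for t in tokens:
        if not 0 <= t < vocab_size:
            raise ValueError(f"token {t} outside [0, {vocab_size})")
        index = index * vocab_size + t
    return index

def unpack_tokens(index, n_tokens=TOKENS_PER_SAMPLE, vocab_size=VOCAB_SIZE):
    """Inverse of pack_tokens."""
    tokens = []
    for _ in range(n_tokens):
        index, t = divmod(index, vocab_size)
        tokens.append(t)
    return tokens[::-1]

print(num_latent_sequences())  # 1048576
```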

## Files

```
SemanticVLA-LAM/
├── README.md
├── config.yaml              # real, loadable model + data config
├── release_metadata.yaml    # human-readable summary
└── pytorch_model.pt         # LAM state_dict
```

## How to load

```python
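import torch
import yaml

from semanticvla.model.modules.latent_action_model import TraceLatentActionModel

# Load the bundled model + data config, closing the file afterwards.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Build the LAM, then restore the released weights on CPU.
model = TraceLatentActionModel.from_config(cfg["model"], variant=cfg["variant"])
state = torch.load("pytorch_model.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()  # switch to inference mode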
```

The DINOv2 visual encoder is loaded from two config entries, `cfg["model"]["dino_repo_root"]` and `cfg["model"]["dino_weights"]`. Point them at your local DINOv2 ViT-B/14 (e.g. the `dinov2_vitb14_pretrain.pth` file from the official DINOv2 release). The bundled `config.yaml` ships them as the placeholders `${THIRD_PARTY_ROOT}` and `${DINO_WEIGHTS_PATH}`, so replace these with real paths before building the model.
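`yaml.safe_load` does not expand `${...}` placeholders on its own. One way to fill them in, sketched below, is to set environment variables with the same names and expand every string in the loaded config. The helper `expand_placeholders` is hypothetical, and this assumes the placeholders follow shell-style `${VAR}` syntax. The SemanticVLA code may already resolve them differently.

```python
import os

def expand_placeholders(node):
    """Recursively expand ${VAR} placeholders in a loaded YAML structure."""
    if isinstance(node, dict):
        return {key: expand_placeholders(value) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_placeholders(value) for value in node]
    if isinstance(node, str):
        # os.path.expandvars leaves unset variables untouched.
        return os.path.expandvars(node)
    return node

# Example with a stand-in config dict; the path below is a made-up example value.
os.environ["THIRD_PARTY_ROOT"] = "/path/to/third_party"
example_cfg = {"model": {"dino_repo_root": "${THIRD_PARTY_ROOT}"}}
print(expand_placeholders(example_cfg)["model"]["dino_repo_root"])
```

Because unset variables are left as-is, a leftover `${...}` string in the expanded config is a quick sign that an environment variable is missing.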

## Use in downstream VLA training

Use this LAM to **precompute latent action labels** for a target dataset, then train a SemanticVLA policy with those labels as the auxiliary semantic supervision target. See [`examples/OXE/`](https://github.com/Fei-Ni/SemanticVLA_Offcial/tree/main/examples/OXE) in the code repo for the three-stage recipe (trace labels → LAM labels → VLA training).
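As a rough illustration of what "auxiliary semantic supervision" can look like, the sketch below scores predicted token logits against precomputed LAM labels with a cross-entropy loss. The loss form, the function name `latent_token_loss`, and the tensor shapes are assumptions for illustration. The actual objective is defined in the code repo.

```python
import torch
import torch.nn.functional as F

def latent_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted latent-token logits and LAM labels.

    logits: (batch, n_tokens, vocab) scores from a hypothetical auxiliary head
    labels: (batch, n_tokens) integer token ids precomputed with the LAM
    """
    # Flatten so every token position is treated as one classification example.
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

# Shapes mirroring this card: 4 tokens per sample, vocabulary of 32.
logits = torch.randn(8, 4, 32)
labels = torch.randint(0, 32, (8, 4))
loss = latent_token_loss(logits, labels)
print(loss.shape)  # torch.Size([])
```

In a full training loop this term would typically be added to the main action loss with some weighting. That weighting is not specified here.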

## Sibling SemanticVLA checkpoint repos

| Repo | Purpose |
|---|---|
| 🤗 [`SemanticVLA-LIBERO`](https://huggingface.co/spikefly/SemanticVLA-LIBERO) | LIBERO policy that consumes this LAM |
| 🤗 [`SemanticVLA-SimplerEnv`](https://huggingface.co/spikefly/SemanticVLA-SimplerEnv) | SimplerEnv WidowX policy that consumes this LAM |

## Related resources

- **Code**: https://github.com/Fei-Ni/SemanticVLA_Offcial
- **Datasets collection**: https://hf.co/collections/spikefly/semanticvla-datasets
- **Model Zoo collection**: https://hf.co/collections/spikefly/semanticvla-model-zoo

## Citation

```bibtex
@inproceedings{ni2026semanticvla,
  title     = {SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning},
  author    = {Ni, Fei and Chen, Zhuo and Yuan, Yifu and Dong, Zibin and Yao, Xianze and Luo, Shan and Hao, Jianye and Deng, Jiankang and Zafeiriou, Stefanos},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```

## License

Released under the [MIT License](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/LICENSE).