license: mit
library_name: pytorch
pipeline_tag: robotics
tags:
  - robotics
  - vision-language-action
  - vla
  - latent-action-model
  - tokenizer
  - manipulation

SemanticVLA · Latent Action Model (LAM)

🎉 Accepted to CVPR 2026.
✍️ Fei Ni¹, Zhuo Chen², Yifu Yuan³, Zibin Dong³, Xianze Yao³, Shan Luo², Jianye Hao³, Jiankang Deng¹†, Stefanos Zafeiriou¹†
🏫 ¹Imperial College London ²King's College London ³Tianjin University
✉️ Primary contact: f.ni@imperial.ac.uk

Trace-conditioned Latent Action Model (LAM) checkpoint for SemanticVLA. The LAM is a small VQ codebook trained jointly with a trace encoder on top of a frozen DINOv2 visual encoder. During downstream VLA training, the VLM predicts the LAM's discrete tokens through an auxiliary semantic head.
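For readers unfamiliar with VQ codebooks, the sketch below shows a generic nearest-neighbour quantization step in plain Python. It is an illustration of the general technique only, not the repository's implementation: the function name `quantize`, the toy embedding width `DIM`, and the random vectors are all assumptions. Only the 32-entry vocabulary and the 4 tokens per sample mirror the values listed for the released checkpoint.

```python
import random


def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]


random.seed(0)
VOCAB_SIZE = 32          # latent vocabulary size of the released LAM
TOKENS_PER_SAMPLE = 4    # latent tokens per sample of the released LAM
DIM = 8                  # toy embedding width (assumption, chosen for readability)

# Random stand-ins for a learned codebook and for continuous encoder outputs.
codebook = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]
latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(TOKENS_PER_SAMPLE)]

tokens = quantize(latents, codebook)  # four integers in [0, 32)
```

Each continuous latent is replaced by the index of its closest codebook entry, which is what turns the encoder output into discrete tokens that a classification head can predict.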

Released checkpoint

A single unified OXE LAM trained jointly on three large-scale manipulation datasets:

| Field | Value |
| --- | --- |
| Training data | BridgeData V2 + Fractal (RT-1) + BC-Z |
| Variant | `paper_strict` |
| Image resolution | 224 × 224 |
| Trace window | 12 |
| Action horizon | 8 |
| Latent vocabulary size | 32 |
| Latent tokens per sample | 4 |
| DINOv2 visual encoder | ViT-B/14, frozen |
| Model dim | 768 |

This is the same LAM consumed by the released LIBERO and Bridge SemanticVLA policies.
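A few quantities follow directly from the table values and can be useful when sizing label storage. The sketch below derives them and shows one hypothetical way to pack the four tokens of a sample into a single integer; the `pack` and `unpack` helpers are illustrative assumptions and are not part of the SemanticVLA code.

```python
VOCAB_SIZE = 32         # latent vocabulary size
TOKENS_PER_SAMPLE = 4   # latent tokens per sample
IMAGE_SIZE = 224        # input resolution (pixels per side)
PATCH_SIZE = 14         # ViT-B/14 patch size

# Number of distinct token sequences a sample can take: 32^4.
num_codes = VOCAB_SIZE ** TOKENS_PER_SAMPLE
# Patch grid seen by the ViT at this resolution: 224 / 14 = 16 per side.
patches_per_side = IMAGE_SIZE // PATCH_SIZE
num_patches = patches_per_side ** 2


def pack(tokens):
    """Pack a token sequence into one base-32 integer (hypothetical storage format)."""
    code = 0
    for t in tokens:
        if not 0 <= t < VOCAB_SIZE:
            raise ValueError(f"token {t} outside [0, {VOCAB_SIZE})")
        code = code * VOCAB_SIZE + t
    return code


def unpack(code):
    """Invert pack(): recover the TOKENS_PER_SAMPLE tokens from the integer."""
    tokens = []
    for _ in range(TOKENS_PER_SAMPLE):
        code, t = divmod(code, VOCAB_SIZE)
        tokens.append(t)
    return tokens[::-1]


example = pack([3, 17, 0, 31])
```

With these settings a sample's latent action fits in 20 bits, so packed labels for a large dataset stay small.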

Files

SemanticVLA-LAM/
├── README.md
├── config.yaml              # real, loadable model + data config
├── release_metadata.yaml    # human-readable summary
└── pytorch_model.pt         # LAM state_dict

How to load

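import torch
import yaml

from semanticvla.model.modules.latent_action_model import TraceLatentActionModel

# Parse the bundled config; the context manager closes the file handle.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Build the LAM from the config, then load the released weights on CPU.
model = TraceLatentActionModel.from_config(cfg["model"], variant=cfg["variant"])
state = torch.load("pytorch_model.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()  # switch to inference mode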

The DINOv2 visual encoder is located through two config entries, cfg["model"]["dino_repo_root"] and cfg["model"]["dino_weights"]. Point them at your local DINOv2 code and ViT-B/14 weights (e.g. the dinov2_vitb14_pretrain.pth file from the official DINOv2 release). The bundled config.yaml ships them as the placeholders ${THIRD_PARTY_ROOT} and ${DINO_WEIGHTS_PATH}, so they must be replaced before the model is built.
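If you prefer not to edit config.yaml by hand, one option is to fill the placeholders from environment variables after parsing. The sketch below uses the standard-library `os.path.expandvars` for this. It is an assumption that the repository does not already resolve these placeholders itself; the helper name `expand_placeholders`, the example paths, and the exact form of the placeholder values in the stand-in dictionary are hypothetical.

```python
import os


def expand_placeholders(node):
    """Recursively expand ${VAR} placeholders in string values from the environment."""
    if isinstance(node, dict):
        return {key: expand_placeholders(value) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_placeholders(value) for value in node]
    if isinstance(node, str):
        # Unset variables are left unchanged by expandvars.
        return os.path.expandvars(node)
    return node


# Hypothetical local paths; replace with your own.
os.environ["THIRD_PARTY_ROOT"] = "/opt/third_party"
os.environ["DINO_WEIGHTS_PATH"] = "/opt/weights/dinov2_vitb14_pretrain.pth"

# Stand-in for cfg["model"] as parsed from config.yaml (other keys omitted).
model_cfg = {
    "dino_repo_root": "${THIRD_PARTY_ROOT}",
    "dino_weights": "${DINO_WEIGHTS_PATH}",
}

resolved = expand_placeholders(model_cfg)
```

In practice you would apply the helper to the parsed config before passing cfg["model"] to the model constructor. Because unset variables are left as-is, checking the resolved values for a remaining "${" is a cheap way to catch a missing export.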

Use in downstream VLA training

Use this LAM to precompute latent action labels for a target dataset, then train a SemanticVLA policy with those labels as the auxiliary semantic supervision target. See examples/OXE/ in the code repo for the three-stage recipe (trace labels → LAM labels → VLA training).

Sibling SemanticVLA checkpoint repos

| Repo | Purpose |
| --- | --- |
| 🤗 `SemanticVLA-LIBERO` | LIBERO policy that consumes this LAM |
| 🤗 `SemanticVLA-SimplerEnv` | SimplerEnv WidowX policy that consumes this LAM |

Related resources

Code: https://github.com/Fei-Ni/SemanticVLA_Offcial
Datasets collection: https://hf.co/collections/spikefly/semanticvla-datasets
Model Zoo collection: https://hf.co/collections/spikefly/semanticvla-model-zoo

Citation

@inproceedings{ni2026semanticvla,
  title     = {SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning},
  author    = {Ni, Fei and Chen, Zhuo and Yuan, Yifu and Dong, Zibin and Yao, Xianze and Luo, Shan and Hao, Jianye and Deng, Jiankang and Zafeiriou, Stefanos},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

License

Released under the MIT License.