| --- |
| license: mit |
| library_name: pytorch |
| pipeline_tag: robotics |
| tags: |
| - robotics |
| - vision-language-action |
| - vla |
| - simpler-env |
| - widowx |
| - bridge |
| - manipulation |
| - qwen-vl |
| --- |
| |
| # SemanticVLA · SimplerEnv (WidowX) |
|
|
| > 🎉 **Accepted to [CVPR 2026](https://cvpr.thecvf.com/virtual/2026/poster/39352).** |
| > ✍️ Fei Ni¹, Zhuo Chen², Yifu Yuan³, Zibin Dong³, Xianze Yao³, Shan Luo², Jianye Hao³, Jiankang Deng¹†, Stefanos Zafeiriou¹†<br> |
| > 🏫 ¹Imperial College London ²King's College London ³Tianjin University<br> |
| > ✉️ Primary contact: [f.ni@imperial.ac.uk](mailto:f.ni@imperial.ac.uk) |
|
|
| This repo hosts a [SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial) policy trained on BridgeData V2 (Open X-Embodiment `bridge_orig`) for **100K steps** and intended for [SimplerEnv](https://github.com/simpler-env/SimplerEnv) WidowX evaluation. It uses the unified OXE LAM as its latent-action tokenizer and supervises the trace and latent-action auxiliary heads in the VLM's language stream. |
|
|
| ## Headline result (SimplerEnv WidowX) |
|
|
| | Task | Success rate | |
| |---|---:| |
| | Put Eggplant in Basket | 0.958 | |
| | Spoon on Towel | 1.000 | |
| | Carrot on Plate | 0.792 | |
| | Stack Cube | 0.458 | |
| | **Mean** | **0.802** | |
|
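The reported mean is consistent with an unweighted average of the four per-task success rates. A quick sanity check, using only the numbers from the table above:

```python
# Per-task success rates copied from the table above.
success_rates = {
    "Put Eggplant in Basket": 0.958,
    "Spoon on Towel": 1.000,
    "Carrot on Plate": 0.792,
    "Stack Cube": 0.458,
}

# Unweighted mean over tasks, rounded to three decimals like the table.
mean_success = round(sum(success_rates.values()) / len(success_rates), 3)
print(mean_success)
```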
|
| ## Architecture |
|
|
| | Component | Choice | |
| |---|---| |
| | VLM backbone | Qwen3-VL-4B-Instruct | |
| | Action head | DiT-B (flow matching) | |
| | LAM tokenizer | [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) (unified OXE LAM) | |
| | Semantic supervision | Trace + latent action tokens predicted in the VLM's language stream; action decoder unmodified | |
| | Latent vocabulary size | 32 | |
| | Latent tokens per sample | 4 | |
| | Action horizon | 16 | |
|
|
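To make the latent-action settings concrete, the sketch below records the table's values in a small dataclass and derives how many distinct latent-token sequences a single sample can take. The class and field names are hypothetical and not taken from the SemanticVLA codebase, and the count assumes each of the 4 tokens is drawn independently from the 32-entry vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentActionSpec:
    """Hypothetical container for values from the architecture table."""

    vocab_size: int = 32        # "Latent vocabulary size"
    tokens_per_sample: int = 4  # "Latent tokens per sample"
    action_horizon: int = 16    # "Action horizon"

    def num_latent_sequences(self) -> int:
        # Assumption: every token position can take any vocabulary entry.
        return self.vocab_size ** self.tokens_per_sample


spec = LatentActionSpec()
print(spec.num_latent_sequences())  # 32**4 = 1048576
```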
| ## Training data |
|
|
| This checkpoint was trained on **BridgeData V2** (Open X-Embodiment `bridge_orig`) for 100K steps. It targets SimplerEnv WidowX evaluation specifically and is **not** intended as a general-purpose policy for other robot embodiments. |
|
|
| ## Files |
|
|
| ``` |
| SemanticVLA-SimplerEnv/ |
| ├── README.md |
| ├── config.yaml # loadable model config |
| ├── dataset_statistics.json # action normalization stats |
| └── final_model/ |
| └── pytorch_model.pt # policy state_dict |
| ``` |
|
|
| ## How to load |
|
|
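| ```python |
| from semanticvla.model.framework.base_framework import baseframework |
| |
| # Pass the checkpoint path inside the released layout so that config.yaml |
| # and dataset_statistics.json can be found above final_model/. |
| policy = baseframework.from_pretrained("SemanticVLA-SimplerEnv/final_model/pytorch_model.pt") |
| policy.eval()  # switch to evaluation mode |
| ``` |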
|
|
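| `baseframework.from_pretrained()` takes the path to the checkpoint file and looks for `config.yaml` and `dataset_statistics.json` two directory levels up from it, i.e. in the repo root that contains `final_model/`. The released layout already follows this convention, so keep the directory structure intact when downloading or moving the checkpoint. |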
|
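The lookup described above can be sketched with `pathlib`. This illustrates the directory convention only and is not the actual `from_pretrained()` implementation; the helper name `locate_sidecar_files` is hypothetical.

```python
from pathlib import Path


def locate_sidecar_files(checkpoint_path: str) -> tuple[Path, Path]:
    """Hypothetical helper: derive config and stats paths from a checkpoint path."""
    ckpt = Path(checkpoint_path)
    # parents[0] is final_model/, parents[1] is the repo root above it.
    repo_root = ckpt.parents[1]
    return repo_root / "config.yaml", repo_root / "dataset_statistics.json"


config_path, stats_path = locate_sidecar_files(
    "SemanticVLA-SimplerEnv/final_model/pytorch_model.pt"
)
print(config_path)
print(stats_path)
```

Under this convention the checkpoint should stay inside `final_model/`, with `config.yaml` and `dataset_statistics.json` in the directory above it.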
|
| To run the SimplerEnv WidowX suite, see [`examples/SimplerEnv/`](https://github.com/Fei-Ni/SemanticVLA_Offcial/tree/main/examples/SimplerEnv) in the code repo. |
|
|
| ## Sibling SemanticVLA checkpoint repos |
|
|
| | Repo | Purpose | |
| |---|---| |
| | 🤗 [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) | Unified OXE LAM consumed by this policy | |
| | 🤗 [`SemanticVLA-LIBERO`](https://huggingface.co/spikefly/SemanticVLA-LIBERO) | LIBERO policy | |
|
|
| ## Related resources |
|
|
| - **Code**: https://github.com/Fei-Ni/SemanticVLA_Offcial |
| - **Dataset (BridgeData V2 in LeRobot v3 with dense traces)**: 🤗 [`SemanticVLA-TraceX-240K-Bridge`](https://huggingface.co/datasets/spikefly/SemanticVLA-TraceX-240K-Bridge) |
| - **Datasets collection**: https://hf.co/collections/spikefly/semanticvla-datasets |
| - **Model Zoo collection**: https://hf.co/collections/spikefly/semanticvla-model-zoo |
| |
| ## Citation |
| |
| ```bibtex |
| @inproceedings{ni2026semanticvla, |
| title = {SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning}, |
| author = {Ni, Fei and Chen, Zhuo and Yuan, Yifu and Dong, Zibin and Yao, Xianze and Luo, Shan and Hao, Jianye and Deng, Jiankang and Zafeiriou, Stefanos}, |
| booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, |
| year = {2026} |
| } |
| ``` |
| |
| ## License |
| |
| Released under the [MIT License](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/LICENSE), subject to the upstream BridgeData V2 license. |
| |