---
license: mit
library_name: pytorch
pipeline_tag: robotics
tags:
  - robotics
  - vision-language-action
  - vla
  - simpler-env
  - widowx
  - bridge
  - manipulation
  - qwen-vl
---

# SemanticVLA · SimplerEnv (WidowX)

> 🎉 **Accepted to [CVPR 2026](https://cvpr.thecvf.com/virtual/2026/poster/39352).**<br>
> ✍️ Fei Ni¹, Zhuo Chen², Yifu Yuan³, Zibin Dong³, Xianze Yao³, Shan Luo², Jianye Hao³, Jiankang Deng¹†, Stefanos Zafeiriou¹†<br>
> 🏫 ¹Imperial College London &nbsp;&nbsp; ²King's College London &nbsp;&nbsp; ³Tianjin University<br>
> ✉️ Primary contact: [f.ni@imperial.ac.uk](mailto:f.ni@imperial.ac.uk)

[SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial) policy trained on BridgeData V2 (Open X-Embodiment `bridge_orig`) for **100K steps**, intended for [SimplerEnv](https://github.com/simpler-env/SimplerEnv) WidowX evaluation. The unified OXE LAM is used as the latent-action tokenizer, and the trace + latent-action auxiliary heads are supervised in the VLM's language stream.

## Headline result (SimplerEnv WidowX)

| Task | Success rate |
|---|---:|
| Put Eggplant in Basket | 0.958 |
| Spoon on Towel         | 1.000 |
| Carrot on Plate        | 0.792 |
| Stack Cube             | 0.458 |
| **Mean**               | **0.802** |
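
The mean row is the unweighted average of the four per-task success rates. A quick check, with the values copied from the table above (the variable names are illustrative only):

```python
# Per-task success rates copied from the table above.
success_rates = {
    "Put Eggplant in Basket": 0.958,
    "Spoon on Towel": 1.000,
    "Carrot on Plate": 0.792,
    "Stack Cube": 0.458,
}

# Unweighted mean across the four tasks.
mean_success = sum(success_rates.values()) / len(success_rates)
print(f"{mean_success:.3f}")  # prints 0.802
```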

## Architecture

| Component | Choice |
|---|---|
| VLM backbone | Qwen3-VL-4B-Instruct |
| Action head | DiT-B (flow matching) |
| LAM tokenizer | [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) (unified OXE LAM) |
| Semantic supervision | Trace + latent action tokens predicted in the VLM's language stream; action decoder unmodified |
| Latent vocabulary size | 32 |
| Latent tokens per sample | 4 |
| Action horizon | 16 |

## Training data

This checkpoint is trained on **BridgeData V2** (Open X-Embodiment `bridge_orig`) for 100K steps. It is intended specifically for SimplerEnv WidowX evaluation and is **not** meant as a general-purpose policy for unrelated robot embodiments.

## Files

```
SemanticVLA-SimplerEnv/
├── README.md
├── config.yaml              # loadable model config
├── dataset_statistics.json  # action normalization stats
└── final_model/
    └── pytorch_model.pt     # policy state_dict
```

## How to load

```python
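from semanticvla.model.framework.base_framework import baseframework

# Point at the checkpoint inside the downloaded repo so that config.yaml and
# dataset_statistics.json can be located relative to it.
policy = baseframework.from_pretrained("SemanticVLA-SimplerEnv/final_model/pytorch_model.pt")
policy.eval()  # switch to inference mode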
```

`baseframework.from_pretrained()` walks two directory levels up from the checkpoint file to locate `config.yaml` and `dataset_statistics.json`. The released layout follows this convention.
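
The lookup convention can be illustrated with plain `pathlib`. This is a minimal sketch of one reading of "two directory levels up" (the checkpoint's parent directory, then that directory's parent); `locate_sidecar_files` is a hypothetical helper for illustration and is not part of the `semanticvla` package.

```python
from pathlib import Path


def locate_sidecar_files(checkpoint_path):
    """Hypothetical helper: resolve config and stats paths from a checkpoint path."""
    ckpt = Path(checkpoint_path)
    # final_model/pytorch_model.pt -> final_model/ -> repo root
    repo_root = ckpt.parent.parent
    return repo_root / "config.yaml", repo_root / "dataset_statistics.json"


config_path, stats_path = locate_sidecar_files(
    "SemanticVLA-SimplerEnv/final_model/pytorch_model.pt"
)
print(config_path.as_posix())  # prints SemanticVLA-SimplerEnv/config.yaml
print(stats_path.as_posix())   # prints SemanticVLA-SimplerEnv/dataset_statistics.json
```

Under this reading, moving `pytorch_model.pt` out of `final_model/` would break the lookup, so it is safest to keep the released directory layout unchanged.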

To run the SimplerEnv WidowX suite, see [`examples/SimplerEnv/`](https://github.com/Fei-Ni/SemanticVLA_Offcial/tree/main/examples/SimplerEnv) in the code repo.

## Sibling SemanticVLA checkpoint repos

| Repo | Purpose |
|---|---|
| 🤗 [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) | Unified OXE LAM consumed by this policy |
| 🤗 [`SemanticVLA-LIBERO`](https://huggingface.co/spikefly/SemanticVLA-LIBERO) | LIBERO policy |

## Related resources

- **Code**: https://github.com/Fei-Ni/SemanticVLA_Offcial
- **Dataset (BridgeData V2 in LeRobot v3 with dense traces)**: 🤗 [`SemanticVLA-TraceX-240K-Bridge`](https://huggingface.co/datasets/spikefly/SemanticVLA-TraceX-240K-Bridge)
- **Datasets collection**: https://hf.co/collections/spikefly/semanticvla-datasets
- **Model Zoo collection**: https://hf.co/collections/spikefly/semanticvla-model-zoo

## Citation

```bibtex
@inproceedings{ni2026semanticvla,
  title     = {SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning},
  author    = {Ni, Fei and Chen, Zhuo and Yuan, Yifu and Dong, Zibin and Yao, Xianze and Luo, Shan and Hao, Jianye and Deng, Jiankang and Zafeiriou, Stefanos},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```

## License

Released under the [MIT License](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/LICENSE), subject to the upstream BridgeData V2 license.