---
license: mit
library_name: pytorch
pipeline_tag: robotics
tags:
  - robotics
  - vision-language-action
  - vla
  - libero
  - manipulation
  - qwen-vl
---

# SemanticVLA · LIBERO

> 🎉 **Accepted to [CVPR 2026](https://cvpr.thecvf.com/virtual/2026/poster/39352).**<br>
> ✍️ Fei Ni¹, Zhuo Chen², Yifu Yuan³, Zibin Dong³, Xianze Yao³, Shan Luo², Jianye Hao³, Jiankang Deng¹†, Stefanos Zafeiriou¹†<br>
> 🏫 ¹Imperial College London &nbsp;&nbsp; ²King's College London &nbsp;&nbsp; ³Tianjin University<br>
> ✉️ Primary contact: [f.ni@imperial.ac.uk](mailto:f.ni@imperial.ac.uk)

This repository hosts the [SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial) policy fine-tuned on the [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO) benchmark. It uses the unified OXE LAM as its latent-action tokenizer and supervises the trace and latent-action auxiliary heads in the VLM's language stream.

## Headline result

| Suite | Success rate |
|---|---:|
| `libero_spatial` | 0.988 |
| `libero_object`  | 0.996 |
| `libero_goal`    | 0.974 |
| `libero_10`      | 0.970 |
| **4-suite mean** | **0.982** |
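
As a quick sanity check, the reported mean can be reproduced from the per-suite numbers. The sketch below assumes the mean is unweighted (each suite counts equally); the variable names are illustrative only.

```python
# Per-suite success rates copied from the table above.
suite_success = {
    "libero_spatial": 0.988,
    "libero_object": 0.996,
    "libero_goal": 0.974,
    "libero_10": 0.970,
}

# Unweighted mean: every suite contributes equally (an assumption).
mean_success = sum(suite_success.values()) / len(suite_success)
print(f"4-suite mean: {mean_success:.3f}")  # 4-suite mean: 0.982
```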

## Architecture

| Component | Choice |
|---|---|
| VLM backbone | Qwen3-VL-4B-Instruct |
| Action head | DiT-B (flow matching) |
| LAM tokenizer | [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) (unified OXE LAM) |
| Semantic supervision | Trace + latent action tokens predicted in the VLM's language stream; action decoder unmodified |
| Latent vocabulary size | 32 |
| Latent tokens per sample | 4 |
| Action horizon | 8 |
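
To give a feel for the latent-action settings, the short calculation below derives how many distinct latent sequences a sample can take. It assumes every one of the 4 token positions draws from the same 32-entry vocabulary; the constant names are hypothetical and do not come from the code repo.

```python
import math

LATENT_VOCAB_SIZE = 32        # "Latent vocabulary size" from the table
LATENT_TOKENS_PER_SAMPLE = 4  # "Latent tokens per sample" from the table

# Number of distinct latent token sequences per sample (assumes a shared vocabulary).
num_sequences = LATENT_VOCAB_SIZE ** LATENT_TOKENS_PER_SAMPLE

# Equivalent capacity in bits: 4 tokens x log2(32) bits each.
bits_per_sample = LATENT_TOKENS_PER_SAMPLE * math.log2(LATENT_VOCAB_SIZE)

print(num_sequences)    # 1048576
print(bits_per_sample)  # 20.0
```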

## Files

```
SemanticVLA-LIBERO/
├── README.md
├── config.yaml              # loadable model config
├── dataset_statistics.json  # action normalization stats
└── final_model/
    └── pytorch_model.pt     # policy state_dict
```
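
After downloading, it may be worth confirming that the local copy matches the tree above before loading. The helper below is a minimal sketch using only the standard library; `missing_files` and `EXPECTED_FILES` are hypothetical names, and the local directory name is an assumption.

```python
from pathlib import Path

# Relative paths taken from the file tree above.
EXPECTED_FILES = (
    "config.yaml",
    "dataset_statistics.json",
    "final_model/pytorch_model.pt",
)


def missing_files(repo_dir):
    """Return the expected files that are absent under repo_dir."""
    root = Path(repo_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# An empty list means the layout is complete.
print(missing_files("SemanticVLA-LIBERO"))
```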

## How to load

```python
from semanticvla.model.framework.base_framework import baseframework

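# Pass the checkpoint path inside the released layout (see "Files" above),
# so that config.yaml and dataset_statistics.json can be found alongside it.
policy = baseframework.from_pretrained("SemanticVLA-LIBERO/final_model/pytorch_model.pt")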
policy.eval()
```

`baseframework.from_pretrained()` walks two directory levels up from the checkpoint file to locate `config.yaml` and `dataset_statistics.json`, so the checkpoint must stay inside `final_model/` with both files in the directory above it. The released layout follows this convention.
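
The lookup described above can be pictured with `pathlib`. This is only an illustration of the convention, not the code used in the repo; `resolve_sidecar_files` is a hypothetical helper.

```python
from pathlib import Path


def resolve_sidecar_files(checkpoint_path):
    """Locate config.yaml and dataset_statistics.json relative to a checkpoint.

    Illustrative only: mirrors the "two levels up" convention described above.
    """
    ckpt = Path(checkpoint_path).resolve()
    # parents[0] is final_model/, parents[1] is the repo root.
    repo_root = ckpt.parents[1]
    return repo_root / "config.yaml", repo_root / "dataset_statistics.json"


config_path, stats_path = resolve_sidecar_files(
    "SemanticVLA-LIBERO/final_model/pytorch_model.pt"
)
print(config_path.parent.name)  # SemanticVLA-LIBERO
```

If either path does not exist, moving the checkpoint back under `final_model/` is the likely fix.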

To run a full LIBERO evaluation, see [`examples/LIBERO/`](https://github.com/Fei-Ni/SemanticVLA_Offcial/tree/main/examples/LIBERO) in the code repo.

## Sibling SemanticVLA checkpoint repos

| Repo | Purpose |
|---|---|
| 🤗 [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) | Unified OXE LAM consumed by this policy |
| 🤗 [`SemanticVLA-SimplerEnv`](https://huggingface.co/spikefly/SemanticVLA-SimplerEnv) | SimplerEnv WidowX policy |

## Related resources

- **Code**: https://github.com/Fei-Ni/SemanticVLA_Offcial
- **Datasets collection**: https://hf.co/collections/spikefly/semanticvla-datasets
- **Model Zoo collection**: https://hf.co/collections/spikefly/semanticvla-model-zoo

## Citation

```bibtex
@inproceedings{ni2026semanticvla,
  title     = {SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning},
  author    = {Ni, Fei and Chen, Zhuo and Yuan, Yifu and Dong, Zibin and Yao, Xianze and Luo, Shan and Hao, Jianye and Deng, Jiankang and Zafeiriou, Stefanos},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```

## License

Released under the [MIT License](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/LICENSE).