---
license: apache-2.0
tags:
- vla
- iclr
- iclr-2026
- vision-language-action
- spatial-understanding
- generalist-robot-policies
---
<div align="center">
# | *FALCON* | From Spatial to Actions: <br>Grounding Vision-Language-Action Model in Spatial Foundation Priors (ICLR 2026)
<a href="https://arxiv.org/abs/2510.17439" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-FALCON-red?logo=arxiv" height="25" />
</a>
<a href="https://falcon-vla.github.io/" target="_blank">
<img alt="Website" src="https://img.shields.io/badge/π_Website-falcon.io-blue.svg" height="25" />
</a>
<a href="https://github.com/FALCON-VLA/FALCON" target="_blank">
<img alt="GitHub Code: FALCON" src="https://img.shields.io/badge/Code-FALCON-181717?logo=github&logoColor=white" height="25" />
</a>
<a href="https://huggingface.co/papers/2510.17439" target="_blank">
<img alt="HF Paper: FALCON" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Paper-FALCON-ffc107?color=ffc107&logoColor=white" height="25" />
</a>
<!-- <a href="https://huggingface.co/datasets/robovlms/bytedance_robot_benchmark_20" target="_blank">
<img alt="HF Dataset: BDRBench-20" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Dataset-BDRBench20-ffc107?color=ffc107&logoColor=white" height="25" />
</a> -->
<br>
<a href="https://www.python.org/" target="_blank">
<img alt="Python 3.8" src="https://img.shields.io/badge/Python-%3E=3.8-blue" height="25" />
</a>
<a href="https://pytorch.org/" target="_blank">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-%3E=2.1-orange" height="25" />
</a>
</div>
<div align="center">
<br>
<div style="text-align: center;">
<a href="https://scholar.google.com/citations?user=8nrJ1vsAAAAJ&hl=en" target="_blank">Zhengshen Zhang</a>  
<a href="https://scholar.google.com/citations?user=4dokjDoAAAAJ&hl=zh-CN" target="_blank">Hao Li</a>  
<a href="https://scholar.google.com/citations?user=6XyNVowAAAAJ&hl=en" target="_blank">Yalun Dai</a>  
<a href="https://scholar.google.com/citations?user=ozatRA0AAAAJ&hl=zh-CN" target="_blank">Zhengbang Zhu</a>  
<a href="https://scholar.google.com/citations?user=VhToj4wAAAAJ&hl=zh-CN" target="_blank">Lei Zhou</a>  
<br>
<a href="https://sg.linkedin.com/in/liu-chenchen" target="_blank">Chenchen Liu</a>  
<a href="" target="_blank">Dong Wang</a>  
<a href="https://scholar.google.com/citations?user=mfH9UFIAAAAJ&hl=en" target="_blank">Francis E. H. Tay</a>  
<a href="https://ch3cook-fdu.github.io/" target="_blank">Sijin Chen</a>  
<br>
<a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a>  
<a href="https://scholar.google.com/citations?user=i8wNtSgAAAAJ&hl=en" target="_blank">Yuxiao Liu</a><sup>*</sup><sup>†</sup>  
<a href="https://scholar.google.com/citations?user=laOWyTQAAAAJ&hl=zh-CN" target="_blank">Xinghang Li</a><sup>*</sup>  
<a href="https://panzhous.github.io/" target="_blank">Pan Zhou</a><sup>*</sup>  
<br>
<p style="text-align: center; margin-bottom: 0;">
<span class="author-note"><sup>*</sup>Corresponding Author</span> 
<span class="author-note"><sup>†</sup>Project Lead</span>
</p>
<br>
<p style="text-align: center;">
ByteDance Seed <br>
National University of Singapore&nbsp;&nbsp;&nbsp;Nanyang Technological University <br>
Tsinghua University&nbsp;&nbsp;&nbsp;Singapore Management University</p>
</div>
</div>
## 📖 Introduction
Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. In this work, we introduce **FALCON (From Spatial to Actions)**, a novel paradigm that injects rich 3D spatial tokens into the action head of a VLA model, enabling robust spatial understanding and state-of-the-art performance across diverse manipulation tasks without disrupting vision-language alignment. See our paper [here](https://arxiv.org/abs/2510.17439).
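For intuition, here is a minimal, self-contained PyTorch sketch of the general idea: an action head that cross-attends from VLM features to tokens produced by a spatial foundation model. All class names, arguments, and shapes below are illustrative assumptions, not FALCON's actual implementation:

```python
# Illustrative sketch only: names and shapes are assumptions, not FALCON's real API.
import torch
import torch.nn as nn

class SpatiallyGroundedActionHead(nn.Module):
    """Action head that cross-attends from VLA features to 3D spatial tokens."""
    def __init__(self, dim=1024, action_dim=7, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_proj = nn.Linear(dim, action_dim)

    def forward(self, vla_tokens, spatial_tokens):
        # vla_tokens:     (B, N, dim) features from the VLM backbone
        # spatial_tokens: (B, M, dim) tokens from a spatial encoder (e.g. a VGGT-style model)
        fused, _ = self.cross_attn(query=vla_tokens, key=spatial_tokens, value=spatial_tokens)
        fused = self.norm(vla_tokens + fused)       # residual keeps vision-language alignment intact
        return self.action_proj(fused.mean(dim=1))  # pool and map to an action (e.g. pose + gripper)

head = SpatiallyGroundedActionHead()
action = head(torch.randn(1, 32, 1024), torch.randn(1, 196, 1024))
print(action.shape)  # torch.Size([1, 7])
```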
## 🤗 Model Zoo
We provide the model weights and config files for the following checkpoints used in our paper:
<table>
<tr>
<th>Model Name</th>
<th>VLA Model</th>
<th>Embodied Spatial Model</th>
<th>Note</th>
</tr>
<tr>
<td>FALCON-FC-CALVIN-ABC</td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/falcon-esm-fc-calvin-abc/ckpts">falcon-esm-fc-calvin-abc-pt</a></td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/esm">esm-1b</a></td>
<td>Fine-tuned on CALVIN ABC with RGB inputs to the ESM; Tab. 4 and 5.</td>
</tr>
<tr>
<td>FALCON-FC-CALVIN-ABC-WDepth</td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/falcon-esm-fc-calvin-abc-wdepth/ckpts">falcon-esm-fc-calvin-abc-wdepth-pt</a></td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/esm">esm-1b</a></td>
<td>Fine-tuned on CALVIN ABC with RGB-D inputs to the ESM; Tab. 5.</td>
</tr>
<tr>
<td>FALCON-3DPC-FC-CALVIN-ABC</td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/falcon-3dpc-fc-calvin-abc/ckpts">falcon-3dpc-fc-calvin-abc-pt</a></td>
<td><a href="https://github.com/YanjieZe/Improved-3D-Diffusion-Policy">improved DP3 encoder</a></td>
<td>Fine-tuned on CALVIN ABC with point-cloud inputs to the iDP3 encoder; Tab. 5, Kosmos-VLA <i>(w/ rgb-d)</i> row.</td>
</tr>
<tr>
<td>FALCON-LSTM-CALVIN-ABC</td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/falcon-esm-lstm-calvin-abc/ckpts">falcon-lstm-calvin-abc-pt</a></td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/esm">esm-1b</a></td>
<td>Fine-tuned on CALVIN ABC with RGB inputs to the ESM; Tab. 1.</td>
</tr>
<tr>
<td>FALCON-LSTM-CALVIN-ABCD</td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/falcon-esm-lstm-calvin-abcd/ckpts">falcon-lstm-calvin-abcd-pt</a></td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/esm">esm-1b</a></td>
<td>Fine-tuned on CALVIN ABCD with RGB inputs to the ESM; Tab. 1.</td>
</tr>
<tr>
<td>FALCON-FC-SimplerEnv-Bridge</td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/falcon-esm-fc-simpler-bridge/ckpts">falcon-fc-simpler-bridge-pt</a></td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/esm">esm-1b</a></td>
<td>Pre-trained on OXE, then fine-tuned on the Bridge dataset with RGB inputs to the ESM; Tab. 2.</td>
</tr>
<tr>
<td>FALCON-FC-SimplerEnv-Fractal</td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/falcon-esm-fc-simpler-gr/ckpts">falcon-fc-simpler-fractal-pt</a></td>
<td><a href="https://huggingface.co/FALCON-VLA/FALCON-series/tree/main/esm">esm-1b</a></td>
<td>Pre-trained on OXE, then fine-tuned on the Fractal dataset with RGB inputs to the ESM; Tab. 3.</td>
</tr>
</table>
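To fetch a checkpoint together with the shared ESM weights, one option is `huggingface_hub.snapshot_download`; a minimal sketch (the subfolder patterns mirror the links in the table above, and the exact local layout is an assumption):

```python
from huggingface_hub import snapshot_download

# Download one VLA checkpoint plus the shared ESM weights from the FALCON-series repo.
local_dir = snapshot_download(
    repo_id="FALCON-VLA/FALCON-series",
    allow_patterns=["falcon-esm-fc-calvin-abc/*", "esm/*"],
)
print(local_dir)  # path to the downloaded snapshot
```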
## 📦 Usage
FALCON predicts actions from vision and language inputs. It supports several VLA structures, multi-view input, and multi-sensory input (RGB, RGB-D, point cloud). Taking `FALCON-FC-CALVIN-ABC` as an example:
```python
import torch
import json, functools, copy
from PIL import Image

from falcon.train.base_trainer import BaseTrainer
from falcon.data.data_utils import preprocess_image, get_text_function
from falcon.model.policy_head.esm_utils.vggt.utils.load_fn import load_and_preprocess_images_square_new

# Load the config and restore the pretrained checkpoint.
configs = json.load(open('configs/falcon-esm-fc-calvin-abc.json', 'r'))
pretrained_path = 'checkpoints/falcon-esm-fc-calvin-abc-pt'
configs['model_load_path'] = pretrained_path
model = BaseTrainer.from_checkpoint(configs)

# Preprocessing functions for images and language instructions.
image_fn = functools.partial(
    preprocess_image,
    image_processor=model.model.image_processor,
    model_type=configs["model"],
)
text_fn = get_text_function(model.model.tokenizer, configs["model"])

prompt = "Task: pull the handle to open the drawer"
text_tensor, attention_mask = text_fn([prompt])

MAX_STEPS = 100  # rollout horizon; adjust to your task
for step in range(MAX_STEPS):
    image: Image.Image = get_from_side_camera(...)  # replace with your camera interface

    # Get inputs for the ESM.
    image_vggt = copy.deepcopy(image)
    image = image_fn([image]).unsqueeze(0)
    esm_target_size = 224
    image_vggt_x, _ = load_and_preprocess_images_square_new([image_vggt], target_size=esm_target_size)
    image_vggt_x = image_vggt_x.unsqueeze(0)

    input_dict = {}
    input_dict["rgb"] = image
    input_dict["text"] = text_tensor
    input_dict["text_mask"] = attention_mask
    input_dict["rgb_vggt"] = image_vggt_x

    # If a wrist camera is available:
    wrist_image: Image.Image = get_from_wrist_camera(...)
    wrist_image = image_fn([wrist_image]).unsqueeze(0)
    input_dict["hand_rgb"] = wrist_image

    with torch.no_grad():
        action = model.inference_step(input_dict)["action"]
    print(action)
```
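For the RGB-D and point-cloud variants listed in the model zoo, the extra sensor stream is passed through `input_dict` alongside the RGB tensors. A minimal sketch for the depth case follows; the key name and tensor layout here are assumptions, so check the `falcon-esm-fc-calvin-abc-wdepth` config for the exact interface:

```python
# Hypothetical sketch for the RGB-D variant (FALCON-FC-CALVIN-ABC-WDepth); the
# input key "depth" is an assumption -- consult the wdepth config for the real one.
import numpy as np

depth: np.ndarray = get_from_depth_camera(...)  # placeholder camera call, (H, W) depth map
input_dict["depth"] = torch.from_numpy(depth)[None, None]  # add batch/view dims, mirroring "rgb"
```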
## 🤔 FAQs
If you encounter any issues, feel free to open an issue or reach out through discussions. We appreciate your feedback and contributions! 🚀
## 🖋️ Citation
If you find this project useful in your research, please consider citing:
```BibTeX
@article{zhang2025spatial,
title={From spatial to actions: Grounding vision-language-action model in spatial foundation priors},
author={Zhang, Zhengshen and Li, Hao and Dai, Yalun and Zhu, Zhengbang and Zhou, Lei and Liu, Chenchen and Wang, Dong and Tay, Francis EH and Chen, Sijin and Liu, Ziwei and others},
journal={arXiv preprint arXiv:2510.17439},
year={2025}
}
```
## 🪪 License
All FALCON checkpoints, as well as our [codebase](https://github.com/FALCON-VLA/FALCON), are released under the Apache-2.0 License.
## ❤️ Acknowledgement
FALCON builds on code from the following projects: [RoboVLMs](https://github.com/Robot-VLAs/RoboVLMs/tree/main?tab=readme-ov-file), [Microsoft Kosmos-2](https://github.com/microsoft/unilm/tree/master/kosmos-2), [VGGT](https://github.com/facebookresearch/vggt), and [ManiUniCon](https://github.com/Universal-Control/ManiUniCon). Thanks for their awesome work!