---
license: apache-2.0
tags:
- vla
- iclr
- iclr-2026
- vision-language-action
- spatial-understanding
- generalist-robot-policies
---
# *FALCON* | From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors (ICLR 2026)

Python 3.8 | PyTorch

Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou,
Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen,
Ziwei Liu, Yuxiao Liu*, Xinghang Li*, Pan Zhou*

*Corresponding Author   Project Lead


ByteDance Seed
National University of Singapore   Nanyang Technological University
Tsinghua University   Singapore Management University

## 🚀 Introduction

Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. In this work, we introduce **FALCON (From Spatial to Action)**, a novel paradigm that injects rich 3D spatial tokens into the action head of a VLA model, enabling robust spatial understanding and SOTA performance across diverse manipulation tasks without disrupting vision-language alignment. See our paper [here](https://arxiv.org/abs/2510.17439).

## 🤗 Model Zoo

We provide the following model weights and the config files used in our paper:

| Model Name | VLA Model | Embodied Spatial Model | Note |
| --- | --- | --- | --- |
| FALCON-FC-CALVIN-ABC | falcon-esm-fc-calvin-abc-pt | esm-1b | finetune on calvin-abc with RGB inputs to ESM, Tab. 4 and 5. |
| FALCON-FC-CALVIN-ABC-WDepth | falcon-esm-fc-calvin-abc-wdepth-pt | esm-1b | finetune on calvin-abc with RGB-D inputs to ESM, Tab. 5. |
| FALCON-3DPC-FC-CALVIN-ABC | falcon-3dpc-fc-calvin-abc-pt | improved DP3 encoder | finetune on calvin-abc with point cloud inputs to idp3 encoder, Tab. 5-Kosmos-VLA (w/ rgb-d). |
| FALCON-LSTM-CALVIN-ABC | falcon-lstm-calvin-abc-pt | esm-1b | finetune on calvin-abc with RGB inputs to ESM, Tab. 1. |
| FALCON-LSTM-CALVIN-ABCD | falcon-lstm-calvin-abcd-pt | esm-1b | finetune on calvin-abcd with RGB inputs to ESM, Tab. 1. |
| FALCON-FC-SimplerEnv-Bridge | falcon-fc-simpler-bridge-pt | esm-1b | pretrained on oxe then finetune on bridge dataset with RGB inputs to ESM, Tab. 2. |
| FALCON-FC-SimplerEnv-Fractal | falcon-fc-simpler-fractal-pt | esm-1b | pretrained on oxe then finetune on fractal dataset with RGB inputs to ESM, Tab. 3. |

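
When scripting evaluations over several of the checkpoints listed above, it can help to keep the table as a small lookup. The sketch below is only an illustration: the names `MODEL_ZOO` and `checkpoint_dir` are hypothetical and not part of the FALCON codebase, and the `checkpoints/` root is an assumption that mirrors the path layout of the usage example in this README.

```python
from pathlib import Path

# Model name -> (VLA checkpoint name, embodied spatial model), copied from the table.
MODEL_ZOO = {
    "FALCON-FC-CALVIN-ABC": ("falcon-esm-fc-calvin-abc-pt", "esm-1b"),
    "FALCON-FC-CALVIN-ABC-WDepth": ("falcon-esm-fc-calvin-abc-wdepth-pt", "esm-1b"),
    "FALCON-3DPC-FC-CALVIN-ABC": ("falcon-3dpc-fc-calvin-abc-pt", "improved DP3 encoder"),
    "FALCON-LSTM-CALVIN-ABC": ("falcon-lstm-calvin-abc-pt", "esm-1b"),
    "FALCON-LSTM-CALVIN-ABCD": ("falcon-lstm-calvin-abcd-pt", "esm-1b"),
    "FALCON-FC-SimplerEnv-Bridge": ("falcon-fc-simpler-bridge-pt", "esm-1b"),
    "FALCON-FC-SimplerEnv-Fractal": ("falcon-fc-simpler-fractal-pt", "esm-1b"),
}


def checkpoint_dir(model_name, root="checkpoints"):
    """Return the assumed local directory of a checkpoint (hypothetical helper)."""
    if model_name not in MODEL_ZOO:
        raise KeyError(f"unknown model {model_name!r}; choose from {sorted(MODEL_ZOO)}")
    vla_model, _spatial_model = MODEL_ZOO[model_name]
    # Assumption: checkpoints are stored under <root>/<VLA model name>.
    return Path(root) / vla_model
```

For example, `checkpoint_dir("FALCON-FC-CALVIN-ABC")` yields the same relative path that the usage example assigns to `pretrained_path`.
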
## 📦 Usage

FALCON predicts actions from vision and language inputs. It supports several VLA structures, multi-view input, and multi-sensory input (RGB, RGB-D, point cloud). Taking `FALCON-FC-CALVIN-ABC` as an example:

```python
import copy
import functools
import json

import torch
from PIL import Image

from falcon.train.base_trainer import BaseTrainer
from falcon.data.data_utils import preprocess_image, get_text_function
from falcon.model.policy_head.esm_utils.vggt.utils.load_fn import load_and_preprocess_images_square_new

# Load the config and point it at the downloaded checkpoint.
with open('configs/falcon-esm-fc-calvin-abc.json', 'r') as f:
    configs = json.load(f)
pretrained_path = 'checkpoints/falcon-esm-fc-calvin-abc-pt'
configs['model_load_path'] = pretrained_path

model = BaseTrainer.from_checkpoint(configs)

# Build the image and text preprocessing functions for the VLA backbone.
image_fn = functools.partial(
    preprocess_image,
    image_processor=model.model.image_processor,
    model_type=configs["model"],
)
text_fn = get_text_function(model.model.tokenizer, configs["model"])

prompt = "Task: pull the handle to open the drawer"
text_tensor, attention_mask = text_fn([prompt])

# MAX_STEPS, get_from_side_camera, and get_from_wrist_camera are placeholders
# that you must supply for your own environment.
for step in range(MAX_STEPS):
    input_dict = {}
    image: Image.Image = get_from_side_camera(...)

    # get inputs for esm
    image_vggt = copy.deepcopy(image)
    image = image_fn([image]).unsqueeze(0)
    esm_target_size = 224
    image_vggt_x, _ = load_and_preprocess_images_square_new([image_vggt], target_size=esm_target_size)
    image_vggt_x = image_vggt_x.unsqueeze(0)

    input_dict["rgb"] = image
    input_dict["text"] = text_tensor
    input_dict["text_mask"] = attention_mask
    input_dict["rgb_vggt"] = image_vggt_x

    ### if wrist camera is available
    wrist_image: Image.Image = get_from_wrist_camera(...)
    wrist_image = image_fn([wrist_image]).unsqueeze(0)
    input_dict["hand_rgb"] = wrist_image

    with torch.no_grad():
        action = model.inference_step(input_dict)["action"]
    print(action)
```

## 🤗 FAQs

If you encounter any issues, feel free to open an issue or reach out through discussions. We appreciate your feedback and contributions!
🚀

## 🖊️ Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{zhang2025spatial,
  title={From spatial to actions: Grounding vision-language-action model in spatial foundation priors},
  author={Zhang, Zhengshen and Li, Hao and Dai, Yalun and Zhu, Zhengbang and Zhou, Lei and Liu, Chenchen and Wang, Dong and Tay, Francis EH and Chen, Sijin and Liu, Ziwei and others},
  journal={arXiv preprint arXiv:2510.17439},
  year={2025}
}
```

## 🪪 License

All FALCON checkpoints, as well as our [codebase](https://github.com/FALCON-VLA/FALCON), are released under the Apache-2.0 License.

## ❤️ Acknowledgement

FALCON is built with reference to the code of the following projects: [RoboVLMs](https://github.com/Robot-VLAs/RoboVLMs/tree/main?tab=readme-ov-file), [Microsoft Kosmos-2](https://github.com/microsoft/unilm/tree/master/kosmos-2), [VGGT](https://github.com/facebookresearch/vggt), and [ManiUniCon](https://github.com/Universal-Control/ManiUniCon). Thanks for their awesome work!