---
license: apache-2.0
tags:
- vla
- iclr
- iclr-2026
- vision-language-action
- spatial-understanding
- generalist-robot-policies
---
## 📦 Usage
FALCON predicts actions from vision and language inputs. It supports several VLA structures, multi-view input, and multi-sensory input (RGB, RGB-D, point cloud). Taking `FALCON-FC-CALVIN-ABC` as an example:
```python
import torch
import json, functools, copy
from PIL import Image
from falcon.train.base_trainer import BaseTrainer
from falcon.data.data_utils import preprocess_image, get_text_function
from falcon.model.policy_head.esm_utils.vggt.utils.load_fn import load_and_preprocess_images_square_new
configs = json.load(open('configs/falcon-esm-fc-calvin-abc.json', 'r'))
pretrained_path = 'checkpoints/falcon-esm-fc-calvin-abc-pt'
configs['model_load_path'] = pretrained_path
model = BaseTrainer.from_checkpoint(configs)
image_fn = functools.partial(
    preprocess_image,
    image_processor=model.model.image_processor,
    model_type=configs["model"],
)
text_fn = get_text_function(model.model.tokenizer, configs["model"])
prompt = "Task: pull the handle to open the drawer"
text_tensor, attention_mask = text_fn([prompt])
MAX_STEPS = 360  # rollout horizon; set to match your environment
for step in range(MAX_STEPS):
    image: Image.Image = get_from_side_camera(...)
    # get inputs for esm
    image_vggt = copy.deepcopy(image)
    image = image_fn([image]).unsqueeze(0)
    esm_target_size = 224
    image_vggt_x, _ = load_and_preprocess_images_square_new([image_vggt], target_size=esm_target_size)
    image_vggt_x = image_vggt_x.unsqueeze(0)
    input_dict = {}
    input_dict["rgb"] = image
    input_dict["text"] = text_tensor
    input_dict["text_mask"] = attention_mask
    input_dict["rgb_vggt"] = image_vggt_x
    # if a wrist camera is available
    wrist_image: Image.Image = get_from_wrist_camera(...)
    wrist_image = image_fn([wrist_image]).unsqueeze(0)
    input_dict["hand_rgb"] = wrist_image
    with torch.no_grad():
        action = model.inference_step(input_dict)["action"]
    print(action)
```
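The returned `action` tensor can then be mapped to robot commands. Below is a minimal sketch assuming a 7-DoF delta end-effector action with a binary gripper, as is common in CALVIN setups; the exact action layout and scaling depend on your checkpoint and environment, so treat this as illustrative only:

```python
import torch

# Hypothetical post-processing of FALCON's predicted action. The 7-DoF layout
# (xyz translation delta, euler rotation delta, gripper open/close) is an
# assumption for illustration, not the official FALCON output specification.
action = torch.tensor([[0.02, -0.01, 0.03, 0.0, 0.0, 0.1, 1.0]])  # dummy prediction

arm_cmd = action[0, :6].clamp(-1.0, 1.0).tolist()   # continuous arm command, clipped to [-1, 1]
gripper_cmd = 1 if action[0, 6].item() > 0 else -1  # binarized gripper command
print(arm_cmd, gripper_cmd)
```

The clipping and gripper binarization mirror typical simulator interfaces; replace them with whatever post-processing your controller expects.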
## 🤗 FAQs
If you encounter any issues, feel free to open an issue or reach out through discussions. We appreciate your feedback and contributions! 🚀
## 🖊️ Citation
If you find this project useful in your research, please consider citing:
```BibTeX
@article{zhang2025spatial,
title={From spatial to actions: Grounding vision-language-action model in spatial foundation priors},
author={Zhang, Zhengshen and Li, Hao and Dai, Yalun and Zhu, Zhengbang and Zhou, Lei and Liu, Chenchen and Wang, Dong and Tay, Francis EH and Chen, Sijin and Liu, Ziwei and others},
journal={arXiv preprint arXiv:2510.17439},
year={2025}
}
```
## 🪪 License
All FALCON checkpoints, as well as our [codebase](https://github.com/FALCON-VLA/FALCON), are released under the Apache-2.0 License.
## ❤️ Acknowledgement
FALCON is built with reference to the code of the following projects: [RoboVLMs](https://github.com/Robot-VLAs/RoboVLMs/tree/main?tab=readme-ov-file), [Microsoft Kosmos-2](https://github.com/microsoft/unilm/tree/master/kosmos-2), [VGGT](https://github.com/facebookresearch/vggt), and [ManiUniCon](https://github.com/Universal-Control/ManiUniCon). Thanks for their awesome work!