---
library_name: transformers
license: mit
pipeline_tag: robotics
tags:
  - vision-language-model
  - manipulation
  - robotics
---

🏁 Best viewed with sound on

# F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

[Paper](https://arxiv.org/abs/2509.06951) | [Code](https://github.com/InternRobotics/F1-VLA) | Website

## Abstract

Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, often leading to short-sighted behaviors and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework that integrates visual foresight generation into the decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate that F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.

## 🚀 Key Innovations

- 🧠 **Predictive Inverse Dynamics**: Visual foresight generation for planning-based control
- 🏗️ **Mixture-of-Transformer**: Three specialized experts (Understanding, Generation, Action)
- 📈 **Three-Stage Training**: Progressive alignment, pretraining, and adaptation
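The three-expert split can be pictured as modality-based routing: each token type is handled by its own specialist module. The sketch below is a hypothetical illustration only, not the actual F1 architecture; the expert functions, tags, and payloads are invented for clarity.

```python
# Minimal sketch of mixture-of-transformer-style routing (hypothetical;
# the real F1 experts are transformer modules, not scalar functions).
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, float]  # (modality tag, payload)

def understanding_expert(x: float) -> float:
    # e.g. encode perception/language tokens
    return x + 1.0

def generation_expert(x: float) -> float:
    # e.g. synthesize visual-foresight tokens
    return x * 2.0

def action_expert(x: float) -> float:
    # e.g. decode action tokens
    return x - 0.5

EXPERTS: Dict[str, Callable[[float], float]] = {
    "understanding": understanding_expert,
    "generation": generation_expert,
    "action": action_expert,
}

def route(tokens: List[Token]) -> List[float]:
    """Dispatch each token to the expert matching its modality tag."""
    return [EXPERTS[tag](x) for tag, x in tokens]

outputs = route([("understanding", 1.0), ("generation", 3.0), ("action", 2.0)])
print(outputs)  # [2.0, 6.0, 1.5]
```

In the paper's design, the three experts share one pipeline so that foresight tokens produced by the generation module can condition the action module, turning control into foresight-guided inverse dynamics.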

## 🤖 Real-World Robot Experiments

Diverse manipulation tasks across multiple robot platforms.

### 📊 Performance Summary

| Task         | Platform    | F1    | π0    | Improvement |
|--------------|-------------|-------|-------|-------------|
| Multi-task   | Genie-1     | 82.2% | 65.2% | +17.0%      |
| Adaptation   | Franka      | 66.7% | 53.3% | +13.4%      |
| Long-horizon | ARX LIFT II | 40.0% | 0.0%  | +40.0%      |
| Dynamic Env  | ARX LIFT II | 66.7% | 33.3% | +33.4%      |

## Usage

### Prerequisites

- Python ≥ 3.10
- torch ≥ 2.6.0
- CUDA ≥ 12.4
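A quick way to sanity-check these minimums before installing is a small version-comparison helper. This is a hypothetical snippet, not part of the F1-VLA codebase; adjust the checks to your environment.

```python
# Hypothetical prerequisite check; `meets_minimum` compares dotted version
# strings numerically (so "2.10.0" correctly exceeds "2.6.0").
import sys

def meets_minimum(version: str, minimum: str) -> bool:
    """Return True if `version` >= `minimum`, comparing numeric components."""
    parse = lambda v: tuple(int(p) for p in v.split("+")[0].split(".")[:3])
    return parse(version) >= parse(minimum)

print("Python >= 3.10:", sys.version_info >= (3, 10))

try:
    import torch  # only checked if already installed
    print("torch >= 2.6.0:", meets_minimum(torch.__version__, "2.6.0"))
    if torch.version.cuda:  # CUDA builds expose the toolkit version here
        print("CUDA >= 12.4:", meets_minimum(torch.version.cuda, "12.4"))
except ImportError:
    print("torch not installed yet")
```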

### Installation

```bash
# Clone repository
git clone https://github.com/InternRobotics/F1-VLA.git
cd F1-VLA
export VLA_HOME=$(pwd)

# Create environment
conda create -n f1_vla python=3.10
conda activate f1_vla

# Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 torchcodec==0.2.1 --index-url https://download.pytorch.org/whl/cu124

# Install f1_vla
cd f1_vla
pip install -e .
pip install numpy==1.26.4
```

For optimal performance and compatibility, we highly recommend using FFmpeg alongside TorchCodec.

- **FFmpeg** is an industry-standard multimedia framework that provides robust, all-purpose video and audio processing.
- **TorchCodec** is a library specifically designed for deep learning workflows in PyTorch, offering highly optimized video I/O.

Together, these two tools substantially accelerate video dataset loading.
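Since TorchCodec relies on FFmpeg for decoding, it is worth confirming FFmpeg is on your PATH before loading video datasets. A small check (hypothetical snippet, not part of F1-VLA):

```python
# Check whether the ffmpeg binary is discoverable on PATH.
import shutil

has_ffmpeg = shutil.which("ffmpeg") is not None
print("FFmpeg found on PATH" if has_ffmpeg
      else "FFmpeg not found; install it before loading video datasets")
```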

### Download Pretrained Datasets and Models

| Name | Link |
|------|------|
| LIBERO_SPATIAL_NO_NOOPS_PATH | IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot |
| STAGE2_CKPT_PATH | F1_pretrain |
| LEROBOT_PI0_PATH | lerobot/pi0 |
| PALIGEMMA_PATH | google/paligemma-3b-pt-224 |
| VAE_PATH | var_d16.pth |

### Finetune

```bash
# 1. Edit the config file
vim f1_vla/config/debug_test.yaml

# 2. Run the program
cd $VLA_HOME
python train_hf.py --config-file f1_vla/config/debug_test.yaml
```

## 📚 Citation

If you find our work helpful, please cite:

```bibtex
@article{f1_vla_2025,
  title={F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions},
  author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Michael Yu Wang and Liqiang Nie and Jiangmiao Pang},
  eprint={2509.06951},
  archivePrefix={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2509.06951}
}
```

## License

This work is licensed under the MIT License.

## Acknowledgements

This repository is based on Lerobot, Any4lerobot, and VAR.