---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- robotics
- vision-transformer
- robot-learning
pipeline_tag: robotics
library_name: lerobot
license: mit
---
# Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

This repository contains the official code and models for the paper *Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers*.

🌐 **Project Website:** https://ian-chuang.github.io/gaze-av-aloha/
💻 **Official Codebase:** https://github.com/ian-chuang/gaze-av-aloha
## Abstract
Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. We build on recent advances in foveated image processing and apply them to an Active Vision robot system that emulates both human head movement and eye tracking. Extending prior work on the AV-ALOHA robot simulation platform, we introduce a framework for simultaneously collecting eye-tracking data and robot demonstrations from a human operator, as well as a simulation benchmark and dataset for training robot policies that incorporate human gaze. Given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme inspired by recent work in image segmentation. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation, without sacrificing visual fidelity near regions of interest. We also explore two approaches to gaze imitation and prediction from human data. The first is a two-stage model that predicts gaze to guide foveation and action; the second integrates gaze into the action space, allowing the policy to jointly predict gaze and actions end-to-end. Our results show that our method for foveated robot vision not only drastically reduces computational overhead, but also improves performance for high-precision tasks and robustness to unseen distractors. Together, these findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems.
## ✨ Key Features
- **Human-inspired Foveated Vision:** Integrates human gaze to guide foveated patch tokenization in Vision Transformers (ViTs), significantly reducing computation.
- **Efficiency & Robustness:** Substantially reduces computational overhead while improving performance on high-precision tasks and robustness to unseen distractors.
- **Comprehensive Dataset:** Introduces a simulation benchmark and dataset for training robot policies that incorporate human gaze, collected as bimanual robot demonstrations with synchronized human eye-tracking on the AV-ALOHA simulation platform.
- **Gaze Imitation Approaches:** Explores both two-stage gaze prediction and end-to-end joint prediction of gaze and actions.
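The token savings from foveated patch tokenization can be illustrated with a toy calculation. This is a simplified sketch, not the paper's actual tokenization scheme: it assumes full-resolution patches inside a square fovea around the gaze point and 2x-coarser patches in the periphery, and the function names and sizes below are purely illustrative.

```python
def uniform_token_count(image_size: int, patch_size: int) -> int:
    """Standard ViT tokenization: one token per fixed-size patch."""
    return (image_size // patch_size) ** 2


def foveated_token_count(image_size: int, patch_size: int, fovea_size: int) -> int:
    """Toy foveation: full-resolution patches inside a square fovea,
    plus coarser (2x-larger) patches covering the peripheral area."""
    fovea_tokens = (fovea_size // patch_size) ** 2
    peripheral_area = image_size ** 2 - fovea_size ** 2
    peripheral_tokens = peripheral_area // (2 * patch_size) ** 2
    return fovea_tokens + peripheral_tokens


# A 224x224 image with 16x16 patches and a 96x96 fovea:
print(uniform_token_count(224, 16))       # 196 tokens
print(foveated_token_count(224, 16, 96))  # 76 tokens
```

Even this crude two-level scheme cuts the token count by more than half; since self-attention cost grows quadratically with token count, the compute savings are larger still.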
## ⚙️ Installation

To set up the environment and install the dependencies for the gaze-av-aloha project, which uses this model:
```bash
# Clone the repository and initialize submodules
git clone https://github.com/ian-chuang/gaze-av-aloha.git
cd gaze-av-aloha
git submodule init
git submodule update

# Create and activate a new Conda environment
conda create -n gaze python=3.10
conda activate gaze

# Install LeRobot (primary library for data handling)
pip install git+https://github.com/huggingface/lerobot.git@483be9aac217c2d8ef16982490f22b2ad091ab46

# Install FFmpeg for video logging
conda install ffmpeg=7.1.1 -c conda-forge

# Install AV-ALOHA packages
pip install -e ./gym_av_aloha
pip install -e ./gaze_av_aloha
```
Make sure you are logged in to the Hugging Face CLI: `huggingface-cli login`
## 🚀 Usage

This model checkpoint (`iantc104/mae_vitb_foveated_vit`) is a pretrained Vision Transformer backbone used as a component of the *Look, Focus, Act* framework. You can train and evaluate policies using the scripts provided in the original repository.
### Example: Train a Fov-Act Policy (end-to-end gaze as action)

To train a policy that uses this foveated ViT as its vision encoder, run the `train.py` script from the gaze-av-aloha repository:
```bash
python gaze_av_aloha/scripts/train.py \
    policy=foveated_vit_policy \
    task=<task_name_e.g._av_aloha_sim_thread_needle> \
    policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
    policy.optimizer_lr_backbone=1e-5 \
    wandb.enable=true \
    wandb.project=<your_project_name> \
    wandb.entity=<your_wandb_entity> \
    wandb.job_name=fov-act \
    device=cuda
```
Replace `<task_name_e.g._av_aloha_sim_thread_needle>` with one of the available task configurations from the project (e.g., `av_aloha_sim_cube_transfer`, `av_aloha_sim_peg_insertion`, etc.). This command loads `iantc104/mae_vitb_foveated_vit` as the vision encoder for the `foveated_vit_policy`.
For other policy types (Fov-UNet, Fine, Coarse) and detailed instructions, please refer to the AV ALOHA Benchmark section in the official GitHub repository.
## 📜 Citation

If you find this work or model useful in your research, please consider citing the paper:
```bibtex
@misc{chuang2025lookfocusactefficient,
  title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
  author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
  year={2025},
  eprint={2507.15833},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2507.15833},
}
```
