---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
pipeline_tag: robotics
library_name: lerobot
license: unknown
---
# Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

This model is part of the work presented in the paper [Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers](https://arxiv.org/abs/2507.15833).

**Project Website:** https://ian-chuang.github.io/gaze-av-aloha/
**Code (GitHub):** https://github.com/ian-chuang/gaze-av-aloha
## About the Model
Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. This work explores how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance.
This model is a component of an Active Vision robot system that emulates human head movement and eye tracking. It builds on recent advances in foveated image processing and integrates gaze information into Vision Transformers (ViTs) through a foveated patch tokenization scheme. This approach significantly reduces computational overhead (by 94%) and accelerates training (7x) and inference (3x) without sacrificing visual fidelity near regions of interest.
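To illustrate the general idea behind foveated tokenization (a simplified sketch, not the paper's actual implementation), the tokenizer below keeps full-resolution patches in a small crop around the gaze point and coarse, average-pooled patches for the periphery, so the ViT sees far fewer tokens than with uniform patchification. All sizes and function names here are hypothetical:

```python
import numpy as np

PATCH = 16  # ViT patch size

def patchify(img: np.ndarray) -> np.ndarray:
    """Split an (H, W, C) image into flattened PATCH x PATCH tokens."""
    h, w, c = img.shape
    tokens = img.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    return tokens.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * c)

def foveated_tokens(img: np.ndarray, gaze_xy, fovea: int = 64, coarse: int = 64):
    """Full-res tokens from a fovea crop around the gaze point, plus
    coarse tokens from an average-pooled view of the whole image."""
    h, w, c = img.shape
    gx = int(np.clip(gaze_xy[0], fovea // 2, w - fovea // 2))
    gy = int(np.clip(gaze_xy[1], fovea // 2, h - fovea // 2))
    crop = img[gy - fovea // 2 : gy + fovea // 2,
               gx - fovea // 2 : gx + fovea // 2]
    # simple average-pool downsample for the periphery (a stand-in for
    # a proper multi-scale / log-polar foveation transform)
    periphery = img.reshape(coarse, h // coarse, coarse, w // coarse, c).mean(axis=(1, 3))
    return np.concatenate([patchify(crop), patchify(periphery)])

img = np.zeros((256, 256, 3), dtype=np.float32)
full = patchify(img)                        # 256 tokens at 256x256, patch 16
fov = foveated_tokens(img, gaze_xy=(120, 100))
print(full.shape[0], fov.shape[0])          # 256 full-image tokens vs 32 foveated tokens
```

Even this toy version cuts the token count from 256 to 32; the paper's foveated ViT applies the same principle with a more careful peripheral transform to reach its reported savings.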
The framework supports simultaneously collecting eye-tracking data and robot demonstrations from a human operator, and provides a simulation benchmark and dataset for training robot policies that incorporate human gaze. The findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems.
This model has been pushed to the Hub using the `PyTorchModelHubMixin` integration. It can be a pretrained vision encoder (such as MAE-pretrained ViT weights) or a task-specific gaze model, as detailed in the project's repository.
## Sample Usage
To utilize this model as part of the "Look, Focus, Act" framework, you generally need to set up the environment and run training or evaluation scripts provided in the official GitHub repository. The project uses the LeRobot library for dataset handling and policy training.
For detailed installation instructions, dataset preparation, and policy training, please refer to the official GitHub repository.
Here is an example of how a foveated Vision Transformer policy can be trained with a pretrained gaze model (which this repository may provide, if it is a gaze-model instance):
```shell
# Example from the project's train.py script for Fov-UNet (two-stage with pretrained gaze model)
python gaze_av_aloha/scripts/train.py \
    policy=foveated_vit_policy \
    task=<task e.g. av_aloha_sim_thread_needle> \
    policy.use_gaze_as_action=false \
    policy.gaze_model_repo_id=iantc104/gaze_model_av_aloha_sim_thread_needle \
    policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
    policy.optimizer_lr_backbone=1e-5 \
    wandb.enable=true \
    wandb.project=<your_project_name> \
    wandb.entity=<your_wandb_entity> \
    wandb.job_name=fov-unet \
    device=cuda
```
Replace `<your_project_name>` and `<your_wandb_entity>` with your Weights & Biases credentials, and `<task e.g. av_aloha_sim_thread_needle>` with the specific task you are working on.
## Citation
If you find this work useful for your research, please consider citing the original paper:
```bibtex
@misc{chuang2025lookfocusactefficient,
  title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
  author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
  year={2025},
  eprint={2507.15833},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2507.15833},
}
```