---
tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
pipeline_tag: robotics
library_name: lerobot
license: unknown
---

# Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

This model is part of the work presented in the paper [Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers](https://arxiv.org/abs/2507.15833).

*A hero animation of the system is available on the project website.*

- **Project Website:** https://ian-chuang.github.io/gaze-av-aloha/
- **Code (GitHub):** https://github.com/ian-chuang/gaze-av-aloha

## About the Model

Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. This work explores how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance.

This model is a component of an Active Vision robot system that emulates human head movement and eye tracking. It builds on recent advances in foveated image processing and integrates gaze information into Vision Transformers (ViTs) using a foveated patch tokenization scheme. This approach significantly reduces computational overhead (by 94%) and accelerates training (7x) and inference (3x) without sacrificing visual fidelity near regions of interest.
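To illustrate the idea behind foveated patch tokenization (this is an illustrative sketch, not the paper's exact scheme; the fovea size, downsampling factor, and patch layout below are assumptions), a tokenizer can keep full resolution in a small window around the gaze point while coarsely sampling the periphery, producing far fewer tokens than uniform ViT patching:

```python
import torch

def foveated_tokens(img, gaze_xy, fovea=64, patch=16):
    """Illustrative foveated tokenization: full-resolution patches near the
    gaze point plus coarse patches from a 4x-downsampled periphery."""
    C, H, W = img.shape
    gx, gy = gaze_xy
    # High-resolution fovea crop centered on the gaze point (clamped to bounds).
    x0 = max(0, min(W - fovea, gx - fovea // 2))
    y0 = max(0, min(H - fovea, gy - fovea // 2))
    fovea_crop = img[:, y0:y0 + fovea, x0:x0 + fovea]
    # Periphery: the whole image downsampled 4x, then patched coarsely.
    periphery = torch.nn.functional.interpolate(
        img[None], scale_factor=0.25, mode="bilinear", align_corners=False)[0]

    def to_patches(t):
        c, h, w = t.shape
        return (t.unfold(1, patch, patch).unfold(2, patch, patch)
                 .reshape(c, -1, patch, patch).permute(1, 0, 2, 3))

    # Concatenate fovea and periphery tokens into one sequence.
    return torch.cat([to_patches(fovea_crop), to_patches(periphery)])

img = torch.rand(3, 256, 256)
tokens = foveated_tokens(img, (128, 128))
# Uniform 16x16 patching of a 256x256 image yields 256 tokens;
# this sketch yields 16 fovea + 16 periphery = 32 tokens.
```

The token savings scale with image size, since the periphery cost is fixed regardless of where the gaze lands.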

The framework supports simultaneously collecting eye-tracking data and robot demonstrations from a human operator, and provides a simulation benchmark and dataset for training robot policies that incorporate human gaze. The findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems.

This model has been pushed to the Hub using the `PyTorchModelHubMixin` integration. It may be either a pretrained vision encoder (e.g., MAE-pretrained foveated ViT weights) or a task-specific gaze model, as detailed in the project repository.

## Sample Usage

To utilize this model as part of the "Look, Focus, Act" framework, you generally need to set up the environment and run training or evaluation scripts provided in the official GitHub repository. The project uses the LeRobot library for dataset handling and policy training.

For detailed installation instructions, dataset preparation, and policy training, please refer to the official GitHub repository.

Here is an example of how a foveated Vision Transformer policy can be trained with a pretrained gaze model (which may be what this repository contains, if it is a specific gaze model instance):

```bash
# Example from the project's train.py script for Fov-UNet (two-stage with pretrained gaze model)
python gaze_av_aloha/scripts/train.py \
  policy=foveated_vit_policy \
  task=<task e.g. av_aloha_sim_thread_needle> \
  policy.use_gaze_as_action=false \
  policy.gaze_model_repo_id=iantc104/gaze_model_av_aloha_sim_thread_needle \
  policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
  policy.optimizer_lr_backbone=1e-5 \
  wandb.enable=true \
  wandb.project=<your_project_name> \
  wandb.entity=<your_wandb_entity> \
  wandb.job_name=fov-unet \
  device=cuda
```

Replace `<your_project_name>` and `<your_wandb_entity>` with your Weights & Biases credentials, and `<task e.g. av_aloha_sim_thread_needle>` with the specific task you are working on.

## Citation

If you find this work useful for your research, please consider citing the original paper:

```bibtex
@misc{chuang2025lookfocusactefficient,
      title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
      author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
      year={2025},
      eprint={2507.15833},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.15833},
}
```