Enhance model card with metadata, links, abstract, and usage details

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +92 -4
README.md CHANGED
@@ -2,9 +2,97 @@
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
+ - robotics
+ - vision-transformer
+ - robot-learning
+ pipeline_tag: robotics
+ library_name: lerobot
+ license: mit
  ---
 
- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Code: [More Information Needed]
- - Paper: [More Information Needed]
- - Docs: [More Information Needed]
+ # Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers
+
+ This repository contains the official code and models for the paper:
+ **[Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers](https://huggingface.co/papers/2507.15833)**
+
+ πŸš€ **Project Website:** [https://ian-chuang.github.io/gaze-av-aloha/](https://ian-chuang.github.io/gaze-av-aloha/)
+ πŸ”— **Official Codebase:** [https://github.com/ian-chuang/gaze-av-aloha](https://github.com/ian-chuang/gaze-av-aloha)
+
+ ![hero](https://github.com/ian-chuang/gaze-av-aloha/raw/main/media/hero.gif)
+
+ ## Abstract
+ Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. We build on recent advances in foveated image processing and apply them to an Active Vision robot system that emulates both human head movement and eye tracking. Extending prior work on the AV-ALOHA robot simulation platform, we introduce a framework for simultaneously collecting eye-tracking data and robot demonstrations from a human operator, as well as a simulation benchmark and dataset for training robot policies that incorporate human gaze. Given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme inspired by recent work in image segmentation. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation, without sacrificing visual fidelity near regions of interest. We also explore two approaches to gaze imitation and prediction from human data. The first is a two-stage model that predicts gaze to guide foveation and action; the second integrates gaze into the action space, allowing the policy to jointly predict gaze and actions end-to-end. Our results show that our method for foveated robot vision not only drastically reduces computational overhead, but also improves performance on high-precision tasks and robustness to unseen distractors. Together, these findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems.
+
+ ## ✨ Key Features
+ * **Human-inspired Foveated Vision**: Integrates human gaze to guide foveated patch tokenization in Vision Transformers (ViTs), significantly reducing computation.
+ * **Efficiency & Robustness**: Substantially reduces computational overhead while improving performance on high-precision tasks and robustness to unseen distractors.
+ * **Comprehensive Dataset**: Provides a simulation benchmark and dataset for training gaze-aware robot policies, with bimanual robot demonstrations and synchronized human eye tracking collected on the AV-ALOHA simulation platform.
+ * **Gaze Imitation Approaches**: Explores both two-stage gaze prediction and end-to-end joint prediction of gaze and actions.
+
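The token savings from foveated patch tokenization can be made concrete with a small sketch. This is an illustrative two-level scheme, not the repository's actual tokenizer: the grid sizes, the fovea placement rule, and the `foveated_token_boxes` helper are all assumptions chosen for clarity.

```python
def foveated_token_boxes(img_size=224, coarse=32, fine=16, gaze=(112, 112)):
    """Illustrative two-level foveated tokenization (hypothetical scheme):
    fine patches inside a 2x2-coarse-cell fovea around the gaze point,
    coarse patches everywhere else. Returns (x, y, size) patch boxes."""
    n = img_size // coarse                       # coarse grid is n x n cells
    # Snap the fovea's top-left coarse cell so the 2x2 fovea stays in bounds.
    gx = min(max(gaze[0] // coarse - 1, 0), n - 2)
    gy = min(max(gaze[1] // coarse - 1, 0), n - 2)
    fovea = {(gx + i, gy + j) for i in (0, 1) for j in (0, 1)}
    boxes = []
    # Coarse patches cover every cell outside the fovea.
    for cx in range(n):
        for cy in range(n):
            if (cx, cy) not in fovea:
                boxes.append((cx * coarse, cy * coarse, coarse))
    # Fine patches tile the fovea cells at full resolution.
    k = coarse // fine                           # fine patches per coarse cell side
    for cx, cy in fovea:
        for i in range(k):
            for j in range(k):
                boxes.append((cx * coarse + i * fine, cy * coarse + j * fine, fine))
    return boxes

tokens = foveated_token_boxes()
uniform = (224 // 16) ** 2
print(len(tokens), "foveated tokens vs", uniform, "uniform")  # -> 61 foveated tokens vs 196 uniform
```

Every pixel is still covered exactly once (the patch areas sum to 224Β²), but the token count, and therefore the ViT's attention cost, drops sharply while the region around the gaze keeps full resolution.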
+ ## βš™οΈ Installation
+
+ To set up the environment and install the dependencies for the `gaze-av-aloha` project, which uses this model:
+
+ ```bash
+ # Clone the repository and initialize submodules
+ git clone https://github.com/ian-chuang/gaze-av-aloha.git
+ cd gaze-av-aloha
+ git submodule init
+ git submodule update
+
+ # Create and activate a new Conda environment
+ conda create -n gaze python=3.10
+ conda activate gaze
+
+ # Install LeRobot (primary library for data handling)
+ pip install git+https://github.com/huggingface/lerobot.git@483be9aac217c2d8ef16982490f22b2ad091ab46
+
+ # Install FFmpeg for video logging
+ conda install ffmpeg=7.1.1 -c conda-forge
+
+ # Install AV-ALOHA packages
+ pip install -e ./gym_av_aloha
+ pip install -e ./gaze_av_aloha
+ ```
+ Make sure you are logged in to the Hugging Face CLI: `huggingface-cli login`
+
+ ## πŸš€ Usage
+
+ This checkpoint (`iantc104/mae_vitb_foveated_vit`) is a pretrained Vision Transformer backbone used as a component within the "Look, Focus, Act" framework. You can train and evaluate policies with the scripts provided in the official repository.
+
+ ### Example: Train a Fov-Act Policy (end-to-end gaze as action)
+
+ To train a policy that uses this foveated ViT as its vision encoder, use the `train.py` script from the `gaze-av-aloha` repository:
+
+ ```bash
+ python gaze_av_aloha/scripts/train.py \
+ policy=foveated_vit_policy \
+ task=<task_name_e.g._av_aloha_sim_thread_needle> \
+ policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
+ policy.optimizer_lr_backbone=1e-5 \
+ wandb.enable=true \
+ wandb.project=<your_project_name> \
+ wandb.entity=<your_wandb_entity> \
+ wandb.job_name=fov-act \
+ device=cuda
+ ```
+
+ Replace `<task_name_e.g._av_aloha_sim_thread_needle>` with one of the available task configurations from the project (e.g., `av_aloha_sim_cube_transfer`, `av_aloha_sim_peg_insertion`, etc.). This command loads `iantc104/mae_vitb_foveated_vit` as the vision encoder for the `foveated_vit_policy`.
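In the Fov-Act setup, gaze is folded into the action space, so each policy output jointly carries a gaze target (which drives the next step's foveation) and a robot action. A minimal sketch of that output split; the dimensions, names, and the `split_gaze_action` helper are illustrative assumptions, not the repository's API:

```python
GAZE_DIM = 2           # assumed: a normalized (x, y) gaze target
ACTION_DIM = 14        # assumed: bimanual joint targets for AV-ALOHA

def split_gaze_action(pred):
    """Split one jointly predicted vector into (gaze, robot action).
    The gaze slice selects the next foveation center; the remainder
    is the ordinary robot action. Dimensions are illustrative."""
    assert len(pred) == GAZE_DIM + ACTION_DIM
    return pred[:GAZE_DIM], pred[GAZE_DIM:]

gaze, action = split_gaze_action([0.5, 0.4] + [0.0] * 14)
print(gaze, len(action))  # -> [0.5, 0.4] 14
```

Because both heads come from a single prediction, gaze imitation is learned end-to-end with the actions, in contrast to the two-stage variant where a separate model predicts gaze first.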
+
+ For other policy types (Fov-UNet, Fine, Coarse) and detailed instructions, please refer to the [AV ALOHA Benchmark section](https://github.com/ian-chuang/gaze-av-aloha#av-aloha-benchmark) of the official GitHub repository.
+
+ ## πŸ“„ Citation
+
+ If you find this work or model useful in your research, please consider citing the paper:
+
+ ```bibtex
+ @misc{chuang2025lookfocusactefficient,
+       title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
+       author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
+       year={2025},
+       eprint={2507.15833},
+       archivePrefix={arXiv},
+       primaryClass={cs.RO},
+       url={https://arxiv.org/abs/2507.15833},
+ }
+ ```