Enhance model card with metadata, links, abstract, and usage details
#1 by nielsr (HF Staff) - opened

README.md CHANGED
---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- robotics
- vision-transformer
- robot-learning
pipeline_tag: robotics
library_name: lerobot
license: mit
---

# Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

This repository contains the official code and models for the paper:
**[Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers](https://huggingface.co/papers/2507.15833)**

🌐 **Project Website:** [https://ian-chuang.github.io/gaze-av-aloha/](https://ian-chuang.github.io/gaze-av-aloha/)
💻 **Official Codebase:** [https://github.com/ian-chuang/gaze-av-aloha](https://github.com/ian-chuang/gaze-av-aloha)

![Teaser GIF](https://ian-chuang.github.io/gaze-av-aloha/resources/teaser.gif)

## Abstract

Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. We build on recent advances in foveated image processing and apply them to an Active Vision robot system that emulates both human head movement and eye tracking. Extending prior work on the AV-ALOHA robot simulation platform, we introduce a framework for simultaneously collecting eye-tracking data and robot demonstrations from a human operator, as well as a simulation benchmark and dataset for training robot policies that incorporate human gaze. Given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme inspired by recent work in image segmentation. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation, without sacrificing visual fidelity near regions of interest. We also explore two approaches to gaze imitation and prediction from human data. The first is a two-stage model that predicts gaze to guide foveation and action; the second integrates gaze into the action space, allowing the policy to jointly predict gaze and actions end-to-end. Our results show that our method for foveated robot vision not only drastically reduces computational overhead, but also improves performance for high-precision tasks and robustness to unseen distractors. Together, these findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems.

## ✨ Key Features

* **Human-inspired Foveated Vision**: Integrates human gaze to guide foveated patch tokenization in Vision Transformers (ViTs), significantly reducing computation.
* **Efficiency & Robustness**: Substantially reduces computational overhead while improving performance on high-precision tasks and robustness to unseen distractors.
* **Comprehensive Dataset**: Introduces a simulation benchmark and dataset for training gaze-aware robot policies, with bimanual robot demonstrations and synchronized human eye tracking collected on the AV-ALOHA simulation platform.
* **Gaze Imitation Approaches**: Explores both a two-stage gaze-prediction model and end-to-end joint prediction of gaze and actions.
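
The token savings from foveated patch tokenization can be illustrated with a back-of-the-envelope count. This is a simplified sketch, not the paper's implementation; the patch, fovea, and image sizes below are illustrative assumptions:

```python
# Illustrative sketch: compare token counts for uniform ViT tokenization
# vs. a simple two-level foveated tokenization of an H x W image.

def uniform_tokens(h, w, patch=16):
    # Standard ViT tokenization: one token per fixed-size patch.
    return (h // patch) * (w // patch)

def foveated_tokens(h, w, fovea=64, fine=16, coarse=32):
    # Fine patches inside a fovea window centered on the gaze point;
    # coarse patches tile the remainder of the image.
    fine_count = (fovea // fine) ** 2
    coarse_total = (h // coarse) * (w // coarse)
    coarse_in_fovea = (fovea // coarse) ** 2  # coarse cells replaced by fine ones
    return fine_count + coarse_total - coarse_in_fovea

print(uniform_tokens(224, 224))   # 196 tokens for a uniform 16x16 patch grid
print(foveated_tokens(224, 224))  # 61 tokens: 16 fine + 49 coarse - 4 replaced
```

Even this crude two-level scheme cuts the token count (and hence attention cost) by roughly 3x while keeping full resolution near the gaze point.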

## ⚙️ Installation

To set up the environment and install the dependencies for the `gaze-av-aloha` project, which uses this model:

```bash
# Clone the repository and initialize submodules
git clone https://github.com/ian-chuang/gaze-av-aloha.git
cd gaze-av-aloha
git submodule init
git submodule update

# Create and activate a new Conda environment
conda create -n gaze python=3.10
conda activate gaze

# Install LeRobot (primary library for data handling)
pip install git+https://github.com/huggingface/lerobot.git@483be9aac217c2d8ef16982490f22b2ad091ab46

# Install FFmpeg for video logging
conda install ffmpeg=7.1.1 -c conda-forge

# Install the AV-ALOHA packages
pip install -e ./gym_av_aloha
pip install -e ./gaze_av_aloha
```

Make sure you are logged in to the Hugging Face CLI: `huggingface-cli login`

## 🚀 Usage

This model checkpoint (`iantc104/mae_vitb_foveated_vit`) is a pretrained Vision Transformer backbone used as a component of the "Look, Focus, Act" framework. You can train and evaluate policies with the scripts provided in the official repository.

### Example: Train a Fov-Act Policy (end-to-end gaze as action)

To train a policy that uses this foveated ViT as its vision encoder, use the `train.py` script from the `gaze-av-aloha` repository:

```bash
python gaze_av_aloha/scripts/train.py \
    policy=foveated_vit_policy \
    task=<task_name_e.g._av_aloha_sim_thread_needle> \
    policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
    policy.optimizer_lr_backbone=1e-5 \
    wandb.enable=true \
    wandb.project=<your_project_name> \
    wandb.entity=<your_wandb_entity> \
    wandb.job_name=fov-act \
    device=cuda
```

Replace `<task_name_e.g._av_aloha_sim_thread_needle>` with one of the available task configurations from the project (e.g., `av_aloha_sim_cube_transfer` or `av_aloha_sim_peg_insertion`). This command loads `iantc104/mae_vitb_foveated_vit` as the vision encoder for the `foveated_vit_policy`.
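
For intuition about the "end-to-end gaze as action" formulation used by Fov-Act, the policy output can be thought of as one vector that concatenates the robot command with a predicted gaze point. The sketch below is illustrative only; the dimensions and names are assumptions, not the project's actual API:

```python
# Illustrative only: a Fov-Act-style action concatenates robot joint
# targets with a 2-D gaze point, so gaze is predicted jointly with actions.
ARM_DIM, GAZE_DIM = 14, 2  # assumed bimanual joint targets + (x, y) gaze

def split_action(action):
    # Downstream, the gaze part selects the fovea used to tokenize the
    # next observation, closing the perception-action loop.
    assert len(action) == ARM_DIM + GAZE_DIM
    return action[:ARM_DIM], action[ARM_DIM:]

action = [0.0] * ARM_DIM + [0.25, 0.75]
command, gaze = split_action(action)
print(len(command), gaze)  # 14 [0.25, 0.75]
```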

For other policy types (Fov-UNet, Fine, Coarse) and detailed instructions, please refer to the [AV ALOHA Benchmark section](https://github.com/ian-chuang/gaze-av-aloha#av-aloha-benchmark) of the official GitHub repository.

## 📚 Citation

If you find this work or model useful in your research, please consider citing the paper:

```bibtex
@misc{chuang2025lookfocusactefficient,
  title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
  author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
  year={2025},
  eprint={2507.15833},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2507.15833},
}
```