Enhance model card with metadata, links, and usage for robotics model
#1, opened by nielsr (HF Staff)

README.md CHANGED

---
license: apache-2.0
pipeline_tag: robotics
library_name: lerobot
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- gaze-tracking
- foveated-vision
- robot-learning
---

# Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

This repository contains a model accompanying the paper **"Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers"**.

Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces the amount of visual input that must be processed. This work explores how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. By leveraging foveated image processing and foveated Vision Transformers, our approach significantly reduces computational overhead while improving performance and robustness on robotic tasks.
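
To make the idea concrete, here is a minimal, illustrative sketch (not the code from this repository) of a two-level foveation scheme: full resolution is kept only in a crop around the gaze point, while the periphery is downsampled, so a ViT processes far fewer patch tokens than it would for the full frame. All names below are hypothetical.

```python
# Illustrative only: a two-level foveation scheme that keeps full resolution
# in a crop around the gaze point and a coarse, downsampled peripheral view.
import torch
import torch.nn.functional as F

def foveate(image: torch.Tensor, gaze_xy: torch.Tensor,
            fovea: int = 96, coarse: int = 96):
    """image: (B, C, H, W); gaze_xy: (B, 2) gaze location in pixels."""
    B, C, H, W = image.shape
    half = fovea // 2
    # Keep the crop inside the frame by clamping the gaze point.
    cx = gaze_xy[:, 0].long().clamp(half, W - half)
    cy = gaze_xy[:, 1].long().clamp(half, H - half)
    crops = torch.stack([
        image[b, :, cy[b] - half:cy[b] + half, cx[b] - half:cx[b] + half]
        for b in range(B)
    ])  # (B, C, fovea, fovea) full-resolution foveal view
    # Coarse peripheral view: the entire frame downsampled to the same size.
    periphery = F.interpolate(image, size=(coarse, coarse),
                              mode="bilinear", align_corners=False)
    return crops, periphery

img = torch.rand(1, 3, 480, 640)       # one 480x640 RGB frame
gaze = torch.tensor([[320.0, 240.0]])  # gaze at the image center
fovea_view, coarse_view = foveate(img, gaze)
print(fovea_view.shape, coarse_view.shape)  # both: (1, 3, 96, 96)
```

With 16-pixel patches, the two 96×96 views above yield 72 tokens in total, versus 1,200 for the full 480×640 frame.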

* **Paper:** [Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers](https://huggingface.co/papers/2507.15833)
* **Project Website:** [https://ian-chuang.github.io/gaze-av-aloha/](https://ian-chuang.github.io/gaze-av-aloha/)
* **Code Repository:** [https://github.com/ian-chuang/gaze-av-aloha](https://github.com/ian-chuang/gaze-av-aloha)

![Figure 1](https://github.com/ian-chuang/gaze-av-aloha/raw/main/media/figure1.png)

This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration.
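
For reference, the mixin gives any `nn.Module` subclass `save_pretrained`, `push_to_hub`, and `from_pretrained` methods. The sketch below shows only the general loading pattern; `GazePolicy` is a placeholder, as the actual policy class is defined in the `gaze_av_aloha` package.

```python
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

# Placeholder class: the real policy lives in the gaze_av_aloha package and
# has its own architecture and constructor arguments.
class GazePolicy(nn.Module, PyTorchModelHubMixin):
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Linear(hidden_dim, hidden_dim)

# Mixin subclasses gain transformers-style loading from the Hub:
# model = GazePolicy.from_pretrained("<this_repo_id>")
```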

## Installation

To set up the environment and install all necessary dependencies for the `gaze-av-aloha` project, follow these steps:

```bash
# Clone the repository and initialize submodules
git clone https://github.com/ian-chuang/gaze-av-aloha.git
cd gaze-av-aloha
git submodule init
git submodule update

# Create and activate a new Conda environment
conda create -n gaze python=3.10
conda activate gaze

# Install LeRobot (pinned commit)
pip install git+https://github.com/huggingface/lerobot.git@483be9aac217c2d8ef16982490f22b2ad091ab46

# Install FFmpeg for video logging
conda install ffmpeg=7.1.1 -c conda-forge

# Install the AV-ALOHA packages in editable mode
pip install -e ./gym_av_aloha
pip install -e ./gaze_av_aloha
```
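
As a quick sanity check after installation, you can confirm that both editable packages import. Note that the module names below are assumed from the directory names and may differ; check the repository if the imports fail.

```python
# Optional sanity check: verify the editable installs are importable.
# The module names are assumptions based on the package directories above.
import gym_av_aloha
import gaze_av_aloha

print("gaze-av-aloha environment looks good")
```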

Make sure you're logged in to Hugging Face: `huggingface-cli login`
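
If you'd rather authenticate from Python, or want to confirm the login worked, the equivalent `huggingface_hub` calls are:

```python
# Optional: authenticate and verify the login from Python instead of the CLI.
from huggingface_hub import login, whoami

login()                  # prompts for a token, same as `huggingface-cli login`
print(whoami()["name"])  # prints your Hub username if the token is valid
```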

## Usage (Training a policy)

You can train and evaluate policies using the provided `train.py` script from the GitHub repository. Pretrained ViT weights and gaze models are available on Hugging Face. An example for training the `Fov-Act` policy (end-to-end gaze as action) is shown below:

```bash
python gaze_av_aloha/scripts/train.py \
    policy=foveated_vit_policy \
    task=<task_name_e.g._av_aloha_sim_thread_needle> \
    policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
    policy.optimizer_lr_backbone=1e-5 \
    wandb.enable=true \
    wandb.project=<your_project_name> \
    wandb.entity=<your_wandb_entity> \
    wandb.job_name=fov-act \
    device=cuda
```

Replace `<task_name_e.g._av_aloha_sim_thread_needle>`, `<your_project_name>`, and `<your_wandb_entity>` with your desired values. For detailed instructions on available tasks, other policy configurations (e.g., Fov-UNet, Fine, Coarse), and how to use pretrained models, please refer to the [official GitHub repository](https://github.com/ian-chuang/gaze-av-aloha).
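
The training command above pulls the pretrained foveated ViT encoder from the Hub via `policy.vision_encoder_kwargs.repo_id`. If you want to pre-fetch those weights, for example on a cluster node with restricted network access, `huggingface_hub` can download the repository directly; this is an optional convenience, not a required step:

```python
# Optional: pre-download the pretrained foveated ViT weights referenced by
# policy.vision_encoder_kwargs.repo_id so training doesn't fetch at startup.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="iantc104/mae_vitb_foveated_vit")
print(f"weights cached at: {local_dir}")
```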

## Citation

If you find this work helpful or inspiring, please consider citing it:

```bibtex
@misc{chuang2025lookfocusactefficient,
      title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
      author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
      year={2025},
      eprint={2507.15833},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.15833},
}
```