Enhance model card with metadata, links, and usage for robotics model

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +82 -4
README.md CHANGED
@@ -1,10 +1,88 @@
  ---
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Code: [More Information Needed]
- - Paper: [More Information Needed]
- - Docs: [More Information Needed]
  ---
+ license: apache-2.0
+ pipeline_tag: robotics
+ library_name: lerobot
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
+ - gaze-tracking
+ - foveated-vision
+ - robot-learning
  ---

+ # Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers
+
+ This repository contains a model associated with the paper **"Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers"**.
+
+ Human vision is a highly active process driven by gaze: attention and fixation are directed to task-relevant regions, dramatically reducing the amount of visual processing required. This work explores how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. By leveraging foveated image processing and foveated Vision Transformers, the approach substantially reduces computational overhead while improving performance and robustness on robotic tasks.
+
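The intuition behind foveated processing can be sketched in a few lines: keep a high-resolution crop around the gaze point and a coarsely subsampled periphery, so most pixels (and hence most vision tokens) are processed at low resolution. The NumPy sketch below is an illustration only, not the paper's actual implementation; the crop size and subsampling stride are arbitrary choices here.

```python
import numpy as np

def foveate(image: np.ndarray, gaze_xy: tuple, fovea: int = 32, stride: int = 4):
    """Split an H x W x C image into a sharp foveal crop around the gaze
    point plus a cheap, coarsely subsampled peripheral view."""
    h, w = image.shape[:2]
    gx, gy = gaze_xy
    # Clamp the crop so it stays fully inside the image bounds.
    x0 = min(max(gx - fovea // 2, 0), w - fovea)
    y0 = min(max(gy - fovea // 2, 0), h - fovea)
    fovea_crop = image[y0:y0 + fovea, x0:x0 + fovea]
    periphery = image[::stride, ::stride]  # low-resolution global context
    return fovea_crop, periphery

img = np.zeros((128, 128, 3), dtype=np.uint8)
crop, ctx = foveate(img, gaze_xy=(64, 64))
print(crop.shape, ctx.shape)  # (32, 32, 3) (32, 32, 3)
```

With these numbers, the downstream encoder sees two 32x32 views instead of one 128x128 image, i.e. roughly an 8x reduction in pixels, which is the source of the computational savings described above.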
+ * **Paper:** [Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers](https://huggingface.co/papers/2507.15833)
+ * **Project Website:** [https://ian-chuang.github.io/gaze-av-aloha/](https://ian-chuang.github.io/gaze-av-aloha/)
+ * **Code Repository:** [https://github.com/ian-chuang/gaze-av-aloha](https://github.com/ian-chuang/gaze-av-aloha)
+
+ ![hero](https://github.com/ian-chuang/gaze-av-aloha/raw/main/media/hero.gif)
+
+ This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration.
+
+ ## Installation
+
+ To set up the environment and install all necessary dependencies for the `gaze-av-aloha` project, follow these steps:
+
+ ```bash
+ # Clone the repository and initialize submodules
+ git clone https://github.com/ian-chuang/gaze-av-aloha.git
+ cd gaze-av-aloha
+ git submodule init
+ git submodule update
+
+ # Create and activate a new Conda environment
+ conda create -n gaze python=3.10
+ conda activate gaze
+
+ # Install LeRobot
+ pip install git+https://github.com/huggingface/lerobot.git@483be9aac217c2d8ef16982490f22b2ad091ab46
+
+ # Install FFmpeg for video logging
+ conda install ffmpeg=7.1.1 -c conda-forge
+
+ # Install the AV-ALOHA packages
+ pip install -e ./gym_av_aloha
+ pip install -e ./gaze_av_aloha
+ ```
+
+ Make sure you are logged in to Hugging Face: `huggingface-cli login`
+
+ ## Usage (Training a policy)
+
+ You can train and evaluate policies using the `train.py` script from the GitHub repository. Pretrained ViT weights and gaze models are available on Hugging Face. An example of training the `Fov-Act` policy (end-to-end gaze as action) is shown below:
+
+ ```bash
+ python gaze_av_aloha/scripts/train.py \
+     policy=foveated_vit_policy \
+     task=<task_name_e.g._av_aloha_sim_thread_needle> \
+     policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
+     policy.optimizer_lr_backbone=1e-5 \
+     wandb.enable=true \
+     wandb.project=<your_project_name> \
+     wandb.entity=<your_wandb_entity> \
+     wandb.job_name=fov-act \
+     device=cuda
+ ```
+
+ Replace `<task_name_e.g._av_aloha_sim_thread_needle>`, `<your_project_name>`, and `<your_wandb_entity>` with your own values. For detailed instructions on the available tasks, the other policy configurations (e.g., Fov-UNet, Fine, Coarse), and how to use the pretrained models, refer to the [official GitHub repository](https://github.com/ian-chuang/gaze-av-aloha).
+
+ ## Citation
+
+ If you find this work helpful or inspiring, please consider citing it:
+
+ ```bibtex
+ @misc{chuang2025lookfocusactefficient,
+     title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
+     author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
+     year={2025},
+     eprint={2507.15833},
+     archivePrefix={arXiv},
+     primaryClass={cs.RO},
+     url={https://arxiv.org/abs/2507.15833},
+ }
+ ```