Improve model card with metadata, links, and usage details

#1 by nielsr (HF Staff)

Files changed (1): README.md (+64 −4)

README.md
@@ -2,9 +2,69 @@
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
+ pipeline_tag: robotics
+ library_name: lerobot
+ license: unknown
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Code: [More Information Needed]
- - Paper: [More Information Needed]
- - Docs: [More Information Needed]
+ # Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers
+
+ This model is part of the work presented in the paper [**Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers**](https://huggingface.co/papers/2507.15833).
+
+ <div align="center">
+ <img src="https://ian-chuang.github.io/gaze-av-aloha/media/hero.gif" alt="Look, Focus, Act Hero GIF" width="100%">
+ </div>
+
+ **Project Website:** [https://ian-chuang.github.io/gaze-av-aloha/](https://ian-chuang.github.io/gaze-av-aloha/)
+ **Code (GitHub):** [https://github.com/ian-chuang/gaze-av-aloha](https://github.com/ian-chuang/gaze-av-aloha)
+
+ ## About the Model
+
+ Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. This work explores how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance.
+
+ This model is a component of an Active Vision robot system that emulates human head movement and eye tracking. It builds on recent advances in foveated image processing and integrates gaze information into Vision Transformers (ViTs) through a foveated patch tokenization scheme. This approach reduces computational overhead by 94% and accelerates training by 7x and inference by 3x, without sacrificing visual fidelity near regions of interest.
+
+ The framework supports simultaneously collecting eye-tracking data and robot demonstrations from a human operator, and provides a simulation benchmark and dataset for training robot policies that incorporate human gaze. The findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems.
+
+ This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration. Depending on the repository, the weights may be a pretrained vision encoder (e.g. MAE-pretrained ViT weights) or a task-specific gaze model, as detailed in the project's repository.
+
+ ## Sample Usage
+
+ To use this model as part of the "Look, Focus, Act" framework, set up the environment and run the training or evaluation scripts provided in the official GitHub repository. The project uses the `LeRobot` library for dataset handling and policy training.
+
+ For detailed installation instructions, dataset preparation, and policy training, please refer to the [official GitHub repository](https://github.com/ian-chuang/gaze-av-aloha).
+
+ Here is an example of how a foveated Vision Transformer policy can be trained with a pretrained gaze model (which this repository may contain, if it is a gaze model instance):
+
+ ```bash
+ # Example from the project's train.py script for Fov-UNet (two-stage, with a pretrained gaze model)
+ python gaze_av_aloha/scripts/train.py \
+     policy=foveated_vit_policy \
+     task=<task e.g. av_aloha_sim_thread_needle> \
+     policy.use_gaze_as_action=false \
+     policy.gaze_model_repo_id=iantc104/gaze_model_av_aloha_sim_thread_needle \
+     policy.vision_encoder_kwargs.repo_id=iantc104/mae_vitb_foveated_vit \
+     policy.optimizer_lr_backbone=1e-5 \
+     wandb.enable=true \
+     wandb.project=<your_project_name> \
+     wandb.entity=<your_wandb_entity> \
+     wandb.job_name=fov-unet \
+     device=cuda
+ ```
+
+ Replace `<your_project_name>` and `<your_wandb_entity>` with your Weights & Biases credentials, and `<task e.g. av_aloha_sim_thread_needle>` with the specific task you are working on.
+
+ ## Citation
+
+ If you find this work useful for your research, please consider citing the original paper:
+
+ ```bibtex
+ @misc{chuang2025lookfocusactefficient,
+   title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
+   author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
+   year={2025},
+   eprint={2507.15833},
+   archivePrefix={arXiv},
+   primaryClass={cs.RO},
+   url={https://arxiv.org/abs/2507.15833},
+ }
+ ```