Improve model card metadata and add paper link
#1
by nielsr HF Staff - opened
README.md CHANGED

````diff
@@ -1,21 +1,25 @@
 ---
-
+base_model:
+- facebook/dinov2-large
 datasets:
 - ShijianDeng/gazefollow
 language:
 - en
-
-
-
+license: mit
+pipeline_tag: other
+tags:
+- gaze-estimation
+- mixture-of-experts
+- computer-vision
 ---
+
 # GazeMoE: Gaze Estimation with Mixture-of-Experts
 
 GazeMoE is a lightweight gaze estimation model (14MB decoder) built on top of a frozen **DINOv2 Vit-L/14** backbone. It uses a Mixture-of-Experts (MoE) transformer decoder to predict whether a person's gaze target is inside or outside the camera frame and generates a heatmap for the gaze location.
 
-
-
-
-
+This model was presented in the paper [GazeMoE: Perception of Gaze Target with Mixture-of-Experts](https://huggingface.co/papers/2603.06256).
+
+The official code is available in the [GitHub repository](https://github.com/zdai257/DisengageNet/tree/gazemoe).
 
 ## Quick Start
 
@@ -23,7 +27,6 @@ papers:
 
 ```bash
 pip install torch torchvision timm huggingface_hub numpy Pillow
-
 ```
 
 ### 2. Hello World Example
@@ -87,7 +90,6 @@ else:
 
 print(f"Estimated Gaze Target (Normalized): x={x_norm:.2f}, y={y_norm:.2f}")
 print(f"Pixel Coordinates: X={x_norm * w:.1f}, Y={y_norm * h:.1f}")
-
 ```
 
 ---
@@ -99,18 +101,26 @@ else:
 The model consumes a dictionary:
 
 * **`images`**: A `torch.Tensor` of shape `(Batch, 3, 448, 448)`. Use the `transform` provided by the factory function to ensure correct normalization and resizing.
-* **`bboxes`**: A list of lists. Each sub-list corresponds to an image in the batch and contains the head bounding box proposals in **normalized coordinates**
+* **`bboxes`**: A list of lists. Each sub-list corresponds to an image in the batch and contains the head bounding box proposals in **normalized coordinates**.
 
 ### Output Decoding
 
 The model returns a dictionary with two keys:
 
-1. **`inout`**: A sigmoid output. Values
-2. **`heatmap`**: A
+1. **`inout`**: A sigmoid output. Values indicate the probability of the person looking at something inside the image boundaries.
+2. **`heatmap`**: A spatial map (64x64). The gaze target is typically identified by taking the `argmax` of this map to find the peak intensity coordinate.
 
 ---
 
 ### Citation
 
-If you use this model in your research, please
-
+If you use this model in your research, please cite the following paper:
+
+```bibtex
+@article{dai2026gazemoe,
+  title={GazeMoE: Perception of Gaze Target with Mixture-of-Experts},
+  author={Dai, Zhuangzhuang and Lu, Zhongxi and Zakka, Vincent G. and Manso, Luis J. and Calero, Jose M Alcaraz and Li, Chen},
+  journal={arXiv preprint arXiv:2603.06256},
+  year={2026}
+}
+```
````