Improve model card metadata and add paper link

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +25 -15
README.md CHANGED
@@ -1,21 +1,25 @@
 ---
-license: mit
+base_model:
+- facebook/dinov2-large
 datasets:
 - ShijianDeng/gazefollow
 language:
 - en
-base_model:
-- facebook/dinov2-large
-pipeline_tag: keypoint-detection
+license: mit
+pipeline_tag: other
+tags:
+- gaze-estimation
+- mixture-of-experts
+- computer-vision
 ---
+
 # GazeMoE: Gaze Estimation with Mixture-of-Experts
 
 GazeMoE is a lightweight gaze estimation model (14MB decoder) built on top of a frozen **DINOv2 ViT-L/14** backbone. It uses a Mixture-of-Experts (MoE) transformer decoder to predict whether a person's gaze target is inside or outside the camera frame and generates a heatmap for the gaze location.
 
----
-papers:
-- https://arxiv.org/abs/2603.06256
----
+This model was presented in the paper [GazeMoE: Perception of Gaze Target with Mixture-of-Experts](https://huggingface.co/papers/2603.06256).
+
+The official code is available in the [GitHub repository](https://github.com/zdai257/DisengageNet/tree/gazemoe).
 
 ## Quick Start
 
@@ -23,7 +27,6 @@ papers:
 
 ```bash
 pip install torch torchvision timm huggingface_hub numpy Pillow
-
 ```
 
 ### 2. Hello World Example
@@ -87,7 +90,6 @@ else:
 
 print(f"Estimated Gaze Target (Normalized): x={x_norm:.2f}, y={y_norm:.2f}")
 print(f"Pixel Coordinates: X={x_norm * w:.1f}, Y={y_norm * h:.1f}")
-
 ```
 
 ---
@@ -99,18 +101,26 @@ else:
 The model consumes a dictionary:
 
 * **`images`**: A `torch.Tensor` of shape `(Batch, 3, 448, 448)`. Use the `transform` provided by the factory function to ensure correct normalization and resizing.
-* **`bboxes`**: A list of lists. Each sub-list corresponds to an image in the batch and contains the head bounding box proposals in **normalized coordinates** .
+* **`bboxes`**: A list of lists. Each sub-list corresponds to an image in the batch and contains the head bounding box proposals in **normalized coordinates**.
 
 ### Output Decoding
 
 The model returns a dictionary with two keys:
 
-1. **`inout`**: A sigmoid output. Values indicate the person is looking at something outside the image boundaries.
-2. **`heatmap`**: A spatial map. The gaze target is typically identified by taking the `argmax` of this map to find the peak intensity coordinate.
+1. **`inout`**: A sigmoid output. Values indicate the probability of the person looking at something inside the image boundaries.
+2. **`heatmap`**: A spatial map (64x64). The gaze target is typically identified by taking the `argmax` of this map to find the peak intensity coordinate.
 
 ---
 
 ### Citation
 
-If you use this model in your research, please link to the original GitHub repository:
-[zdai257/DisengageNet](https://github.com/zdai257/DisengageNet/tree/gazemoe)
+If you use this model in your research, please cite the following paper:
+
+```bibtex
+@article{dai2026gazemoe,
+  title={GazeMoE: Perception of Gaze Target with Mixture-of-Experts},
+  author={Dai, Zhuangzhuang and Lu, Zhongxi and Zakka, Vincent G. and Manso, Luis J. and Calero, Jose M Alcaraz and Li, Chen},
+  journal={arXiv preprint arXiv:2603.06256},
+  year={2026}
+}
+```
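The argmax decoding described in the Output Decoding section can be sketched without the model itself. The 64x64 heatmap size comes from the card; everything else here is an assumption for illustration — in particular the `col / (SIZE - 1)` normalization of the peak index and the 0.5 threshold on `inout`, neither of which the card specifies — and the heatmap is a synthetic nested list standing in for the real `torch.Tensor` output:

```python
# Hedged sketch of heatmap argmax decoding, on synthetic data.
# Assumptions (not from the model card): peak index -> normalized
# coordinate uses col / (SIZE - 1), and inout > 0.5 means "in frame".

HEATMAP_SIZE = 64  # the card states the heatmap is 64x64

def decode_gaze(heatmap, inout_score, img_w, img_h, threshold=0.5):
    """Turn a 2D heatmap (nested lists) and a sigmoid score into
    (in_frame, x_norm, y_norm, x_px, y_px)."""
    flat = [v for row in heatmap for v in row]
    peak = max(range(len(flat)), key=flat.__getitem__)  # argmax over the map
    row, col = divmod(peak, HEATMAP_SIZE)               # back to 2D indices
    x_norm = col / (HEATMAP_SIZE - 1)  # assumed endpoint normalization
    y_norm = row / (HEATMAP_SIZE - 1)
    in_frame = inout_score > threshold
    return in_frame, x_norm, y_norm, x_norm * img_w, y_norm * img_h

# Synthetic heatmap: a single hot cell at row 32, col 48.
hm = [[0.0] * HEATMAP_SIZE for _ in range(HEATMAP_SIZE)]
hm[32][48] = 1.0
in_frame, x_norm, y_norm, x_px, y_px = decode_gaze(hm, 0.9, img_w=640, img_h=480)
print(f"in_frame={in_frame}, x={x_norm:.2f}, y={y_norm:.2f}, px=({x_px:.1f}, {y_px:.1f})")
```

With real outputs, `heatmap` would be the model's `(64, 64)` tensor (e.g. via `.tolist()`) and `inout_score` the sigmoid value from the `inout` key; the two `print` lines in the card's Hello World example perform the same norm-to-pixel conversion.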