Improve model card metadata and add paper link
#1
by nielsr HF Staff - opened
README.md CHANGED

````diff
@@ -1,21 +1,25 @@
 ---
-
+base_model:
+- facebook/dinov2-large
 datasets:
 - ShijianDeng/gazefollow
 language:
 - en
-
-
-
+license: mit
+pipeline_tag: other
+tags:
+- gaze-estimation
+- mixture-of-experts
+- computer-vision
 ---
+
 # GazeMoE: Gaze Estimation with Mixture-of-Experts
 
 GazeMoE is a lightweight gaze estimation model (14MB decoder) built on top of a frozen **DINOv2 Vit-L/14** backbone. It uses a Mixture-of-Experts (MoE) transformer decoder to predict whether a person's gaze target is inside or outside the camera frame and generates a heatmap for the gaze location.
 
-
-
-
-
+This model was presented in the paper [GazeMoE: Perception of Gaze Target with Mixture-of-Experts](https://huggingface.co/papers/2603.06256).
+
+The official code is available in the [GitHub repository](https://github.com/zdai257/DisengageNet/tree/gazemoe).
 
 ## Quick Start
 
@@ -23,7 +27,6 @@ papers:
 
 ```bash
 pip install torch torchvision timm huggingface_hub numpy Pillow
-
 ```
 
 ### 2. Hello World Example
@@ -87,7 +90,6 @@ else:
 
 print(f"Estimated Gaze Target (Normalized): x={x_norm:.2f}, y={y_norm:.2f}")
 print(f"Pixel Coordinates: X={x_norm * w:.1f}, Y={y_norm * h:.1f}")
-
 ```
 
 ---
@@ -99,18 +101,26 @@ else:
 The model consumes a dictionary:
 
 * **`images`**: A `torch.Tensor` of shape `(Batch, 3, 448, 448)`. Use the `transform` provided by the factory function to ensure correct normalization and resizing.
-* **`bboxes`**: A list of lists. Each sub-list corresponds to an image in the batch and contains the head bounding box proposals in **normalized coordinates**
+* **`bboxes`**: A list of lists. Each sub-list corresponds to an image in the batch and contains the head bounding box proposals in **normalized coordinates**.
 
 ### Output Decoding
 
 The model returns a dictionary with two keys:
 
-1. **`inout`**: A sigmoid output. Values
-2. **`heatmap`**: A
+1. **`inout`**: A sigmoid output. Values indicate the probability of the person looking at something inside the image boundaries.
+2. **`heatmap`**: A spatial map (64x64). The gaze target is typically identified by taking the `argmax` of this map to find the peak intensity coordinate.
 
 ---
 
 ### Citation
 
-If you use this model in your research, please
-
+If you use this model in your research, please cite the following paper:
+
+```bibtex
+@article{dai2026gazemoe,
+  title={GazeMoE: Perception of Gaze Target with Mixture-of-Experts},
+  author={Dai, Zhuangzhuang and Lu, Zhongxi and Zakka, Vincent G. and Manso, Luis J. and Calero, Jose M Alcaraz and Li, Chen},
+  journal={arXiv preprint arXiv:2603.06256},
+  year={2026}
+}
+```
````