GazeAnywhere: Gaze Target Estimation Anywhere with Concepts
GazeAnywhere is a foundation model for Promptable Gaze Target Estimation (PGE). Given an image and a natural-language description of a person (a text concept), the model jointly predicts:
- the subject's head bounding box
- whether their gaze target is inside or outside the image frame
- a gaze target heatmap
This checkpoint is the dinov3txt_large_text_concept variant: a frozen DINOv3 ViT-L/16 + DINO-Txt backbone with a 6-layer fusion transformer trained on large-scale in-the-wild gaze data with appearance text prompts.
Venue: CVPR 2026
Code: github.com/IrohXu/GazeAnywhere
Model Description
Estimating human gaze targets from in-the-wild images is challenging. Prior methods often rely on brittle multi-stage pipelines that require explicit head boxes or pose estimates, and they do not support flexible natural-language subject specification.
GazeAnywhere addresses this with an end-to-end, concept-driven design. You specify who to analyze with a short text prompt (e.g. "blonde hair girl in a blue striped shirt"), and the model localizes that subject and estimates where they are looking—without separate detection or pose stages.
The architecture fuses frozen vision–language features from DINOv3 + DINO-Txt through a transformer decoder and predicts three outputs in one forward pass: head box, in/out-of-frame gaze presence, and a 64×64 gaze heatmap.
Model Details
| Property | Value |
|---|---|
| Variant | gazeanywhere_dinov3txt_large_text_concept |
| Vision backbone | DINOv3 ViT-L/16 (frozen) |
| Text backbone | DINO-Txt (CLIP-style BPE, 77 tokens) |
| Fusion transformer | 6 layers, dim = 512 |
| Input resolution | 512 × 512 |
| Heatmap resolution | 64 × 64 |
| Prompt type | Text appearance description |
| Format | Hugging Face Transformers (trust_remote_code=True) |
Installation
pip install "torch>=2.0" "transformers>=4.56.0" pillow opencv-python-headless
To run the visualization example from the GazeAnywhere repo, also clone the repository so the model can import its backbone code at runtime:
git clone https://github.com/IrohXu/GazeAnywhere.git
cd GazeAnywhere
pip install -r requirements.txt
Note: This model uses custom Transformers classes shipped in the repository (configuration_gazeanywhere.py, modeling_gazeanywhere.py, processing_gazeanywhere.py). Loading requires trust_remote_code=True.
Usage with 🤗 Transformers
Text-Conditioned Gaze Estimation
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
model_id = "IrohXu/GazeAnywhere"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
image = Image.open("your_image.jpg").convert("RGB")
text = "appearance: light brown hair girl with blue and white striped shirt"
inputs = processor(images=image, text=text, return_tensors="pt")
inputs = {k: v.to(device) if hasattr(v, "to") else v for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)
results = processor.post_process_gaze_estimation(
    outputs,
    target_sizes=inputs["original_sizes"].tolist(),
    use_dark_inference=True,
)[0]
print("Gaze point (x, y):", results["gaze"])
print("Gaze in frame:", results["inout"])
print("In/out score:", results["inout_score"])
print("Head bbox [x1, y1, x2, y2]:", results["head_bbox"])
Batched Inference
images = [Image.open(p).convert("RGB") for p in ["img1.jpg", "img2.jpg"]]
texts = [
    "child in red shirt",
    "woman with glasses",
]
inputs = processor(images=images, text=texts, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
results = processor.post_process_gaze_estimation(
    outputs,
    target_sizes=inputs["original_sizes"].tolist(),
    use_dark_inference=True,
)
for i, result in enumerate(results):
    print(f"Image {i}: gaze={result['gaze']}, inout={result['inout']}")
Visualization
Save an overlay with heatmap, head box, gaze line, and gaze point using the visualization script from the code repository:
python examples/hf_visualization.py \
--model-dir IrohXu/GazeAnywhere \
--image-path your_image.jpg \
--text "appearance: child with golden hair" \
--save-path visualization.jpg
Green overlays indicate in-frame gaze; red head boxes indicate out-of-frame gaze.
Inputs and Outputs
Inputs
| Input | Type | Description |
|---|---|---|
| images | PIL.Image or list[PIL.Image] | RGB image(s) |
| text | str or list[str] | Appearance description of the target person. Prefixing with "appearance: " is recommended. |
The processor resizes images to 512×512, applies ImageNet normalization, and tokenizes text with the bundled CLIP-style BPE vocabulary (max 77 tokens).
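As a rough illustration of the normalization step only, the sketch below applies the standard ImageNet channel statistics to a single 8-bit RGB pixel. The constants are the commonly used ImageNet mean and standard deviation; that the bundled processor uses exactly these values is an assumption, and `normalize_rgb` is a hypothetical helper, not part of the GazeAnywhere API. Resizing and tokenization are not covered here.

```python
# Standard ImageNet channel statistics (assumed to match the bundled processor).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_rgb(pixel):
    """Map an 8-bit (R, G, B) pixel to ImageNet-normalized floats.

    Hypothetical helper for illustration; the real processor works on
    whole image tensors rather than individual pixels.
    """
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)
    )


if __name__ == "__main__":
    # A mid-gray pixel lands close to zero in every channel.
    print(normalize_rgb((128, 128, 128)))
```

In practice the processor handles this step, so the sketch is mainly useful for sanity-checking inputs prepared by hand.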
Raw model outputs
| Output | Shape | Description |
|---|---|---|
| heatmaps | (B, 1, 64, 64) | Sigmoid gaze heatmap |
| inout_logits | (B,) | Sigmoid probability that gaze is inside the frame |
| pred_boxes | (B, 4) | Head box as normalized cx, cy, w, h in [0, 1] |
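To work with raw `pred_boxes` directly, the normalized center format can be converted to pixel corner coordinates. The sketch below shows one way to do that for a single box. `cxcywh_to_xyxy_pixels` is a hypothetical helper, and clamping to the image bounds is an assumption; `post_process_gaze_estimation` may handle edge cases differently.

```python
def _clamp(value, upper):
    """Limit value to the closed interval [0, upper]."""
    return max(0.0, min(float(upper), value))


def cxcywh_to_xyxy_pixels(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel [x1, y1, x2, y2].

    Hypothetical helper: `box` holds values in [0, 1]; `width` and `height`
    are the original image size in pixels.
    """
    cx, cy, w, h = box
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    # Clamping is an assumption made for this sketch.
    return [_clamp(x1, width), _clamp(y1, height),
            _clamp(x2, width), _clamp(y2, height)]


if __name__ == "__main__":
    # A box centered in a 640x480 image, 20% wide and 40% tall.
    print(cxcywh_to_xyxy_pixels((0.5, 0.5, 0.2, 0.4), 640, 480))
```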
Post-processed results (post_process_gaze_estimation)
| Field | Description |
|---|---|
| gaze | (x, y) gaze point in original image pixel coordinates |
| inout | True if gaze is predicted inside the frame |
| inout_score | Raw in/out probability |
| head_bbox | [x1, y1, x2, y2] head box in original image pixels |
| heatmap | Gaze heatmap resized to original image size |
Set use_dark_inference=True (default) to apply DARK sub-pixel refinement on the heatmap argmax.
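For intuition about what sub-pixel refinement does, here is a minimal pure-Python sketch of the core DARK idea: take the integer argmax, then shift it by a second-order Taylor step computed on the log-heatmap. It omits the Gaussian smoothing step of the original DARK method, and it is not the code in processing_gazeanywhere.py, which may differ in details. `dark_refine` and `gaussian_heatmap` are hypothetical names, and coordinates are returned in heatmap-grid units rather than image pixels.

```python
import math


def dark_refine(heatmap, eps=1e-10):
    """Return a sub-pixel (x, y) peak estimate for a 2D heatmap (list of rows)."""
    rows, cols = len(heatmap), len(heatmap[0])
    # Integer argmax over the grid.
    py, px = max(
        ((y, x) for y in range(rows) for x in range(cols)),
        key=lambda p: heatmap[p[0]][p[1]],
    )
    # The finite-difference stencil needs a full 3x3 neighborhood.
    if not (1 <= px < cols - 1 and 1 <= py < rows - 1):
        return float(px), float(py)

    def f(y, x):
        # Log-heatmap, floored at eps to avoid log(0).
        return math.log(max(heatmap[y][x], eps))

    # Gradient by central differences.
    dx = 0.5 * (f(py, px + 1) - f(py, px - 1))
    dy = 0.5 * (f(py + 1, px) - f(py - 1, px))
    # Hessian entries by finite differences.
    dxx = f(py, px + 1) - 2 * f(py, px) + f(py, px - 1)
    dyy = f(py + 1, px) - 2 * f(py, px) + f(py - 1, px)
    dxy = 0.25 * (f(py + 1, px + 1) - f(py + 1, px - 1)
                  - f(py - 1, px + 1) + f(py - 1, px - 1))
    det = dxx * dyy - dxy * dxy
    if abs(det) < 1e-12:
        return float(px), float(py)
    # Offset = -H^{-1} g, written out for the 2x2 case.
    off_x = -(dyy * dx - dxy * dy) / det
    off_y = -(dxx * dy - dxy * dx) / det
    return px + off_x, py + off_y


def gaussian_heatmap(size, cx, cy, sigma=2.0):
    """Build a synthetic size x size Gaussian heatmap peaked at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]


if __name__ == "__main__":
    hm = gaussian_heatmap(64, 20.3, 41.7)
    print(dark_refine(hm))
```

On a clean Gaussian the log-heatmap is exactly quadratic, so the Taylor step recovers the true peak up to floating-point error; on real predictions the gain over a plain argmax depends on how Gaussian-like the heatmap is.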
Text Prompt Tips
- Describe visible appearance cues: hair color, clothing, accessories, age cues.
- Example: "appearance: blonde hair pulled back, black tank top, pink patterned shorts woman"
- The model was trained on the appearance field from GazeAnywhere annotations; matching that style generally works best.
- One text prompt corresponds to one subject per image.
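Because each prompt selects one subject, analyzing several people in the same image presumably means repeating that image once per description. The sketch below builds the parallel lists for such a batch and adds the recommended "appearance: " prefix when it is missing. `build_prompt_batch` is a hypothetical helper, and pairing a repeated image with different prompts is an assumption based on the batched inference interface above, not a documented feature.

```python
def build_prompt_batch(image_paths, descriptions_per_image, prefix="appearance: "):
    """Flatten per-image description lists into parallel (paths, texts) lists.

    Hypothetical helper: each image path is repeated once per description so
    that every (image, text) pair targets exactly one subject.
    """
    paths, texts = [], []
    for path, descriptions in zip(image_paths, descriptions_per_image):
        for description in descriptions:
            text = description.strip()
            # Add the recommended prefix only when it is not already there.
            if not text.startswith(prefix):
                text = prefix + text
            paths.append(path)
            texts.append(text)
    return paths, texts


if __name__ == "__main__":
    paths, texts = build_prompt_batch(
        ["img1.jpg", "img2.jpg"],
        [["child in red shirt", "woman with glasses"], ["appearance: man in a gray coat"]],
    )
    for path, text in zip(paths, texts):
        print(path, "->", text)
```

The resulting `paths` can be opened with PIL and passed as `images`, with `texts` passed as `text`, in the same way as the batched inference example.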
Limitations
- This release supports text concepts only; visual point prompts described in the paper are not yet exposed in the public inference API.
- Performance depends on the quality of the text description and subject visibility.
- The model is intended for research use. See the License for usage restrictions.
License
This model is released under the GazeAnywhere License. By using or distributing this model, you agree to the terms therein, including:
- Acknowledging GazeAnywhere in publications that use this model
- Complying with applicable laws and trade controls
- Not using the model for military, warfare, or weapons development purposes
Citation
If you use GazeAnywhere in your research, please cite:
@inproceedings{cao2026gaze,
title={Gaze Target Estimation Anywhere with Concepts},
author={Cao, Xu and Yang, Houze and Gunda, Vipin and Zhou, Zhongyi and Xu, Tianyu and Kowdle, Adarsh and Kim, Inki and Rehg, James M},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={31304--31315},
year={2026}
}
Acknowledgements
GazeAnywhere is developed by the UIUC Rehg Lab and Google AR. The implementation builds on DINOv3, DINO-Txt, and the Hugging Face Transformers integration pattern popularized by SAM 3.