GazeAnywhere: Gaze Target Estimation Anywhere with Concepts

GazeAnywhere is a foundation model for Promptable Gaze Target Estimation (PGE). Given an image and a natural-language description of a person (a text concept), the model jointly predicts:

the subject's head bounding box
whether their gaze target is inside or outside the image frame
a gaze target heatmap

This checkpoint is the `gazeanywhere_dinov3txt_large_text_concept` variant: a frozen DINOv3 ViT-L/16 and DINO-Txt backbone paired with a 6-layer fusion transformer, trained on large-scale in-the-wild gaze data with appearance text prompts.

Venue: CVPR 2026
Code: github.com/IrohXu/GazeAnywhere

Model Description

Estimating human gaze targets from in-the-wild images is challenging. Prior methods often rely on brittle multi-stage pipelines that require explicit head boxes or pose estimates, and they do not support flexible natural-language subject specification.

GazeAnywhere addresses this with an end-to-end, concept-driven design. You specify who to analyze with a short text prompt (e.g. "blonde hair girl in a blue striped shirt"), and the model localizes that subject and estimates where they are looking, with no separate detection or pose stage.

The architecture fuses frozen vision–language features from DINOv3 + DINO-Txt through a transformer decoder and predicts three outputs in one forward pass: head box, in/out-of-frame gaze presence, and a 64×64 gaze heatmap.

Model Details


Variant	`gazeanywhere_dinov3txt_large_text_concept`
Vision backbone	DINOv3 ViT-L/16 (frozen)
Text backbone	DINO-Txt (CLIP-style BPE, 77 tokens)
Fusion transformer	6 layers, dim = 512
Input resolution	512 × 512
Heatmap resolution	64 × 64
Prompt type	Text appearance description
Format	Hugging Face Transformers (`trust_remote_code=True`)

Installation

pip install "torch>=2.0" "transformers>=4.56.0" pillow opencv-python-headless

To run the visualization example from the GazeAnywhere repo, also clone the repository so the model can import its backbone code at runtime:

git clone https://github.com/IrohXu/GazeAnywhere.git
cd GazeAnywhere
pip install -r requirements.txt

Note: This model uses custom Transformers classes shipped in the repository (configuration_gazeanywhere.py, modeling_gazeanywhere.py, processing_gazeanywhere.py). Loading requires trust_remote_code=True.

Usage with 🤗 Transformers

Text-Conditioned Gaze Estimation

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "IrohXu/GazeAnywhere"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

image = Image.open("your_image.jpg").convert("RGB")
text = "appearance: light brown hair girl with blue and white striped shirt"

inputs = processor(images=image, text=text, return_tensors="pt")
inputs = {k: v.to(device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_gaze_estimation(
    outputs,
    target_sizes=inputs["original_sizes"].tolist(),
    use_dark_inference=True,
)[0]

print("Gaze point (x, y):", results["gaze"])
print("Gaze in frame:", results["inout"])
print("In/out score:", results["inout_score"])
print("Head bbox [x1, y1, x2, y2]:", results["head_bbox"])

Batched Inference

images = [Image.open(p).convert("RGB") for p in ["img1.jpg", "img2.jpg"]]
texts = [
    "child in red shirt",
    "woman with glasses",
]

inputs = processor(images=images, text=texts, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_gaze_estimation(
    outputs,
    target_sizes=inputs["original_sizes"].tolist(),
    use_dark_inference=True,
)

for i, result in enumerate(results):
    print(f"Image {i}: gaze={result['gaze']}, inout={result['inout']}")

Visualization

Save an overlay with heatmap, head box, gaze line, and gaze point using the visualization script from the code repository:

python examples/hf_visualization.py \
  --model-dir IrohXu/GazeAnywhere \
  --image-path your_image.jpg \
  --text "appearance: child with golden hair" \
  --save-path visualization.jpg

Green overlays indicate in-frame gaze; red head boxes indicate out-of-frame gaze.
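
The color rule and the gaze line can be sketched without any drawing library. The helper below is a hypothetical illustration, not the code in examples/hf_visualization.py: it reads the `gaze`, `inout`, and `head_bbox` fields of a post-processed result and returns the overlay color plus the line from the head-box center to the gaze point. The function name, the RGB values, and the choice to start the line at the box center are assumptions.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]
Color = Tuple[int, int, int]

GREEN: Color = (0, 255, 0)  # assumed RGB value for in-frame gaze
RED: Color = (255, 0, 0)    # assumed RGB value for out-of-frame gaze


def overlay_geometry(result: dict) -> Tuple[Color, Optional[Tuple[Point, Point]]]:
    """Return (color, gaze_line) for one post-processed result.

    gaze_line is (head_center, gaze_point) in pixels, or None when the
    gaze is predicted to fall outside the frame.
    """
    x1, y1, x2, y2 = result["head_bbox"]
    head_center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    if not result["inout"]:
        # Out-of-frame gaze: only a red head box, no line to draw.
        return RED, None
    gx, gy = result["gaze"]
    return GREEN, (head_center, (float(gx), float(gy)))


# Made-up result dictionary for illustration only.
example = {"gaze": (300.0, 220.0), "inout": True, "head_bbox": [100, 50, 180, 150]}
color, line = overlay_geometry(example)
print(color, line)
```

The returned points can be passed to any drawing backend, such as Pillow or OpenCV, both of which are in the install line above.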

Inputs and Outputs

Inputs

Input	Type	Description
`images`	`PIL.Image` or `list[PIL.Image]`	RGB image(s)
`text`	`str` or `list[str]`	Appearance description of the target person. Prefixing with `"appearance: "` is recommended.

The processor resizes images to 512×512, applies ImageNet normalization, and tokenizes text with the bundled CLIP-style BPE vocabulary (max 77 tokens).
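
As a rough illustration of the normalization step, the NumPy sketch below converts an RGB array that has already been resized to 512×512 into a channel-first float tensor. It assumes the commonly used ImageNet mean and standard deviation and a channel-first layout; the values and layout actually used by the bundled processor have not been checked here, and the resize step is omitted. `normalize_image` is a hypothetical name.

```python
import numpy as np

# Commonly used ImageNet statistics; assumed here, not read from the checkpoint.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_image(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB array to a (3, H, W) float32 array."""
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("expected an (H, W, 3) RGB array")
    scaled = rgb.astype(np.float32) / 255.0               # [0, 255] -> [0, 1]
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD  # per-channel shift and scale
    return np.transpose(normalized, (2, 0, 1))            # HWC -> CHW


# A mid-gray stand-in for an image already resized to 512x512.
dummy = np.full((512, 512, 3), 128, dtype=np.uint8)
pixel_values = normalize_image(dummy)
print(pixel_values.shape, pixel_values.dtype)
```

In normal use the processor does all of this for you; the sketch is only meant to show what the tensor passed to the model roughly looks like.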

Raw model outputs

Output	Shape	Description
`heatmaps`	`(B, 1, 64, 64)`	Sigmoid gaze heatmap
`inout_logits`	`(B,)`	Probability that gaze is inside the frame (sigmoid already applied, despite the `logits` name)
`pred_boxes`	`(B, 4)`	Head box as normalized cx, cy, w, h in [0, 1]
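
If you work with the raw `pred_boxes` rather than the post-processed `head_bbox`, the conversion from normalized (cx, cy, w, h) to pixel [x1, y1, x2, y2] looks roughly like the sketch below. The function name is hypothetical, and clamping to the image bounds is an assumption; the processor's own post-processing may behave differently.

```python
from typing import Sequence, Tuple


def cxcywh_to_xyxy_pixels(
    box: Sequence[float], width: int, height: int
) -> Tuple[float, float, float, float]:
    """Map a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2.0) * width
    y1 = (cy - h / 2.0) * height
    x2 = (cx + w / 2.0) * width
    y2 = (cy + h / 2.0) * height

    # Clamp to the image bounds (an assumption, see the note above).
    def clamp(value: float, upper: float) -> float:
        return max(0.0, min(float(value), float(upper)))

    return clamp(x1, width), clamp(y1, height), clamp(x2, width), clamp(y2, height)


# A centered box covering 20% of the width and 40% of the height of a 640x480 image.
head_box = cxcywh_to_xyxy_pixels([0.5, 0.5, 0.2, 0.4], width=640, height=480)
print(head_box)
```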

Post-processed results (`post_process_gaze_estimation`)

Field	Description
`gaze`	`(x, y)` gaze point in original image pixel coordinates
`inout`	`True` if gaze is predicted inside the frame
`inout_score`	Raw in/out probability
`head_bbox`	`[x1, y1, x2, y2]` head box in original image pixels
`heatmap`	Gaze heatmap resized to original image size

Set use_dark_inference=True (default) to apply DARK sub-pixel refinement on the heatmap argmax.
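
DARK (Distribution-Aware coordinate Representation of Keypoints) refines the integer argmax with a second-order Taylor expansion of the log-heatmap around the peak. The NumPy sketch below is a simplified illustration of that idea and not the processor's implementation: it skips the Gaussian smoothing step of the original method, and the function name, the `eps` floor, and the border fallback are assumptions.

```python
import numpy as np


def refine_argmax(heatmap: np.ndarray, eps: float = 1e-10) -> tuple:
    """Return a sub-pixel (x, y) peak location in heatmap coordinates."""
    h, w = heatmap.shape
    y, x = np.unravel_index(int(np.argmax(heatmap)), heatmap.shape)
    y, x = int(y), int(x)
    # The finite differences need a 2-pixel margin; otherwise keep the raw argmax.
    if not (2 <= x < w - 2 and 2 <= y < h - 2):
        return float(x), float(y)
    log_hm = np.log(np.maximum(heatmap.astype(np.float64), eps))
    # First and second derivatives of the log-heatmap at the peak.
    dx = 0.5 * (log_hm[y, x + 1] - log_hm[y, x - 1])
    dy = 0.5 * (log_hm[y + 1, x] - log_hm[y - 1, x])
    dxx = 0.25 * (log_hm[y, x + 2] - 2 * log_hm[y, x] + log_hm[y, x - 2])
    dyy = 0.25 * (log_hm[y + 2, x] - 2 * log_hm[y, x] + log_hm[y - 2, x])
    dxy = 0.25 * (log_hm[y + 1, x + 1] - log_hm[y - 1, x + 1]
                  - log_hm[y + 1, x - 1] + log_hm[y - 1, x - 1])
    hessian = np.array([[dxx, dxy], [dxy, dyy]])
    if abs(np.linalg.det(hessian)) < 1e-12:
        return float(x), float(y)  # degenerate peak: no refinement
    # Newton step toward the maximum of the local quadratic fit.
    offset = -np.linalg.solve(hessian, np.array([dx, dy]))
    return float(x + offset[0]), float(y + offset[1])


# Synthetic 64x64 Gaussian heatmap with an off-grid peak at (20.3, 30.7).
ys, xs = np.mgrid[0:64, 0:64]
synthetic = np.exp(-((xs - 20.3) ** 2 + (ys - 30.7) ** 2) / (2 * 2.0 ** 2))
peak_x, peak_y = refine_argmax(synthetic)
print(peak_x, peak_y)
```

Scaling the refined coordinates by the ratio of the original image size to 64 gives an approximate pixel location. The exact coordinate convention used by `post_process_gaze_estimation` (for example, any half-pixel offset) is not specified in this card.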

Text Prompt Tips

Describe visible appearance cues: hair color, clothing, accessories, age cues.
Example: "appearance: blonde hair pulled back, black tank top, pink patterned shorts woman"
The model was trained on the appearance field from GazeAnywhere annotations; matching that style generally works best.
One text prompt corresponds to one subject per image.
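
A small helper can keep prompts in the recommended style. The sketch below is a hypothetical convenience function, not part of the GazeAnywhere API: it collapses stray whitespace and adds the `"appearance: "` prefix when it is missing.

```python
PREFIX = "appearance: "


def format_prompt(description: str) -> str:
    """Collapse whitespace and add the recommended prefix if it is missing."""
    cleaned = " ".join(description.split())
    if not cleaned:
        raise ValueError("description must not be empty")
    if cleaned.lower().startswith(PREFIX.strip()):
        return cleaned  # already prefixed, leave the wording untouched
    return PREFIX + cleaned


# One prompt per image, matching the one-subject-per-prompt rule above.
prompts = [format_prompt(t) for t in ["  child in   red shirt ", "appearance: woman with glasses"]]
print(prompts)
```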

Limitations

This release supports text concepts only; visual point prompts described in the paper are not yet exposed in the public inference API.
Performance depends on the quality of the text description and subject visibility.
The model is intended for research use. See the License for usage restrictions.

License

This model is released under the GazeAnywhere License. By using or distributing this model, you agree to the terms therein, including:

Acknowledging GazeAnywhere in publications that use this model
Complying with applicable laws and trade controls
Not using the model for military, warfare, or weapons development purposes

Citation

If you use GazeAnywhere in your research, please cite:

@inproceedings{cao2026gaze,
  title={Gaze Target Estimation Anywhere with Concepts},
  author={Cao, Xu and Yang, Houze and Gunda, Vipin and Zhou, Zhongyi and Xu, Tianyu and Kowdle, Adarsh and Kim, Inki and Rehg, James M},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={31304--31315},
  year={2026}
}

Acknowledgements

GazeAnywhere is developed by the UIUC Rehg Lab and Google AR. The implementation builds on DINOv3, DINO-Txt, and the Hugging Face Transformers integration pattern popularized by SAM 3.

Safetensors checkpoint: 0.9B parameters, F32 tensors.
