GazeAnywhere: Gaze Target Estimation Anywhere with Concepts

GazeAnywhere is a foundation model for Promptable Gaze Target Estimation (PGE). Given an image and a natural-language description of a person (a text concept), the model jointly predicts:

  • the subject's head bounding box
  • whether their gaze target is inside or outside the image frame
  • a gaze target heatmap

This checkpoint is the dinov3txt_large_text_concept variant: a frozen DINOv3 ViT-L/16 + DINO-Txt backbone with a 6-layer fusion transformer trained on large-scale in-the-wild gaze data with appearance text prompts.

Venue: CVPR 2026
Code: github.com/IrohXu/GazeAnywhere

Model Description

Estimating human gaze targets from in-the-wild images is challenging. Prior methods often rely on brittle multi-stage pipelines that require explicit head boxes or pose estimates, and they do not support flexible natural-language subject specification.

GazeAnywhere addresses this with an end-to-end, concept-driven design. You specify who to analyze with a short text prompt (e.g. "blonde hair girl in a blue striped shirt"), and the model localizes that subject and estimates where they are looking, with no separate detection or pose stage.

The architecture fuses frozen vision–language features from DINOv3 + DINO-Txt through a transformer decoder and predicts three outputs in one forward pass: head box, in/out-of-frame gaze presence, and a 64×64 gaze heatmap.

Model Details

Variant: gazeanywhere_dinov3txt_large_text_concept
Vision backbone: DINOv3 ViT-L/16 (frozen)
Text backbone: DINO-Txt (CLIP-style BPE, 77 tokens)
Fusion transformer: 6 layers, dim = 512
Input resolution: 512 × 512
Heatmap resolution: 64 × 64
Prompt type: Text appearance description
Format: Hugging Face Transformers (trust_remote_code=True)
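
The resolutions listed above are related by simple arithmetic, which can help when reasoning about memory use or heatmap granularity. The short sketch below only restates the listed numbers; how the decoder maps patch features to the 64 × 64 heatmap is not specified here, so the "upscale" value is just the ratio between the two grids.

```python
# Values taken from the Model Details list.
INPUT_SIZE = 512     # input resolution, pixels per side
PATCH_SIZE = 16      # patch size of ViT-L/16
HEATMAP_SIZE = 64    # heatmap resolution, cells per side

patches_per_side = INPUT_SIZE // PATCH_SIZE         # patch grid is 32 x 32
num_patch_tokens = patches_per_side ** 2            # 1024 patch tokens per image
heatmap_upscale = HEATMAP_SIZE // patches_per_side  # heatmap is 2x the patch grid per side
pixels_per_cell = INPUT_SIZE // HEATMAP_SIZE        # each heatmap cell spans 8 x 8 input pixels

print(patches_per_side, num_patch_tokens, heatmap_upscale, pixels_per_cell)
# prints: 32 1024 2 8
```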

Installation

pip install "torch>=2.0" "transformers>=4.56.0" pillow opencv-python-headless

To run the visualization example from the GazeAnywhere repo, also clone the repository so the model can import its backbone code at runtime:

git clone https://github.com/IrohXu/GazeAnywhere.git
cd GazeAnywhere
pip install -r requirements.txt

Note: This model uses custom Transformers classes shipped in the repository (configuration_gazeanywhere.py, modeling_gazeanywhere.py, processing_gazeanywhere.py). Loading requires trust_remote_code=True.

Usage with 🤗 Transformers

Text-Conditioned Gaze Estimation

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "IrohXu/GazeAnywhere"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

image = Image.open("your_image.jpg").convert("RGB")
text = "appearance: light brown hair girl with blue and white striped shirt"

inputs = processor(images=image, text=text, return_tensors="pt")
inputs = {k: v.to(device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_gaze_estimation(
    outputs,
    target_sizes=inputs["original_sizes"].tolist(),
    use_dark_inference=True,
)[0]

print("Gaze point (x, y):", results["gaze"])
print("Gaze in frame:", results["inout"])
print("In/out score:", results["inout_score"])
print("Head bbox [x1, y1, x2, y2]:", results["head_bbox"])

Batched Inference

# Reuses processor, model, and device from the previous example.
images = [Image.open(p).convert("RGB") for p in ["img1.jpg", "img2.jpg"]]
texts = [
    "child in red shirt",
    "woman with glasses",
]

inputs = processor(images=images, text=texts, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_gaze_estimation(
    outputs,
    target_sizes=inputs["original_sizes"].tolist(),
    use_dark_inference=True,
)

for i, result in enumerate(results):
    print(f"Image {i}: gaze={result['gaze']}, inout={result['inout']}")

Visualization

Save an overlay with heatmap, head box, gaze line, and gaze point using the visualization script from the code repository:

python examples/hf_visualization.py \
  --model-dir IrohXu/GazeAnywhere \
  --image-path your_image.jpg \
  --text "appearance: child with golden hair" \
  --save-path visualization.jpg

Green overlays indicate in-frame gaze; red head boxes indicate out-of-frame gaze.

Inputs and Outputs

Inputs

| Input | Type | Description |
| --- | --- | --- |
| images | PIL.Image or list[PIL.Image] | RGB image(s) |
| text | str or list[str] | Appearance description of the target person. Prefixing with "appearance: " is recommended. |

The processor resizes images to 512×512, applies ImageNet normalization, and tokenizes text with the bundled CLIP-style BPE vocabulary (max 77 tokens).
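
For readers who want to see what the normalization step amounts to, here is a minimal per-pixel sketch. It assumes the commonly used ImageNet channel statistics; the exact values used by the bundled processor are not listed here, and normalize_pixel is a hypothetical helper, not part of the processor API.

```python
# Commonly used ImageNet channel statistics in RGB order (assumed, not read from the processor).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to normalized floats: (value / 255 - mean) / std."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# A mid-gray pixel ends up close to zero in every channel.
print(normalize_pixel((124, 116, 104)))
```

In practice the processor handles this step, so the sketch is only useful for debugging or for reproducing the preprocessing outside Transformers.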

Raw model outputs

| Output | Shape | Description |
| --- | --- | --- |
| heatmaps | (B, 1, 64, 64) | Sigmoid gaze heatmap |
| inout_logits | (B,) | Sigmoid probability that gaze is inside the frame |
| pred_boxes | (B, 4) | Head box as normalized cx, cy, w, h in [0, 1] |
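
If you work with the raw pred_boxes rather than the post-processed head_bbox, the normalized center format has to be converted to pixel corners. The helper below is a minimal sketch of that conversion; cxcywh_to_xyxy is a hypothetical name, and clamping to the image bounds is an assumption rather than documented processor behavior.

```python
def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * image_width
    y1 = (cy - h / 2) * image_height
    x2 = (cx + w / 2) * image_width
    y2 = (cy + h / 2) * image_height

    # Clamp to the image bounds (an assumption; the processor may behave differently).
    def clamp(value, upper):
        return min(max(value, 0.0), float(upper))

    return [
        clamp(x1, image_width),
        clamp(y1, image_height),
        clamp(x2, image_width),
        clamp(y2, image_height),
    ]


print(cxcywh_to_xyxy((0.5, 0.5, 0.25, 0.5), 640, 480))
# prints: [240.0, 120.0, 400.0, 360.0]
```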

Post-processed results (post_process_gaze_estimation)

| Field | Description |
| --- | --- |
| gaze | (x, y) gaze point in original image pixel coordinates |
| inout | True if gaze is predicted inside the frame |
| inout_score | Raw in/out probability |
| head_bbox | [x1, y1, x2, y2] head box in original image pixels |
| heatmap | Gaze heatmap resized to original image size |
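
Downstream code often needs the gaze direction rather than just the gaze point. The sketch below derives a head-center-to-gaze vector from the post-processed fields. gaze_vector is a hypothetical helper, the example dictionary is made up for illustration, and the code assumes the fields are plain Python numbers or sequences (convert tensors first if the processor returns them).

```python
import math


def gaze_vector(result):
    """Return (head_center, angle_degrees, length_px), or None for out-of-frame gaze."""
    if not result["inout"]:
        return None
    x1, y1, x2, y2 = result["head_bbox"]
    head_center = ((x1 + x2) / 2, (y1 + y2) / 2)
    gaze_x, gaze_y = result["gaze"]
    dx = gaze_x - head_center[0]
    dy = gaze_y - head_center[1]
    # Image y grows downward, so positive angles point toward the bottom of the image.
    angle = math.degrees(math.atan2(dy, dx))
    return head_center, angle, math.hypot(dx, dy)


# Made-up values, for illustration only.
example = {
    "gaze": (400.0, 460.0),
    "inout": True,
    "inout_score": 0.93,
    "head_bbox": [80.0, 40.0, 120.0, 80.0],
}
print(gaze_vector(example))
```

Returning None for out-of-frame predictions keeps callers from drawing a gaze line toward a point the model considers unreliable.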

With use_dark_inference=True (the default), post-processing applies DARK sub-pixel refinement to the heatmap argmax.
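
To illustrate the idea behind this refinement, the sketch below takes one Newton step on the log-heatmap around the integer argmax, which is the core of DARK-style decoding. It is a simplified pure-Python illustration, not the processor's implementation: it skips the Gaussian smoothing that DARK applies before the log, and dark_refine is a hypothetical name.

```python
import math


def dark_refine(heatmap, eps=1e-10):
    """Refine the argmax of a 2D heatmap (list of rows) to sub-pixel (x, y)."""
    height, width = len(heatmap), len(heatmap[0])
    # Integer argmax over the whole map.
    py, px = max(
        ((y, x) for y in range(height) for x in range(width)),
        key=lambda yx: heatmap[yx[0]][yx[1]],
    )
    # Finite differences need a one-cell border; otherwise keep the integer peak.
    if not (1 <= px < width - 1 and 1 <= py < height - 1):
        return float(px), float(py)

    def log_at(y, x):
        return math.log(max(heatmap[y][x], eps))

    # Gradient and Hessian of the log-heatmap at the argmax (central differences).
    dx = 0.5 * (log_at(py, px + 1) - log_at(py, px - 1))
    dy = 0.5 * (log_at(py + 1, px) - log_at(py - 1, px))
    dxx = log_at(py, px + 1) - 2 * log_at(py, px) + log_at(py, px - 1)
    dyy = log_at(py + 1, px) - 2 * log_at(py, px) + log_at(py - 1, px)
    dxy = 0.25 * (
        log_at(py + 1, px + 1) - log_at(py + 1, px - 1)
        - log_at(py - 1, px + 1) + log_at(py - 1, px - 1)
    )
    det = dxx * dyy - dxy * dxy
    if abs(det) < 1e-12:
        return float(px), float(py)
    # Newton step: offset = -H^{-1} g for the 2x2 Hessian H and gradient g.
    offset_x = -(dyy * dx - dxy * dy) / det
    offset_y = -(dxx * dy - dxy * dx) / det
    return px + offset_x, py + offset_y


# Synthetic 64 x 64 Gaussian heatmap with an off-grid peak, for illustration.
peak_x, peak_y, sigma = 20.3, 41.7, 2.0
heatmap = [
    [math.exp(-((x - peak_x) ** 2 + (y - peak_y) ** 2) / (2 * sigma ** 2)) for x in range(64)]
    for y in range(64)
]
print(dark_refine(heatmap))
```

For an exact Gaussian the log-heatmap is quadratic, so the single step recovers the peak up to floating-point error, whereas the plain argmax would return (20, 42). Real heatmaps are only approximately Gaussian, so the gain there is smaller.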

Text Prompt Tips

  • Describe visible appearance cues: hair color, clothing, accessories, age cues.
  • Example: "appearance: blonde hair pulled back, black tank top, pink patterned shorts woman"
  • The model was trained on the appearance field from GazeAnywhere annotations; matching that style generally works best.
  • One text prompt corresponds to one subject per image.
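
The tips above can be wrapped in two small helpers: one that adds the recommended prefix when it is missing, and one that pairs a single image with several descriptions so that each subject gets its own prompt. Both names are hypothetical, and repeating the same image in a batch to query several subjects is an assumption based on the batched API, not a documented workflow.

```python
PREFIX = "appearance: "


def format_prompt(description):
    """Trim whitespace and add the recommended prefix unless it is already present."""
    description = description.strip()
    if description.lower().startswith(PREFIX.strip()):
        return description
    return PREFIX + description


def expand_subjects(image, descriptions):
    """Build parallel image/text lists: one (image, prompt) pair per subject."""
    texts = [format_prompt(d) for d in descriptions]
    images = [image] * len(texts)
    return images, texts


# "frame.jpg" stands in for a loaded PIL image here.
images, texts = expand_subjects("frame.jpg", ["child in red shirt", "appearance: woman with glasses"])
print(texts)
# prints: ['appearance: child in red shirt', 'appearance: woman with glasses']
```

The resulting lists can be passed as images= and text= to the processor in the same way as in the batched inference example.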

Limitations

  • This release supports text concepts only; visual point prompts described in the paper are not yet exposed in the public inference API.
  • Performance depends on the quality of the text description and subject visibility.
  • The model is intended for research use. See the License for usage restrictions.

License

This model is released under the GazeAnywhere License. By using or distributing this model, you agree to the terms therein, including:

  • Acknowledging GazeAnywhere in publications that use this model
  • Complying with applicable laws and trade controls
  • Not using the model for military, warfare, or weapons development purposes

Citation

If you use GazeAnywhere in your research, please cite:

@inproceedings{cao2026gaze,
  title={Gaze Target Estimation Anywhere with Concepts},
  author={Cao, Xu and Yang, Houze and Gunda, Vipin and Zhou, Zhongyi and Xu, Tianyu and Kowdle, Adarsh and Kim, Inki and Rehg, James M},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={31304--31315},
  year={2026}
}

Acknowledgements

GazeAnywhere is developed by the UIUC Rehg Lab and Google AR. The implementation builds on DINOv3, DINO-Txt, and the Hugging Face Transformers integration pattern popularized by SAM 3.

Safetensors model size: 0.9B params
Tensor type: F32