Instructions to use yuuko-eth/LocateAnything-3B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use yuuko-eth/LocateAnything-3B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="yuuko-eth/LocateAnything-3B-GGUF",
	filename="LocateAnything-3B-BF16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use yuuko-eth/LocateAnything-3B-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M


vLLM

How to use yuuko-eth/LocateAnything-3B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "yuuko-eth/LocateAnything-3B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yuuko-eth/LocateAnything-3B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M

Ollama
How to use yuuko-eth/LocateAnything-3B-GGUF with Ollama:
```
ollama run hf.co/yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M
```

Unsloth Studio

How to use yuuko-eth/LocateAnything-3B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for yuuko-eth/LocateAnything-3B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for yuuko-eth/LocateAnything-3B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for yuuko-eth/LocateAnything-3B-GGUF to start chatting

Pi

How to use yuuko-eth/LocateAnything-3B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use yuuko-eth/LocateAnything-3B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M

Run Hermes

hermes


OpenClaw

How to use yuuko-eth/LocateAnything-3B-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use yuuko-eth/LocateAnything-3B-GGUF with Docker Model Runner:
```
docker model run hf.co/yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M
```

Lemonade

How to use yuuko-eth/LocateAnything-3B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull yuuko-eth/LocateAnything-3B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.LocateAnything-3B-GGUF-Q4_K_M

List all available models

lemonade list

LocateAnything-3B - GGUF

Model creator: NVIDIA
Original model: nvidia/LocateAnything-3B

GGUF conversions of NVIDIA's LocateAnything-3B (MoonViT-SO-400M vision tower + Eagle MLP + Qwen2.5-3B) for llama.cpp and its multimodal (mtmd) stack. The model is a vision-language grounding model: given an image and a natural-language query, it emits a label plus normalized (0–1000) bounding-box or point coordinates.

License: NVIDIA License — non-commercial / research use only (commercial use is reserved to NVIDIA and its affiliates). A complete copy of the upstream LICENSE is bundled in this repo and must be retained on redistribution, along with attribution to NVIDIA. See the original model card below.

About GGUF

GGUF is the model file format used by llama.cpp. Each conversion has two halves and both are required:

the LLM half (LocateAnything-3B-{quant}.gguf), and
the mmproj vision projector (mmproj-LocateAnything-3B-BF16.gguf).

You pair one LLM quant with the mmproj.
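
As a rough illustration of that pairing, the sketch below maps a quant name to the two files it needs and, optionally, fetches both with `huggingface_hub`. The helper names `files_for` and `download_pair` are hypothetical; the file-name pattern and quant list are taken from this card, and `hf_hub_download` is the standard Hub download call.

```python
REPO_ID = "yuuko-eth/LocateAnything-3B-GGUF"
MMPROJ_FILE = "mmproj-LocateAnything-3B-BF16.gguf"
QUANTS = ("BF16", "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M")  # quants shipped in this repo


def files_for(quant: str = "Q4_K_M") -> tuple[str, str]:
    """Return (LLM file, mmproj file) for one quant; both halves are required."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {QUANTS}")
    return f"LocateAnything-3B-{quant}.gguf", MMPROJ_FILE


def download_pair(quant: str = "Q4_K_M") -> tuple[str, str]:
    """Download both halves and return their local paths (needs network access)."""
    # Imported lazily so files_for() works without the package installed.
    from huggingface_hub import hf_hub_download

    llm_file, mmproj_file = files_for(quant)
    return (
        hf_hub_download(repo_id=REPO_ID, filename=llm_file),
        hf_hub_download(repo_id=REPO_ID, filename=mmproj_file),
    )
```

The two returned paths are what you would pass to `-m` and `--mmproj` respectively.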

Provided files

File	Quant	Size	Accuracy vs BF16	Notes
`LocateAnything-3B-BF16.gguf`	BF16	6.4 GB	baseline	reference
`LocateAnything-3B-Q8_0.gguf`	Q8_0	3.6 GB	≤0.5 norm-units	minimal quant noise
`LocateAnything-3B-Q6_K.gguf`	Q6_K	2.8 GB	sub-pixel	near-BF16 quality
`LocateAnything-3B-Q5_K_M.gguf`	Q5_K_M	2.4 GB	sub-pixel	safety margin
`LocateAnything-3B-Q4_K_M.gguf`	Q4_K_M	2.1 GB	sub-pixel	recommended default
`mmproj-LocateAnything-3B-BF16.gguf`	BF16	0.87 GB	—	vision projector (required)

Accuracy figures are the mean absolute coordinate delta vs BF16 across a 5-landmark macOS-screenshot grounding suite (0–1000 normalized space), greedy decoding. Q8_0 ≤ 0.5; Q6_K / Q5_K_M / Q4_K_M all sub-pixel.
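
For readers who want to reproduce a comparable number, here is a minimal sketch of a mean absolute coordinate delta. It assumes the metric is averaged over every individual coordinate of every landmark; the exact aggregation used for the table is not stated here, and the sample values are illustrative, not measured results.

```python
def mean_abs_delta(reference: list[list[int]], candidate: list[list[int]]) -> float:
    """Mean absolute per-coordinate difference between two runs (0-1000 space)."""
    if len(reference) != len(candidate):
        raise ValueError("both runs must cover the same landmarks")
    deltas = []
    for ref_coords, cand_coords in zip(reference, candidate):
        if len(ref_coords) != len(cand_coords):
            raise ValueError("coordinate lists must have the same length")
        deltas.extend(abs(r - c) for r, c in zip(ref_coords, cand_coords))
    if not deltas:
        raise ValueError("no coordinates to compare")
    return sum(deltas) / len(deltas)


# Illustrative values only: two landmarks, one box each.
bf16_run = [[12, 1, 25, 22], [500, 480, 530, 510]]
quant_run = [[12, 1, 26, 22], [500, 481, 530, 510]]
print(mean_abs_delta(bf16_run, quant_run))  # 0.25
```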

Low-bit IQ2_M / IQ3_M are intentionally not shipped: a good calibration (imatrix) for a vision-grounding model needs image input, which the current llama.cpp imatrix tool doesn't support, and uncalibrated IQ2 regresses noticeably.

How to run

Requires the fork build. The LocateAnything mtmd integration is not yet in upstream llama.cpp. Build from github.com/yuuko-eth/llama.cpp @ mtmd-grounders — stock llama.cpp will not load these GGUFs.

Important — llama-server needs --special. Grounding output is emitted as control tokens (<ref>, </ref>, <box>, and coordinate tokens <0>…<999>). llama-server drops control tokens from responses unless launched with --special; without it you only get the plain-text label (e.g. "Apple logo") with no coordinates. llama-mtmd-cli shows them by default.
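
A quick way to notice a missing --special is to check whether a response contains any box tokens at all. The sketch below is a hypothetical client-side check (the name `has_grounding_tokens` is not part of llama.cpp); it only looks for the `<box>...</box>` pattern described above.

```python
import re

# One or more <N> coordinate tokens wrapped in <box>...</box>.
BOX_RE = re.compile(r"<box>(?:<\d+>)+</box>")


def has_grounding_tokens(text: str) -> bool:
    """True if the response carries at least one box or point."""
    return BOX_RE.search(text) is not None


print(has_grounding_tokens("<ref>Apple logo</ref><box><12><1><25><22></box>"))  # True
print(has_grounding_tokens("Apple logo"))  # False
```

A label-only response for a grounding prompt suggests the server was started without --special.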

llama-mtmd-cli

llama-mtmd-cli \
    -m LocateAnything-3B-Q4_K_M.gguf \
    --mmproj mmproj-LocateAnything-3B-BF16.gguf \
    --image screenshot.jpg \
    -p "Locate the Apple logo." \
    -ngl 99
# -> <ref>Apple logo</ref><box><12><1><25><22></box>

llama-server

llama-server \
    -m LocateAnything-3B-Q4_K_M.gguf \
    --mmproj mmproj-LocateAnything-3B-BF16.gguf \
    -ngl 99 --special                       # <-- required for grounding tokens

Then call /v1/chat/completions (OpenAI-style, image as a data: URL) or /completion. Output matches the CLI.
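
The following is a minimal client sketch for the chat endpoint using only the standard library. It assumes the server listens on http://127.0.0.1:8080 (the llama-server default) and returns the usual OpenAI-style response shape; `build_payload` and `locate` are hypothetical helper names, and `temperature: 0` is chosen here to approximate greedy decoding.

```python
import base64
import json
import urllib.request


def build_payload(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat request with the image inlined as a data: URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0,  # assumption: deterministic output is preferred for grounding
    }


def locate(image_path: str, prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """Send one image and prompt to a running llama-server and return the raw text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `locate("screenshot.jpg", "Locate the Apple logo.")` should return the same kind of `<ref>...</ref><box>...</box>` string as the CLI.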

Tip: cost is dominated by image encoding, which scales with resolution. Capping the longer image side to ~1280 px keeps grounding quality (coordinates are normalized) while cutting a multi-second encode to well under a second.
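
One way to apply that cap is to compute the target size first and let your image library do the resampling. The sketch below is an assumption-labelled helper (`capped_size` is a hypothetical name) that preserves aspect ratio and never upscales.

```python
def capped_size(width: int, height: int, max_side: int = 1280) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side, keeping aspect ratio."""
    longer = max(width, height)
    if longer <= max_side:
        return width, height  # already small enough; do not upscale
    scale = max_side / longer
    return max(1, round(width * scale)), max(1, round(height * scale))


print(capped_size(2560, 1440))  # (1280, 720)
print(capped_size(800, 600))    # (800, 600)
```

With Pillow, for example, `img.resize(capped_size(*img.size), Image.Resampling.LANCZOS)` would produce the smaller image. Because the model's coordinates are normalized, the parsed output can still be mapped onto the original full-resolution image.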

Prompt templates (from upstream)

Task	Prompt	Output
Box grounding	`Locate the {phrase}.`	`<box><x1><y1><x2><y2></box>`
Region / GUI box	`Locate the region that matches the following description: {phrase}.`	bbox
Point / GUI point	`Point to: {phrase}.`	`<box><x><y></box>`
Multi-instance	`Locate all the instances that match the following description: {phrase}.`	many bboxes
Text grounding	`Please locate the text referred as {phrase}.`	bbox

Parsing the output

import re

PAT = re.compile(r"<ref>(.*?)</ref><box>((?:<\d+>)+)</box>")

def parse(text):
    results = []
    for label, coords in PAT.findall(text):
        nums = [int(n) for n in re.findall(r"<(\d+)>", coords)]  # normalized 0-1000
        results.append({"label": label, "coords": nums})         # 4 = bbox, 2 = point
    return results
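
To draw or click on the result you usually need pixel coordinates. The sketch below converts one parsed `coords` list into pixels, dividing by 1000 as the upstream `parse_boxes` helper in the original model card does. The function name `to_pixels` and the returned dictionary layout are illustrative choices.

```python
def to_pixels(coords: list[int], width: int, height: int) -> dict:
    """Map normalized coordinates to pixels: 4 values = bbox, 2 values = point."""
    if len(coords) == 4:
        x1, y1, x2, y2 = coords
        return {
            "type": "box",
            "x1": x1 * width / 1000,
            "y1": y1 * height / 1000,
            "x2": x2 * width / 1000,
            "y2": y2 * height / 1000,
        }
    if len(coords) == 2:
        x, y = coords
        return {"type": "point", "x": x * width / 1000, "y": y * height / 1000}
    raise ValueError(f"expected 2 or 4 coordinates, got {len(coords)}")


# The box from the CLI example above, on a hypothetical 2000x1000 screenshot.
print(to_pixels([12, 1, 25, 22], 2000, 1000))
# {'type': 'box', 'x1': 24.0, 'y1': 1.0, 'x2': 50.0, 'y2': 22.0}
```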

Known issues

llama-server requires --special (see above) — the single most common gotcha.
Vision head_dim=72: some CUDA flash-attention backends assert on this head dim. llama.cpp auto-falls back, but if your backend asserts, add --no-flash-attn.
Tested single-image / single-turn. Multi-image batching in llama-server is not validated.

Provenance

Derived from nvidia/LocateAnything-3B.
GGUF conversion + mtmd integration: github.com/yuuko-eth/llama.cpp @ mtmd-grounders (fork of llama.cpp; not yet upstreamed).
AI-assistance disclosure: the GGUF conversion and packaging were assisted by Claude Code.

Original model card: nvidia/LocateAnything-3B

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding


🔗 Quick Links

🚀 Online Demo: LocateAnything (Hugging Face Spaces)
💻 GitHub Code: NVlabs/Eagle/Embodied
📄 Paper: arXiv:2605.27365

Model Overview

Description:

LocateAnything is a vision-language model for fast and high-quality visual grounding, enabling precise object localization, dense detection, and point-based localization across diverse domains in both Enterprise Intelligence and Physical AI. The model adopts a generalist design, supporting tasks such as referring expression grounding, multi-object detection, GUI element grounding, and text localization, with strong performance in complex and cluttered scenes.

Its core innovation, Parallel Box Decoding (PBD), predicts complete bounding box coordinates in a single parallel step rather than autoregressive token-by-token decoding, improving efficiency while preserving geometric consistency. This enables up to 2.5× higher throughput compared to prior approaches.

The model is trained on a large-scale multi-domain dataset (12M images, 138M+ queries, 785M bounding boxes) spanning natural scenes, robotics, driving, GUI interaction, and document understanding. It serves as a foundation for generalist multimodal perception and has been integrated into NVIDIA’s frontier production-grade vision-language models, such as Nemotron 3 Nano Omni, supporting grounding, GUI understanding, and multimodal agentic capabilities.

LocateAnything is developed as part of the Eagle VLM model family. This model is for research and development only.


License/Terms of Use:

This model is released under the NVIDIA License for non-commercial use, which permits use, reproduction, and modification for academic and non-profit research purposes only. Commercial use is not permitted, except by NVIDIA and its affiliates. Redistribution must retain the license and all applicable copyright and attribution notices. The model is provided “as is” without warranty of any kind, and users assume all associated risks.

This model is built using components from third-party models with their respective licenses:

Language model: Qwen2.5-3B-Instruct (Qwen Research License)
Vision encoder: MoonViT-SO-400M (MIT License)

Models are improved using Qwen.

Deployment Geography:

Global

Use Case:

LocateAnything-3B is intended for developers and researchers building vision-language models and applications that require fast and precise visual localization from natural language instructions.

Supported use cases include:

Open-set, common, and long-tail object detection
Dense multi-object detection in cluttered scenes
Phrase and referring-expression grounding
Automated dataset labeling and annotation (e.g., detection, grounding, pointing)
GUI element grounding for interactive and agentic systems
Robotics and autonomous driving perception
Document understanding, layout grounding, and OCR localization
Industrial inspection, surveillance, and remote sensing applications
Point-based localization and fine-grained spatial reasoning

Release Date:

Github [05/26/2026] via https://github.com/NVlabs/Eagle/tree/main/Embodied.
Hugging Face [05/26/2026] via https://huggingface.co/nvidia/LocateAnything-3B.
Demo [05/26/2026] via https://huggingface.co/spaces/nvidia/LocateAnything.
Webpage [05/26/2026] via https://research.nvidia.com/labs/lpr/locate-anything/.
Tech Report [05/26/2026] via https://research.nvidia.com/labs/lpr/locate-anything/LocateAnything.pdf

Reference(s):

Wang et al., LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding, NVIDIA Tech Report, 2026.
Kimi Team, Kimi-VL Technical Report, arXiv:2504.07491, 2025.
Qwen Team, Qwen2.5: A Party of Foundation Models, Qwen Blog, 2024.
Chen et al., Pix2Seq: A Language Modeling Framework for Object Detection, ICLR, 2022.
Jiang et al., Detect Anything via Next Point Prediction, arXiv:2510.12798, 2025.
Liu et al., Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, arXiv:2303.05499, 2023.
Lin et al., Microsoft COCO: Common Objects in Context, ECCV, 2014.
Gupta et al., LVIS: A Dataset for Large Vocabulary Instance Segmentation, CVPR, 2019.
Li et al., ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use, ACM MM, 2025.

Model Architecture:

Architecture Type: Transformer-based vision-language model (VLM).

Network Architecture: Native-resolution VLM with the following components:

Vision encoder: MoonViT
Language model: Qwen2.5-3B-Instruct
Multimodal projector: MLP projector
Output formulation: Block-based structure for visual grounding

Number of model parameters: 3B.

LocateAnything extends a vision-language model with Parallel Box Decoding (PBD), a block-wise multi-token prediction framework for efficient visual grounding. Instead of autoregressive coordinate generation, the model predicts complete bounding boxes and points in parallel structured units, improving decoding efficiency while preserving geometric consistency. The architecture jointly optimizes next-token prediction and multi-token prediction to balance reasoning ability and parallel inference. Training follows a four-stage pipeline: initial multimodal knowledge adaptation using captioning, VQA, OCR, and related data, followed by grounding and dense-scene localization fine-tuning.

Input(s):

Input Type(s): Image and Text.

Input Format(s):

Image: RGB image input with original source resolution.
Text: Natural-language prompt or task template, such as object categories, referring expressions, GUI instructions, OCR/layout requests, or pointing queries.

Input Parameters:

Image: Two-Dimensional (2D)
Text: One-Dimensional (1D)

Other Properties Related to Input:

Production use supports image resolutions up to 2.5K.
Prompt length supports up to 24K tokens.
Training detection and grounding stages use a maximum sequence length of 25,600 tokens.
Inference supports up to 8,192 newly generated tokens.

Output(s):

Output Type(s): Text.

Output Format(s):

Text: Model-generated token sequence containing semantic labels and structured coordinate tokens, such as bounding boxes (<box> x1, y1, x2, y2 </box>) and points (<box> x, y </box>).

Output Parameters:

Text: One-Dimensional (1D)
Bounding boxes/points: Two-Dimensional (2D) spatial coordinates

Other Properties Related to Output:

Outputs are organized into fixed-length blocks (length 6), including Semantic, Box, Negative, and End blocks.
A Box block encodes quantized spatial coordinates with structural tokens; unused positions are padded with <null>.
Fast Mode predicts box-aligned blocks in parallel; Slow Mode uses autoregressive decoding; Hybrid Mode defaults to parallel decoding with fallback to autoregressive decoding for format irregularity or spatial ambiguity.

Our AI models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves improved training and inference performance compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

Transformers. The inference setup uses standard VLM generation with BF16 precision and KV cache. TensorRT, TensorRT-LLM, and Triton are not yet supported.

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere (e.g., A100)
NVIDIA Blackwell
NVIDIA Hopper (e.g., H100)
NVIDIA Lovelace (e.g., L40, RTX 4090)

Deployment on embedded platforms such as NVIDIA Thor is possible with additional model optimization, including quantization, compression, or distillation. Other architectures may be supported depending on available memory, precision support, and software configuration.

Supported Operating System(s):

Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

LocateAnything-3B: 3B-parameter research model variant evaluated in Hybrid Mode by default. Fast, Hybrid, and Slow inference modes are supported by the same model formulation.

LocateAnything-3B can be integrated into systems that require spatial grounding from natural language, such as GUI agents, robotics/embodied agents, document-understanding pipelines, OCR/text localization, and open-world detection workflows.

Training, Testing, and Evaluation Datasets:

Data Modality:

Image and Text.

Image
Text

Training Data Size:

Image Training Data Size:

1 Million to 1 Billion Images - 12M unique images.

Text Training Data Size:

1 Billion to 10 Trillion Tokens - Derived from approximately 140M natural-language queries.

Data Collection Method by dataset:

Hybrid: Human, Automated
Data is collected from human-curated and open-source datasets, as well as automated ingestion of publicly available data sources.

Labeling Method by dataset:

Hybrid: Human, Synthetic, Automated
Labeling includes original human or open-source annotations, along with model-assisted and synthetic annotation generation using Qwen3-VL, Molmo, SAM 3, and Rex-Omni, with automated post-verification.

Properties: The training data consists of supervised fine-tuning (SFT) datasets with multimodal inputs, primarily image-text pairs and structured annotations such as bounding boxes, points, and negative samples.

The data spans multiple domains, including grounding, open-world grounding, general and dense object detection, scene text detection, GUI understanding and grounding, document layout understanding, and OCR.

Modalities include visual inputs (images) and natural-language queries or instructions. The dataset is derived from a mixture of publicly available academic datasets, along with model-assisted and synthetic annotations. It may include publicly available and potentially copyrighted content; users are responsible for ensuring compliance with applicable usage rights.

The linguistic content primarily consists of short, task-oriented natural-language expressions, such as object categories, referring expressions, GUI instructions, OCR queries, and grounding prompts, typically in English.

Evaluation Dataset:

Data Collection Method by dataset:

Hybrid: Human, Automated

Labeling Method by dataset:

Hybrid: Human, Synthetic, Automated

Properties: The evaluation datasets consist of publicly available benchmarks spanning visual grounding, object detection, document understanding, scene text detection, and GUI-related tasks. Modalities include image inputs paired with natural-language queries and structured annotations such as bounding boxes and points.

The evaluation suite covers both box-level and point-level grounding tasks, with approximately 48K images for box evaluation and 35K images for point evaluation across multiple datasets. These datasets span diverse domains including natural scenes, documents, aerial imagery, and human-centric interactions, enabling comprehensive assessment of localization accuracy and robustness.

Evaluation queries are typically short, task-oriented natural-language expressions such as referring phrases, object categories, and grounding prompts.

Performance is measured using box-based F1 at IoU thresholds of 0.5 and 0.95, as well as mean IoU for detection, layout, and OCR tasks. Point-based localization is evaluated based on whether predicted points fall within ground-truth segmentation masks or bounding boxes. Inference efficiency is reported in boxes per second (BPS) on a single NVIDIA H100 GPU with batch size 1.
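
The metrics named above can be sketched as follows. IoU and the point-in-box test are standard definitions; the greedy one-to-one matching used for F1 is an assumption, since the matching procedure of the actual evaluation is not described here. All function names are illustrative.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: list[Box], gts: list[Box], threshold: float = 0.5) -> float:
    """F1 with greedy matching: each ground-truth box can match at most one prediction."""
    unmatched = list(gts)
    true_positives = 0
    for pred in preds:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(preds)
    recall = true_positives / len(gts)
    return 2 * precision * recall / (precision + recall)


def point_in_box(point: tuple[float, float], box: Box) -> bool:
    """Point-based hit test against a ground-truth bounding box."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]
```

For example, one correct and one spurious prediction against a single ground-truth box gives precision 0.5 and recall 1.0, so F1 is about 0.67.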

Quantitative Evaluation Benchmarks

General Object Detection

Dense Object Detection

GUI Understanding

Layout Grounding and OCR

Referring Expression Grounding

Pointing

Inference:

Test Hardware: H100

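We suggest using max_new_tokens=8192 and generation_mode="hybrid" to avoid truncated responses and to balance speed with accuracy. Note that the worker below defaults to max_new_tokens=2048, so pass max_new_tokens=8192 explicitly.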

Installation

pip install opencv-python-headless==4.11.0.86 transformers==4.57.1 numpy==1.25.0 Pillow==11.1.0 peft torchvision decord==0.6.0 lmdb==1.7.5

PyTorch (torch) must be installed separately according to your CUDA version. See pytorch.org/get-started.

Optional — MagiAttention (Hopper / Blackwell GPUs only, recommended for faster MTP inference):

git clone https://github.com/SandAI-org/MagiAttention.git
cd MagiAttention
git checkout v1.0.5
git submodule update --init --recursive
pip install -r requirements.txt
pip install --no-build-isolation .

If MagiAttention is installed, the model will automatically use it for efficient MTP block-diffusion attention. If not installed, it will fall back to PyTorch SDPA — fully functional but slower for MTP decoding.

Worker (recommended)

Below is a self-contained worker that loads the model once and serves perception queries via a unified predict() plus task-specific convenience methods. You can drop this class into any FastAPI / gRPC / Triton serving framework.

import re
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoProcessor


class LocateAnythingWorker:
    """Stateful worker that loads the model once and serves perception queries."""

    def __init__(self, model_path: str, device: str = "cuda", dtype=torch.bfloat16):
        self.device = device
        self.dtype = dtype

        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=dtype,
            trust_remote_code=True,
        ).to(device).eval()

    @torch.no_grad()
    def predict(
        self,
        image: Image.Image,
        question: str,
        generation_mode: str = "hybrid",   # "fast" (MTP) | "slow" (NTP/AR) | "hybrid"
        max_new_tokens: int = 2048,
        temperature: float = 0.7,
        verbose: bool = True,
    ) -> dict:
        messages = [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ]}
        ]

        text = self.processor.py_apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        images, videos = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text], images=images, videos=videos, return_tensors="pt"
        ).to(self.device)

        pixel_values = inputs["pixel_values"].to(self.dtype)
        input_ids = inputs["input_ids"]
        image_grid_hws = inputs.get("image_grid_hws", None)

        response = self.model.generate(
            pixel_values=pixel_values,
            input_ids=input_ids,
            attention_mask=inputs["attention_mask"],
            image_grid_hws=image_grid_hws,
            tokenizer=self.tokenizer,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            generation_mode=generation_mode,
            temperature=temperature,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.1,
            verbose=verbose,
        )

        result = {"answer": response[0] if isinstance(response, tuple) else response}
        if isinstance(response, tuple) and len(response) >= 3:
            result["history"] = response[1]
            result["stats"] = response[2]
        return result

    # ---- Convenience methods for each task ----

    def detect(self, image: Image.Image, categories: list[str], **kwargs) -> dict:
        """Object detection / document layout analysis."""
        cats = "</c>".join(categories)
        prompt = f"Locate all the instances that matches the following description: {cats}."
        return self.predict(image, prompt, **kwargs)

    def ground_single(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Phrase grounding — single instance."""
        prompt = f"Locate a single instance that matches the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def ground_multi(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Phrase grounding — multiple instances."""
        prompt = f"Locate all the instances that match the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def ground_text(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Text grounding."""
        prompt = f"Please locate the text referred as {phrase}."
        return self.predict(image, prompt, **kwargs)

    def detect_text(self, image: Image.Image, **kwargs) -> dict:
        """Scene text detection."""
        prompt = "Detect all the text in box format."
        return self.predict(image, prompt, **kwargs)

    def ground_gui(self, image: Image.Image, phrase: str, output_type: str = "box", **kwargs) -> dict:
        """GUI grounding (box or point)."""
        if output_type == "point":
            prompt = f"Point to: {phrase}."
        else:
            prompt = f"Locate the region that matches the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def point(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Pointing."""
        prompt = f"Point to: {phrase}."
        return self.predict(image, prompt, **kwargs)

    # ---- Utility: parse model output ----

    @staticmethod
    def parse_boxes(answer: str, image_width: int, image_height: int) -> list[dict]:
        """Parse model output into pixel-coordinate bounding boxes.

        Coordinates in model output are normalized integers in [0, 1000].
        """
        boxes = []
        for m in re.finditer(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>", answer):
            x1, y1, x2, y2 = [int(g) for g in m.groups()]
            boxes.append({
                "x1": x1 / 1000 * image_width,
                "y1": y1 / 1000 * image_height,
                "x2": x2 / 1000 * image_width,
                "y2": y2 / 1000 * image_height,
            })
        return boxes

    @staticmethod
    def parse_points(answer: str, image_width: int, image_height: int) -> list[dict]:
        """Parse model output into pixel-coordinate points."""
        points = []
        for m in re.finditer(r"<box><(\d+)><(\d+)></box>", answer):
            x, y = int(m.group(1)), int(m.group(2))
            points.append({
                "x": x / 1000 * image_width,
                "y": y / 1000 * image_height,
            })
        return points

Usage Example

worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("example.jpg").convert("RGB")

# Object Detection
result = worker.detect(img, ["person", "car", "bicycle"])
print("Detection:", result["answer"])

# Phrase Grounding (multiple)
result = worker.ground_multi(img, "people wearing red shirts")
print("Grounding:", result["answer"])

# Scene Text Detection
result = worker.detect_text(img)
print("Text Detection:", result["answer"])

# Pointing
result = worker.point(img, "the traffic light")
print("Pointing:", result["answer"])

# GUI Grounding (point)
result = worker.ground_gui(img, "the search button", output_type="point")
print("GUI Point:", result["answer"])

# Parse structured output into pixel coordinates
w, h = img.size
boxes = LocateAnythingWorker.parse_boxes(result["answer"], w, h)
points = LocateAnythingWorker.parse_points(result["answer"], w, h)

Supported Tasks & Prompt Templates

Task	Worker Method	Output	Prompt Template
Object Detection	`worker.detect(img, [...])`	Box	`Locate all the instances that matches the following description: [CATEGORIES].`
Phrase Grounding (single)	`worker.ground_single(img, phrase)`	Single Box	`Locate a single instance that matches the following description: [PHRASE].`
Phrase Grounding (multi)	`worker.ground_multi(img, phrase)`	Multiple Boxes	`Locate all the instances that match the following description: [PHRASE].`
Text Grounding	`worker.ground_text(img, phrase)`	Box	`Please locate the text referred as [PHRASE].`
Scene Text Detection	`worker.detect_text(img)`	Box	`Detect all the text in box format.`
Document Layout Analysis	`worker.detect(img, [...])`	Box	`Locate all the instances that matches the following description: [CATEGORIES].`
GUI Grounding (box)	`worker.ground_gui(img, phrase, "box")`	Box	`Locate the region that matches the following description: [PHRASE].`
GUI Grounding (point) / Pointing	`worker.ground_gui(img, phrase, "point")` / `worker.point(img, phrase)`	Point	`Point to: [PHRASE].`

[PHRASE] is a free-form natural-language description; [CATEGORIES] is a comma-separated list of category names. Multiple categories may also be joined with </c>, which is what worker.detect() does.

Generation Modes

Mode	Description	Speed	Accuracy
`fast`	MTP only, never falls back to AR	Fastest	Good for simple scenes
`slow`	Pure auto-regressive decoding	Slowest	Most robust
`hybrid` (default)	MTP first, falls back to AR on uncertain boxes, switches back after box boundary	Balanced	Best overall

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.