GAR-8B

This repository contains the GAR-8B model, as presented in the paper Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs.

TL;DR: Grasp Any Region (GAR) supports both (1) describing in detail a single region of an image or video, where the region is specified by points, boxes, scribbles, or masks, and (2) understanding multiple regions jointly, such as modeling their interactions and performing complex reasoning over them. We also release a new benchmark, GARBench, to evaluate models on advanced region-level understanding tasks.
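To make the region prompt types above concrete, here is a minimal sketch of how points and boxes can be rasterized into a binary mask, so that every prompt type shares one representation. This is an illustration only and is not taken from the GAR codebase: the function names (`box_to_mask`, `points_to_mask`, `mask_area`), the end-exclusive `(x0, y0, x1, y1)` box convention, and the `radius` parameter are all assumptions. Refer to the GitHub repo for the model's actual input format.

```python
from typing import List, Sequence, Tuple

# Row-major binary mask: mask[y][x] is 1 inside the region, 0 outside.
Mask = List[List[int]]


def box_to_mask(box: Tuple[int, int, int, int], height: int, width: int) -> Mask:
    """Rasterize an (x0, y0, x1, y1) box (end-exclusive, an assumed convention)."""
    x0, y0, x1, y1 = box
    # Clamp to the image bounds so partially out-of-range boxes still work.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (y0 <= y < y1 and x0 <= x < x1) else 0 for x in range(width)]
        for y in range(height)
    ]


def points_to_mask(
    points: Sequence[Tuple[int, int]], height: int, width: int, radius: int = 1
) -> Mask:
    """Mark a square of side 2 * radius + 1 around each (x, y) point."""
    mask = [[0] * width for _ in range(height)]
    for px, py in points:
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                mask[y][x] = 1
    return mask


def mask_area(mask: Mask) -> int:
    """Count the pixels that belong to the region."""
    return sum(sum(row) for row in mask)
```

For example, `box_to_mask((1, 1, 3, 4), height=5, width=5)` marks a region 2 pixels wide and 3 pixels tall. A scribble could be handled the same way as a dense sequence of points.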

Usage

For detailed usage of this model, please refer to our GitHub repo.

Model size: 10B params (Safetensors)

Tensor type: BF16

Model tree for HaochenWang/GAR-8B

Base model: facebook/Perception-LM-8B (this model is a finetune of it)

Collection including HaochenWang/GAR-8B

Grasp-Any-Region: models and datasets for Grasp-Any-Region (4 items, updated Oct 22, 2025)

Paper for HaochenWang/GAR-8B

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs (paper 2510.18876, published Oct 21, 2025)