---
license: fair-noncommercial-research-license
language:
- en
base_model:
- facebook/Perception-LM-1B
library_name: transformers
datasets:
- HaochenWang/Grasp-Any-Region-Dataset
---

# GAR-1B

This repository contains the **GAR-1B** model, as presented in the paper [Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs](https://huggingface.co/papers/2510.18876).

**TL;DR:** Grasp Any Region (GAR) supports (1) describing in detail a single region of an image or video, where the region is specified by points, boxes, scribbles, or masks, and (2) understanding multiple regions jointly, such as modeling their interactions and performing complex reasoning over them. We also release a new benchmark, GARBench, to evaluate models on advanced region-level understanding tasks.
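
To make the region prompt types above concrete, here is a minimal sketch of how boxes and points (or a scribble given as a dense sequence of points) could be turned into a common binary mask representation. This is an illustration only, not GAR's actual preprocessing: the helper names `box_to_mask`, `points_to_mask`, and `mask_area`, the end-exclusive box convention, and the square brush with a `radius` parameter are all assumptions made for this example. See the GitHub repo for the real input pipeline.

```python
from typing import List, Sequence, Tuple

# Row-major binary mask: mask[y][x] is 0 (outside) or 1 (inside the region).
Mask = List[List[int]]


def empty_mask(height: int, width: int) -> Mask:
    """Create an all-zero mask of the given size."""
    return [[0] * width for _ in range(height)]


def box_to_mask(box: Tuple[int, int, int, int], height: int, width: int) -> Mask:
    """Rasterize a box (x0, y0, x1, y1), end-exclusive, clipped to the image.

    Hypothetical helper: the box convention is an assumption of this sketch.
    """
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    mask = empty_mask(height, width)
    for y in range(y0, y1):
        for x in range(x0, x1):
            mask[y][x] = 1
    return mask


def points_to_mask(
    points: Sequence[Tuple[int, int]], height: int, width: int, radius: int = 1
) -> Mask:
    """Mark a filled square of side 2 * radius + 1 around each (x, y) point.

    A scribble can be passed as a dense sequence of points along the stroke.
    Hypothetical helper: the square brush is an assumption of this sketch.
    """
    mask = empty_mask(height, width)
    for px, py in points:
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                mask[y][x] = 1
    return mask


def mask_area(mask: Mask) -> int:
    """Count the pixels that belong to the region."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    box_region = box_to_mask((1, 1, 3, 4), height=5, width=5)
    point_region = points_to_mask([(2, 2)], height=5, width=5, radius=1)
    print(mask_area(box_region))    # 6 pixels: 2 columns x 3 rows
    print(mask_area(point_region))  # 9 pixels: a 3 x 3 square
```

Reducing every prompt type to one mask format is a common design choice because downstream code then handles a single input shape; whether GAR does this internally is not stated here.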


## Usage

For detailed usage of this model, please refer to our [GitHub repo](https://github.com/Haochen-Wang409/Grasp-Any-Region).