---
license: fair-noncommercial-research-license
language:
- en
base_model:
- facebook/Perception-LM-8B
library_name: transformers
datasets:
- HaochenWang/Grasp-Any-Region-Dataset
---

# GAR-8B

This repository contains the **GAR-8B** model, as presented in the paper [Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs](https://huggingface.co/papers/2510.18876).

**TL;DR:** Grasp Any Region (GAR) supports (1) describing in detail a single region of an image or video, where the region is specified by points, boxes, scribbles, or masks, and (2) understanding multiple regions jointly, such as modeling their interactions and performing complex reasoning over them. We also release a new benchmark, GARBench, to evaluate models on advanced region-level understanding tasks.

## Usage

For detailed usage of this model, please refer to our [GitHub repo](https://github.com/Haochen-Wang409/Grasp-Any-Region).
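
Since regions can be given as boxes or masks, it may help to see how a box prompt relates to a binary mask. The following is a minimal sketch in plain Python, not code from the GAR repository: the helper name `box_to_mask` and the `(x_min, y_min, x_max, y_max)` convention with exclusive upper bounds are assumptions made for illustration, and the input format the model actually expects is defined in the GitHub repo.

```python
from typing import List, Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max), upper bounds exclusive.
Box = Tuple[int, int, int, int]


def box_to_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Rasterize a box into a height x width binary mask (1 inside, 0 outside)."""
    x_min, y_min, x_max, y_max = box
    # Clamp the box to the image bounds so out-of-range coordinates are tolerated.
    x_min, x_max = max(0, x_min), min(width, x_max)
    y_min, y_max = max(0, y_min), min(height, y_max)
    return [
        [1 if (y_min <= y < y_max and x_min <= x < x_max) else 0 for x in range(width)]
        for y in range(height)
    ]


# A 2x1 box inside a 4x3 image marks two pixels in the middle row.
mask = box_to_mask((1, 1, 3, 2), height=3, width=4)
for row in mask:
    print(row)
```

A list of lists keeps the sketch dependency-free; in practice such a mask would more likely be stored as an array type, depending on what the repository's preprocessing code requires.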