|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- GRiP-SFT-35K |
|
|
- GRiP-RL-37K |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
|
pipeline_tag: image-text-to-text |
|
|
tags: |
|
|
- visual-grounding |
|
|
- multimodal-reasoning |
|
|
- reinforcement-learning |
|
|
- chain-of-thought |
|
|
--- |
|
|
|
|
|
# GRiP-7B: Guiding the Inner Eye |
|
|
|
|
|
[Arxiv](https://arxiv.org/abs/2511.22172) | [Huggingface](https://huggingface.co/TencentBAC/GRiP) |
|
|
|
|
|
## Overview |
|
|
This repository contains the official model checkpoints of **GRiP (Guided Reasoning and Perception)**, a novel visual grounded reasoning model developed by the Basic Algorithm Center, Platform and Content Group, Tencent. |
|
|
|
|
|
Models capable of "thinking with images" represent a major leap in multimodal AI. **GRiP** is designed to cultivate robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. Initialized from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), GRiP employs a two-stage training framework: |
|
|
1. **Bootstrapping:** Structured instruction tuning to teach the syntax of grounded reasoning. |
|
|
2. **Policy Refinement:** A cognitive-enhanced Reinforcement Learning (RL) stage featuring novel reward mechanisms. |
|
|
|
|
|
GRiP achieves state-of-the-art results among open-source models on challenging benchmarks like **TreeBench**, **V\* Bench**, and **HR-Bench**, demonstrating superior capability in complex visual reasoning. |
|
|
|
|
|
## Methodology |
|
|
|
|
|
The core of GRiP lies in its **Policy Refinement** stage, which addresses the "Coarse Reward Problem" in existing RL methods. We introduce a multi-faceted reward architecture: |
|
|
|
|
|
$$ R_{\text{total}} = R_{\text{acc}} + R_{\text{fmt}} + R_{\text{sw-IoU}} + R_{\text{MHR}} $$ |
|
|
|
|
|
Where: |
|
|
* **Salience-Weighted IoU Reward ($R_{\text{sw-IoU}}$):** Incentivizes the model to prioritize mission-critical objects over trivial distractors. It weights the recall component by an object's salience score $s_k$: |
|
|
$$
R_{\text{recall}} = \frac{1}{\sum_{k=1}^{M} s_k} \sum_{k=1}^{M} s_k \cdot \max_{i} \text{IoU}(p_i, g_k)
$$
|
|
|
|
|
* **Multi-Heuristic Reward ($R_{\text{MHR}}$):** Encourages cognitive flexibility by rewarding diverse valid reasoning pathways (e.g., Bottom-Up, Top-Down, Deductive Verification). The model is rewarded for similarity to the best-matching reference trajectory (both reward terms are sketched in code after this list):
|
|
$$
R_{\text{MHR}} = \max_{j \in \{1,2,3\}} \text{sim}(\tau_{\text{gen}}, \tau_{\text{ref}}^{j})
$$
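
For concreteness, here is a minimal sketch of how these two reward terms could be computed. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` format and leaves trajectory similarity as a pluggable function, since the card does not specify the metric; all names are illustrative, not the released implementation.

```python
from typing import Callable, List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def salience_weighted_recall(preds: List[Box], gts: List[Box],
                             salience: List[float]) -> float:
    """R_recall: each ground-truth object g_k contributes its best IoU
    against any predicted box, weighted by its salience score s_k."""
    if not preds or not gts:
        return 0.0
    num = sum(s_k * max(iou(p, g_k) for p in preds)
              for g_k, s_k in zip(gts, salience))
    return num / (sum(salience) + 1e-8)

def multi_heuristic_reward(gen_traj: str, ref_trajs: List[str],
                           sim: Callable[[str, str], float]) -> float:
    """R_MHR: similarity of the generated trajectory to the
    best-matching reference heuristic (Bottom-Up, Top-Down, ...)."""
    return max(sim(gen_traj, ref) for ref in ref_trajs)
```

Here `sim` stands in for whichever trajectory-similarity measure the paper adopts; the total reward then adds these terms to the accuracy and format rewards, as in the equation above.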
|
|
|
|
|
|
|
|
|
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
|
|
|
## Performance |
|
|
|
|
|
### TreeBench Evaluation |
|
|
TreeBench is a highly challenging benchmark for fine-grained perception and multi-step reasoning. GRiP significantly outperforms its base model and other open-source competitors. |
|
|
|
|
|
| Method | Base Model | Overall | mIoU | Perception | Reasoning |
| :--- | :--- | :--- | :--- | :--- | :--- |
| GPT-4o-1120 | - | 46.9 | - | - | - |
| o3-0416 | - | 54.8 | - | - | - |
| LLaVA-OneVision-72B | LLaMA-3 | 40.5 | - | 62.1 | 53.7 |
| InternVL3-78B | InternViT | 46.4 | - | 62.1 | 61.0 |
| Qwen2.5-VL-7B | Qwen2.5 | 37.0 | - | 55.2 | 39.0 |
| DeepEyes-7B | Qwen2-VL | 37.5 | 30.0 | 62.1 | 36.6 |
| Pixel-Reasoner-7B | Qwen2-VL | 39.0 | 35.7 | 58.6 | 39.0 |
| **GRiP (Ours)** | **Qwen2.5-VL-7B** | **51.3** | **45.0** | **69.1** | **58.7** |
|
|
|
|
|
### Generalization on V* Bench and HR-Bench |
|
|
GRiP demonstrates strong generalization capabilities on attribute recognition, spatial understanding, and high-resolution reasoning. |
|
|
|
|
|
| Method | V* Bench (Overall) | HR-Bench-4K (Overall) | HR-Bench-8K (Overall) |
| :--- | :--- | :--- | :--- |
| GPT-4o-1120 | 66.0 | - | - |
| o3-0416 | 95.7 | - | - |
| Qwen2.5-VL-7B | 74.3 | 72.1 | 68.8 |
| Qwen2.5-VL-72B | 84.8 | 79.4 | 76.3 |
| DeepEyes-7B | 90.0 | 75.1 | 72.6 |
| **GRiP (Ours)** | **91.9** | **78.6** | **75.0** |
|
|
|
|
|
## Training and Inference
|
|
Please refer to our [Huggingface Repository](https://huggingface.co/TencentBAC/GRiP) for the training and inference code.
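
Since GRiP is initialized from Qwen2.5-VL-7B-Instruct, the standard Qwen2.5-VL interface in `transformers` should apply. Below is a minimal inference sketch, assuming the checkpoint id `TencentBAC/GRiP` (as linked above) and `qwen-vl-utils` installed; defer to the official repository scripts for exact usage.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Assumed checkpoint id; see the repository above for the official one.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "TencentBAC/GRiP", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("TencentBAC/GRiP")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/image.jpg"},
        {"type": "text", "text": "Where is the red cup? Ground your answer."},
    ],
}]

# Standard Qwen2.5-VL preprocessing: chat template + vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```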
|
|
|
|
|
### Training Details |
|
|
* **Hardware:** 8 $\times$ NVIDIA H20 (96 GB) GPUs.
|
|
* **Frameworks:** [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for SFT, [EasyR1](https://github.com/hiyouga/EasyR1) for RL training.
|
|
* **Optimization:** AdamW optimizer, GRPO algorithm for Policy Refinement. |
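
For reference, the group-relative advantage at the heart of GRPO can be sketched generically (this is the standard formulation, not the project's training code): rewards from a group of rollouts for the same prompt are standardized against each other, removing the need for a learned value critic.

```python
import numpy as np

def grpo_advantages(group_rewards, eps: float = 1e-8):
    """Standardize each rollout's reward against its group's mean/std;
    GRPO uses these group-relative values as advantages in a
    PPO-style clipped policy loss."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: total rewards R_total for four rollouts of one prompt.
print(grpo_advantages([1.2, 0.4, 0.9, 0.1]))  # higher reward -> positive advantage
```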
|
|
|
|
|
## Acknowledgements |
|
|
Our work is built upon the excellent [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). We also thank the developers of [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [EasyR1](https://github.com/hiyouga/EasyR1) for their efficient training frameworks.
|
|
|
|
|
## Citation |
|
|
If you find our work helpful, please cite: |
|
|
```bibtex
@article{wei2025grip,
  title={Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning},
  author={Wei, Zhaoyang and Ding, Wenchao and Hao, Yanchao and Chen, Xi},
  journal={arXiv preprint arXiv:2511.22172},
  year={2025}
}
```