---
license: apache-2.0
datasets:
- GRiP-SFT-35K
- GRiP-RL-37K
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
tags:
- visual-grounding
- multimodal-reasoning
- reinforcement-learning
- chain-of-thought
---

# GRiP-7B: Guiding the Inner Eye

[Arxiv](https://arxiv.org/abs/2511.22172) | [Huggingface](https://huggingface.co/TencentBAC/GRiP)

## Overview
This repository contains the official model checkpoints of **GRiP (Guided Reasoning and Perception)**, a visual grounded reasoning model developed by the Basic Algorithm Center, Platform and Content Group, Tencent.

Models capable of "thinking with images" represent a major leap in multimodal AI. **GRiP** is designed to cultivate robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. Initialized from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), GRiP employs a two-stage training framework:
1. **Bootstrapping:** Structured instruction tuning to teach the syntax of grounded reasoning.
2. **Policy Refinement:** A cognition-enhanced reinforcement learning (RL) stage featuring novel reward mechanisms.

GRiP achieves state-of-the-art results among open-source models on challenging benchmarks such as **TreeBench**, **V\* Bench**, and **HR-Bench**, demonstrating superior capability in complex visual reasoning.

## Methodology

The core of GRiP lies in its **Policy Refinement** stage, which addresses the "Coarse Reward Problem" in existing RL methods. We introduce a multi-faceted reward architecture:

$$ R_{\text{total}} = R_{\text{acc}} + R_{\text{fmt}} + R_{\text{sw-IoU}} + R_{\text{MHR}} $$

Where:
* **Salience-Weighted IoU Reward ($R_{\text{sw-IoU}}$):** Incentivizes the model to prioritize mission-critical objects over trivial distractors. It weights the recall component by an object's salience score $s_k$ (see the sketch after this list):
  $$ R_{\text{recall}} = \frac{1}{\sum_{k=1}^{M} s_k} \sum_{k=1}^{M} s_k \cdot \max_{i} \text{IoU}(p_i, g_k) $$
* **Multi-Heuristic Reward ($R_{\text{MHR}}$):** Encourages cognitive flexibility by rewarding diverse valid reasoning pathways (e.g., Bottom-Up, Top-Down, Deductive Verification). The model is rewarded based on similarity to the best-matching reference trajectory:
  $$ R_{\text{MHR}} = \max_{j \in \{1,2,3\}} \text{sim}(\tau_{\text{gen}}, \tau_{\text{ref}}^{j}) $$
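
To make these two rewards concrete, the snippet below sketches the salience-weighted recall term and $R_{\text{MHR}}$ in plain Python. It is an illustration, not our released training code: the (x1, y1, x2, y2) box format, the toy salience scores, and the pluggable `sim` function are assumptions for the example.

```python
# Illustrative sketch of the salience-weighted recall and the multi-heuristic
# reward (assumptions: boxes are (x1, y1, x2, y2); `sim` is supplied by the
# caller). Not the released training code.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def salience_weighted_recall(pred_boxes, gt_boxes, salience):
    """R_recall: each ground-truth box g_k contributes its best IoU over the
    predictions p_i, weighted by its salience score s_k."""
    weighted = sum(
        s * max((iou(p, g) for p in pred_boxes), default=0.0)
        for g, s in zip(gt_boxes, salience)
    )
    return weighted / sum(salience)

def multi_heuristic_reward(gen_traj, ref_trajs, sim):
    """R_MHR: similarity of the generated trajectory to the best-matching
    reference reasoning trajectory."""
    return max(sim(gen_traj, ref) for ref in ref_trajs)

# Toy example: the second ground-truth object is a low-salience distractor,
# so failing to ground it costs little reward.
preds = [(10, 10, 50, 50)]
gts = [(12, 12, 48, 48), (200, 200, 240, 240)]
print(salience_weighted_recall(preds, gts, salience=[1.0, 0.2]))  # 0.675
```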
## Performance

### TreeBench Evaluation
TreeBench is a highly challenging benchmark for fine-grained perception and multi-step reasoning. GRiP significantly outperforms its base model and other open-source competitors.

| Method | Base Model | Overall | mIoU | Perception | Reasoning |
| :--- | :--- | :--- | :--- | :--- | :--- |
| GPT-4o-1120 | - | 46.9 | - | - | - |
| o3-0416 | - | 54.8 | - | - | - |
| LLaVA-OneVision-72B | LLaMA-3 | 40.5 | - | 62.1 | 53.7 |
| InternVL3-78B | InternViT | 46.4 | - | 62.1 | 61.0 |
| Qwen2.5-VL-7B | Qwen2.5 | 37.0 | - | 55.2 | 39.0 |
| DeepEyes-7B | Qwen2-VL | 37.5 | 30.0 | 62.1 | 36.6 |
| Pixel-Reasoner-7B | Qwen2-VL | 39.0 | 35.7 | 58.6 | 39.0 |
| **GRiP (Ours)** | **Qwen2.5-VL-7B** | **51.3** | **45.0** | **69.1** | **58.7** |

### Generalization on V* Bench and HR-Bench
GRiP demonstrates strong generalization capabilities on attribute recognition, spatial understanding, and high-resolution reasoning.

| Method | V* Bench (Overall) | HR-Bench-4K (Overall) | HR-Bench-8K (Overall) |
| :--- | :--- | :--- | :--- |
| GPT-4o-1120 | 66.0 | - | - |
| o3-0416 | 95.7 | - | - |
| Qwen2.5-VL-7B | 74.3 | 72.1 | 68.8 |
| Qwen2.5-VL-72B | 84.8 | 79.4 | 76.3 |
| DeepEyes-7B | 90.0 | 75.1 | 72.6 |
| **GRiP (Ours)** | **91.9** | **78.6** | **75.0** |

## Train and Inference
Please refer to our [Huggingface Repository](https://huggingface.co/TencentBAC/GRiP) for training and inference code.
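
For a quick try, the checkpoint should load with the standard Qwen2.5-VL classes in `transformers`, since GRiP is initialized from Qwen2.5-VL-7B-Instruct. The snippet below is a minimal sketch under that assumption; the image path and question are placeholders, and the official repository remains the reference for exact usage.

```python
# Minimal inference sketch (assumes the GRiP checkpoint keeps the stock
# Qwen2.5-VL architecture and processor config).
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "TencentBAC/GRiP"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the object on the small table, and where is it?"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# The grounded chain of thought (including any predicted boxes) is decoded
# together with the final answer.
generated = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```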

### Training Details
* **Hardware:** 8 × NVIDIA H20 (96GB) GPUs.
* **Frameworks:** [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for SFT, [EasyR1](https://github.com/hiyouga/EasyR1) for RL training.
* **Optimization:** AdamW optimizer, GRPO algorithm for Policy Refinement (see the sketch after this list).
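
Since this card only names GRPO, here is an illustrative sketch of the group-relative advantage computation at its core, not our training code; the group size and reward values are made up for the example.

```python
# Illustrative GRPO-style advantage (assumption: standard group-relative
# normalization; not the released training code).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), total rewards R_total for G rollouts of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 rollouts scored by R_acc + R_fmt + R_sw-IoU + R_MHR.
rewards = torch.tensor([1.8, 0.9, 2.4, 1.1, 0.2, 2.0, 1.5, 0.7])
print(group_relative_advantages(rewards))
```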

## Acknowledgements
Our work is built upon the excellent [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). We also thank the developers of [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [EasyR1](https://github.com/hiyouga/EasyR1) for their efficient training frameworks.

## Citation
If you find our work helpful, please cite:
```bibtex
@article{wei2025grip,
  title={Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning},
  author={Wei, Zhaoyang and Ding, Wenchao and Hao, Yanchao and Chen, Xi},
  journal={arXiv preprint arXiv:2511.22172},
  year={2025}
}
```