ZhaoyangWei committed · Commit c828e21 · verified · 1 Parent(s): 72cb807

Create README.md

Files changed (1): README.md (+97, -0)
README.md ADDED
---
license: apache-2.0
datasets:
- GRiP-SFT-35K
- GRiP-RL-37K
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
tags:
- visual-grounding
- multimodal-reasoning
- reinforcement-learning
- chain-of-thought
---

# GRiP-7B: Guiding the Inner Eye

[arXiv](https://arxiv.org/abs/2511.22172) | [Hugging Face](https://huggingface.co/TencentBAC/GRiP)

## Overview
This repository contains the official model checkpoints of **GRiP (Guided Reasoning and Perception)**, a novel visual grounded reasoning model developed by the Basic Algorithm Center, Platform and Content Group, Tencent.

Models capable of "thinking with images" represent a major leap in multimodal AI. **GRiP** is designed to cultivate robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. Initialized from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), GRiP employs a two-stage training framework:
1. **Bootstrapping:** Structured instruction tuning to teach the syntax of grounded reasoning.
2. **Policy Refinement:** A cognitive-enhanced Reinforcement Learning (RL) stage featuring novel reward mechanisms.

GRiP achieves state-of-the-art results among open-source models on challenging benchmarks such as **TreeBench**, **V\* Bench**, and **HR-Bench**, demonstrating superior capability in complex visual reasoning.

## Methodology

The core of GRiP lies in its **Policy Refinement** stage, which addresses the "Coarse Reward Problem" in existing RL methods. We introduce a multi-faceted reward architecture:

$$ R_{\text{total}} = R_{\text{acc}} + R_{\text{fmt}} + R_{\text{sw-IoU}} + R_{\text{MHR}} $$

Where:
* **Salience-Weighted IoU Reward ($R_{\text{sw-IoU}}$):** Incentivizes the model to prioritize mission-critical objects over trivial distractors. It weights the recall component by an object's salience score $s_k$:
$$ R_{\text{recall}} = \frac{1}{\sum s_k} \sum_{k=1}^{M} s_k \cdot \max_{i} \text{IoU}(p_i, g_k) $$
* **Multi-Heuristic Reward ($R_{\text{MHR}}$):** Encourages cognitive flexibility by rewarding diverse valid reasoning pathways (e.g., Bottom-Up, Top-Down, Deductive Verification). The model is rewarded based on similarity to the best-matching reference trajectory:
$$ R_{\text{MHR}} = \max_{j \in \{1,2,3\}} \text{sim}(\tau_{\text{gen}}, \tau_{\text{ref}}^j) $$
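
The reward code itself is not part of this card. Purely as an illustration of the two terms above, the sketch below computes the salience-weighted recall and the multi-heuristic reward for toy inputs; the box format, the trajectory-similarity function `sim`, and all helper names are our assumptions, not the paper's implementation.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def salience_weighted_recall(pred_boxes, gt_boxes, salience):
    """Recall term of R_sw-IoU: each ground-truth box g_k contributes the IoU of
    its best-matching prediction p_i, weighted by its salience score s_k."""
    weighted = sum(
        s_k * max((iou(p, g) for p in pred_boxes), default=0.0)
        for g, s_k in zip(gt_boxes, salience)
    )
    return weighted / (sum(salience) + 1e-8)

def multi_heuristic_reward(gen_traj, ref_trajs, sim):
    """R_MHR: similarity of the generated reasoning trajectory to the
    best-matching reference trajectory (e.g. Bottom-Up, Top-Down, Deductive)."""
    return max(sim(gen_traj, ref) for ref in ref_trajs)

def total_reward(r_acc, r_fmt, r_sw_iou, r_mhr):
    """R_total = R_acc + R_fmt + R_sw-IoU + R_MHR."""
    return r_acc + r_fmt + r_sw_iou + r_mhr
```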

![Methodology](https://cdn-uploads.huggingface.co/production/uploads/66daf60cbb6e7331f46ea070/uhChByMJIAHaSC6HeeYjy.png)

## Performance

### TreeBench Evaluation
TreeBench is a highly challenging benchmark for fine-grained perception and multi-step reasoning. GRiP significantly outperforms its base model and other open-source competitors.

| Method | Base Model | Overall | mIoU | Perception | Reasoning |
| :--- | :--- | :--- | :--- | :--- | :--- |
| GPT-4o-1120 | - | 46.9 | - | - | - |
| o3-0416 | - | 54.8 | - | - | - |
| LLaVA-OneVision-72B | LLaMA-3 | 40.5 | - | 62.1 | 53.7 |
| InternVL3-78B | InternViT | 46.4 | - | 62.1 | 61.0 |
| Qwen2.5-VL-7B | Qwen2.5 | 37.0 | - | 55.2 | 39.0 |
| DeepEyes-7B | Qwen2-VL | 37.5 | 30.0 | 62.1 | 36.6 |
| Pixel-Reasoner-7B | Qwen2-VL | 39.0 | 35.7 | 58.6 | 39.0 |
| **GRiP (Ours)** | **Qwen2.5-VL-7B** | **51.3** | **45.0** | **69.1** | **58.7** |

### Generalization on V* Bench and HR-Bench
GRiP demonstrates strong generalization capabilities on attribute recognition, spatial understanding, and high-resolution reasoning.

| Method | V* Bench (Overall) | HR-Bench-4K (Overall) | HR-Bench-8K (Overall) |
| :--- | :--- | :--- | :--- |
| GPT-4o-1120 | 66.0 | - | - |
| o3-0416 | 95.7 | - | - |
| Qwen2.5-VL-7B | 74.3 | 72.1 | 68.8 |
| Qwen2.5-VL-72B | 84.8 | 79.4 | 76.3 |
| DeepEyes-7B | 90.0 | 75.1 | 72.6 |
| **GRiP (Ours)** | **91.9** | **78.6** | **75.0** |

## Train and Inference
Please refer to our [Hugging Face repository](https://huggingface.co/TencentBAC/GRiP) for the training and inference code.
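
For a quick start, the unofficial sketch below loads the checkpoint through the standard Qwen2.5-VL interface in `transformers` (plus the `qwen-vl-utils` helper package), on the assumption that GRiP keeps the base model's architecture; the repository id, image, and question are placeholders, and the official scripts in the repository above remain the reference usage.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Assumption: the released checkpoint follows the Qwen2.5-VL-7B-Instruct layout.
model_id = "TencentBAC/GRiP"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A single image-question turn; the image path and question are examples only.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/your_image.jpg"},
        {"type": "text", "text": "What is written on the small sign near the entrance?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate the grounded reasoning trace and final answer.
output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```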

### Training Details
* **Hardware:** 8 $\times$ NVIDIA H20 (96GB) GPUs.
* **Frameworks:** [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for SFT, [EasyR1](https://github.com/hiyouga/EasyR1) for RL training.
* **Optimization:** AdamW optimizer, GRPO algorithm for Policy Refinement (see the sketch after this list).
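
As context for the last bullet, GRPO scores each prompt with a group of rollouts and turns their rewards (such as $R_{\text{total}}$ above) into group-normalized advantages. The snippet below is a generic, framework-agnostic illustration of that advantage computation, not the EasyR1 implementation.

```python
from typing import List

def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages used by GRPO: for rollouts sampled from the
    same prompt, each advantage is (reward - group mean) / group std."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: total rewards R_total for four rollouts of one prompt.
print(grpo_advantages([1.8, 0.9, 1.2, 0.3]))
```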

## Acknowledgements
Our work is built upon the excellent [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). We also thank the developers of [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [EasyR1](https://github.com/hiyouga/EasyR1) for their efficient training frameworks.

## Citation
If you find our work helpful, please cite:
```bibtex
@article{wei2025grip,
  title={Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning},
  author={Wei, Zhaoyang and Ding, Wenchao and Hao, Yanchao and Chen, Xi},
  journal={arXiv preprint arXiv:2511.22172},
  year={2025}
}
```