$$ R_{\text{total}} = R_{\text{acc}} + R_{\text{fmt}} + R_{\text{sw-IoU}} + R_{\text{MHR}} $$

Where:

* **Salience-Weighted IoU Reward ($R_{\text{sw-IoU}}$):** Incentivizes the model to prioritize mission-critical objects over trivial distractors. It weights the recall component by an object's salience score $s_k$:

    $$
    R_{\text{recall}} = \frac{1}{\sum s_k} \sum_{k=1}^{M} s_k \cdot \max_{i} \text{IoU}(p_i, g_k)
    $$
* **Multi-Heuristic Reward ($R_{\text{MHR}}$):** Encourages cognitive flexibility by rewarding diverse valid reasoning pathways (e.g., Bottom-Up, Top-Down, Deductive Verification). The model is rewarded based on similarity to the best-matching reference trajectory:

    $$
    R_{\text{MHR}} = \max_{j \in \{1,2,3\}} \text{sim}(\tau_{\text{gen}}, \tau_{\text{ref}}^j)
    $$
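The two refinement rewards above can be sketched in a few lines of Python. This is an illustrative sketch only: the `[x1, y1, x2, y2]` box format, the `iou` helper, and the token-level Jaccard overlap standing in for $\text{sim}(\cdot,\cdot)$ are assumptions, not the repository's actual implementation.

```python
def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes (format assumed for illustration)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def salience_weighted_recall(preds, gts, salience):
    """R_recall: each ground-truth box g_k is matched to its best-overlapping
    prediction p_i, and that match is weighted by the salience score s_k."""
    weighted = sum(
        s_k * max((iou(p, g_k) for p in preds), default=0.0)
        for g_k, s_k in zip(gts, salience)
    )
    return weighted / sum(salience)

def multi_heuristic_reward(tau_gen, tau_refs):
    """R_MHR: score the generated trace against each reference trajectory
    (e.g., Bottom-Up, Top-Down, Deductive Verification) and keep the best
    match. Token-level Jaccard overlap is a stand-in for sim(., .)."""
    def sim(a, b):
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    return max(sim(tau_gen, ref) for ref in tau_refs)
```

Note how the salience weighting changes the incentive: with one salient object ($s=3$) detected perfectly and one trivial object ($s=1$) missed entirely, $R_{\text{recall}} = 3/4$ rather than the unweighted $1/2$.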
## Performance