+---
+license: apache-2.0
+library_name: transformers
+tags:
+- vision-language-model
+- reinforcement-learning
+- grpo
+- metaphor-understanding
+- visual-reasoning
+base_model: Qwen/Qwen2.5-VL
+---
+# MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual RL
+**MetaphorStar** is the first Multimodal Large Language Model (MLLM) family trained with an **End-to-End Visual Reinforcement Learning (RL)** framework designed to bridge the gap between literal perception ("seeing things as they are") and metaphorical understanding ("seeing things as we are").
+Built on the Qwen2.5-VL architecture, MetaphorStar achieves state-of-the-art (SOTA) performance on image implication tasks and generalizes to complex visual reasoning benchmarks (e.g., MMMU, MathVerse).
+## 🌟 Key Highlights
+* **SOTA on Image Implication:** MetaphorStar-32B outperforms GPT-4o, Claude-3.5-Sonnet, and Gemini-2.5-Pro on True-False (TFQ) and Open-Style (OSQ) image implication questions.
+* **End-to-End Visual RL (TFQ-GRPO):** Utilizes the **True-False Question (TFQ)** format as a dense reward signal for Group Relative Policy Optimization (GRPO), bypassing the limitations of traditional Supervised Fine-Tuning (SFT).
+* **Overcoming the "SFT Curse":** We find that an SFT warm-up creates an "entropy bottleneck" that harms generalization. MetaphorStar is therefore trained with pure RL, which keeps policy entropy high and enables creative, robust reasoning.
+* **Generalization:** Training on metaphors enhances the model's general visual reasoning ability (e.g., +16.2 points on MMMU for the 32B model compared to base).
+## 🧠 Methodology: TFQ-GRPO
+Current MLLMs struggle with metaphors because they lack the sophisticated multi-hop reasoning and Theory of Mind (ToM) required. We introduce **TFQ-GRPO**, a framework that leverages:
+1.  **TFQ-Data:** A fine-grained dataset where each image is associated with multiple True/False propositions, probing both literal content and deep implications.
+2.  **GRPO (Group Relative Policy Optimization):** An on-policy RL algorithm that optimizes reasoning trajectories based on a combined reward of **Accuracy** (correct T/F judgment) and **Format** (structured thinking process).
+3.  **Structured Reasoning:** The model is trained to explicitly output `<think>...</think>` traces before the final answer, allowing it to "find" the correct reasoning path through exploration.
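The combined reward in item 2 and the group-relative step in GRPO can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the 0.9/0.1 weighting, the exact tag-matching rule, and the `eps` constant are all hypothetical. Only the two reward components (accuracy and format) and the idea of scoring each response relative to its sampled group come from the description above.

```python
import re
from statistics import mean, pstdev

# Assumed format rule: one <think> block followed by one <answer> block.
_FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed <think>/<answer> layout."""
    return 1.0 if _FORMAT_RE.fullmatch(response) else 0.0


def accuracy_reward(response: str, label: bool) -> float:
    """1.0 if the <answer> tag holds the correct True/False judgment."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    pred = match.group(1).strip().lower()
    if pred not in {"true", "false"}:
        return 0.0
    return 1.0 if (pred == "true") == label else 0.0


def combined_reward(response: str, label: bool,
                    w_acc: float = 0.9, w_fmt: float = 0.1) -> float:
    """Weighted sum of accuracy and format rewards (weights are assumed)."""
    return (w_acc * accuracy_reward(response, label)
            + w_fmt * format_reward(response))


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalize each reward by the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hand-written responses to one proposition whose label is True.
group = [
    "<think>The plant looks neglected.</think><answer>True</answer>",
    "<think>The plant is decorative.</think><answer>False</answer>",
    "True",  # no tags: earns neither reward
]
rewards = [combined_reward(r, label=True) for r in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the correct, well-formatted response receives the largest advantage, the well-formatted but wrong one sits in the middle, and the untagged one receives the lowest.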
+## 📊 Performance
+Evaluation on **TFQ-Bench** and the **High-Level Image Implication Benchmark (EN)**:
+| Model | TFQ (Acc) | MCQ (Acc) | OSQ (Score 0-5) |
+| :--- | :---: | :---: | :---: |
+| **MetaphorStar-32B** | **74%** | 78% | **3.94** |
+| **MetaphorStar-7B** | 70% | 74% | 3.22 |
+| **MetaphorStar-3B** | 62% | 64% | 3.06 |
+| Gemini-2.5-Pro | 68% | **82%** | 3.38 |
+| GPT-4o | 50% | 60% | 2.94 |
+| Claude-3.5-Sonnet | 38% | 68% | 3.22 |
+*Note: TFQ = True-False Question, MCQ = Multiple-Choice Question, OSQ = Open-Style Question. Bold scores mark the best result in each column. MetaphorStar-32B achieves SOTA on TFQ and OSQ; on MCQ it outperforms GPT-4o and Claude-3.5-Sonnet but trails Gemini-2.5-Pro (78% vs. 82%).*
+## 🚀 Quick Start
+```python
+from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+from qwen_vl_utils import process_vision_info
+import torch
+model_id = "MING-ZCH/MetaphorStar-3B"
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"
+)
+processor = AutoProcessor.from_pretrained(model_id)
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "path/to/metaphor_image.jpg"},
+            {"type": "text", "text": "True-false questions: The wilted plant in the office implies a stressful working environment.\n\nFirst, describe the image, then analyze the image implication, and finally reason to get the answer. Output the thinking process in <think></think> and the final correct answer in <answer></answer> tags."}
+        ]
+    }
+]
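+# Inference setup (standard Qwen2.5-VL generation)
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+# Load the image(s) referenced in `messages`
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text], images=image_inputs, videos=video_inputs,
+    padding=True, return_tensors="pt",
+).to(model.device)
+generated_ids = model.generate(**inputs, max_new_tokens=2048)
+# Drop the prompt tokens so only the newly generated text is decoded
+generated_ids_trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
+output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
+print(output_text[0])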
+```
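Because the prompt asks for the reasoning in `<think></think>` and the verdict in `<answer></answer>`, the decoded string can be split into those two parts. The helper below is a small sketch: `parse_response` is a hypothetical name, and it assumes the model followed the requested tag format, returning `None` for any part that is missing.

```python
import re


def parse_response(text: str):
    """Split a response into (reasoning, answer); None where a tag is absent."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )


# Hand-written example string, not actual model output.
example = "<think>The wilted plant suggests neglect.</think><answer>True</answer>"
reasoning, verdict = parse_response(example)
print(verdict)
```

With the Quick Start code, the same helper could be applied to `output_text[0]`.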
+## 📜 Citation
+```bibtex
+@article{metaphorstar2026,
+  title={MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning},
+  author={Chenhao Zhang and Yazhe Niu and Hongsheng Li},
+  journal={Anonymous},
+  year={2026}
+}
+```