MING-ZCH committed on
Commit 88c22a8 · verified · 1 Parent(s): 5299587

Create README.md

Files changed (1)
  1. README.md +91 -0
README.md ADDED
@@ -0,0 +1,91 @@
+ ---
+ license: apache-2.0
+ library_name: transformers
+ tags:
+ - vision-language-model
+ - reinforcement-learning
+ - grpo
+ - metaphor-understanding
+ - visual-reasoning
+ base_model: Qwen/Qwen2.5-VL
+ ---
+
+ # MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual RL
+
+ **MetaphorStar** is the first Multi-modal Large Language Model (MLLM) family trained via an **End-to-End Visual Reinforcement Learning (RL)** framework designed to bridge the gap between literal perception ("seeing things as they are") and metaphorical understanding ("seeing things as we are").
+
+ Built on the Qwen2.5-VL architecture, MetaphorStar achieves state-of-the-art (SOTA) performance on image implication tasks and generalizes to complex visual reasoning benchmarks (e.g., MMMU, MathVerse).
+
+ ## 🌟 Key Highlights
+
+ * **SOTA on Image Implication:** MetaphorStar-32B outperforms GPT-4o, Claude-3.5-Sonnet, and Gemini-2.5-Pro on True-False (TFQ) and Open-Style (OSQ) image implication questions.
+ * **End-to-End Visual RL (TFQ-GRPO):** Uses the **True-False Question (TFQ)** format as a dense reward signal for Group Relative Policy Optimization (GRPO), bypassing the limitations of traditional Supervised Fine-Tuning (SFT).
+ * **Overcoming the "SFT Curse":** Our research finds that SFT warmup creates an "entropy bottleneck" that harms generalization. MetaphorStar is trained with pure RL to maintain high policy entropy, enabling creative and robust reasoning.
+ * **Generalization:** Training on metaphors improves the model's general visual reasoning (e.g., +16.2 points on MMMU for the 32B model compared to its base model).
+
+ ## 🧠 Methodology: TFQ-GRPO
+
+ Current MLLMs struggle with metaphors because they lack the multi-hop reasoning and Theory of Mind (ToM) that metaphors require. We introduce **TFQ-GRPO**, a framework that combines:
+
+ 1. **TFQ-Data:** A fine-grained dataset where each image is associated with multiple True/False propositions, probing both literal content and deep implications.
+ 2. **GRPO (Group Relative Policy Optimization):** An on-policy RL algorithm that optimizes reasoning trajectories based on a combined reward of **Accuracy** (correct T/F judgment) and **Format** (structured thinking process).
+ 3. **Structured Reasoning:** The model is trained to output explicit `<think>...</think>` traces before the final answer, allowing it to "find" the correct reasoning path through exploration.
+
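To make the reward design above more concrete, here is a minimal sketch of how an accuracy-plus-format reward and a group-relative advantage could be computed for one True/False proposition. It is an illustration under stated assumptions, not the released training code: the function names, the reward weights `w_acc` and `w_fmt`, the regular expression, and the use of mean/standard-deviation normalization within a group are all assumptions. Only the `<think>` and `<answer>` tags come from this README.

```python
import re
from statistics import mean, pstdev

# Assumed output layout: a <think>...</think> trace followed by <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^\s*<think>.+?</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def tfq_reward(completion: str, label: bool,
               w_acc: float = 1.0, w_fmt: float = 0.5) -> float:
    """Combine a format reward and an accuracy reward (weights are hypothetical)."""
    match = FORMAT_RE.match(completion)
    fmt = 1.0 if match else 0.0
    acc = 0.0
    if match:
        answer = match.group(1).strip().lower()
        if answer in ("true", "false"):
            acc = 1.0 if (answer == "true") == label else 0.0
    return w_acc * acc + w_fmt * fmt


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of sampled completions for one (image, proposition) pair whose label is True.
group = [
    "<think>The plant is wilted next to a stack of files.</think><answer>True</answer>",
    "<think>The plant looks healthy.</think><answer>False</answer>",
    "True",  # no tags: earns neither the format nor the accuracy reward here
]
rewards = [tfq_reward(c, label=True) for c in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this sketch a completion that is well-formed but wrong still earns the format reward, so it ranks above an untagged completion within the group. Whether the actual training setup scores untagged answers this way is not stated in this README.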
34
+ ## 📊 Performance
35
+
36
+ Evaluation on **TFQ-Bench** and the **High-Level Image Implication Benchmark (EN)**:
37
+
38
+ | Model | TFQ (Acc) | MCQ (Acc) | OSQ (Score 0-5) |
39
+ | :--- | :---: | :---: | :---: |
40
+ | **MetaphorStar-32B** | **74%** | **78%** | **3.94** |
41
+ | **MetaphorStar-7B** | **70%** | **74%** | 3.22 |
42
+ | **MetaphorStar-3B** | 62% | 64% | 3.06 |
43
+ | Gemini-2.5-Pro | 68% | 82% | 3.38 |
44
+ | GPT-4o | 50% | 60% | 2.94 |
45
+ | Claude-3.5-Sonnet | 38% | 68% | 3.22 |
46
+
+ *Note: MetaphorStar-32B achieves SOTA on TFQ and OSQ. On MCQ it outperforms GPT-4o and Claude-3.5-Sonnet, while Gemini-2.5-Pro scores highest (82%).*
+
+ ## 🚀 Quick Start
+
+ ```python
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+ from qwen_vl_utils import process_vision_info
+ import torch
+
+ model_id = "MING-ZCH/MetaphorStar-3B"
+
+ # Load the model in bfloat16 and place it on the available device(s).
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype=torch.bfloat16, device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": "path/to/metaphor_image.jpg"},
+             {"type": "text", "text": "True-false questions: The wilted plant in the office implies a stressful working environment.\n\nFirst, describe the image, then analyze the image implication, and finally reason to get the answer. Output the thinking process in <think></think> and the final correct answer in <answer></answer> tags."}
+         ]
+     }
+ ]
+
+ # Inference setup (standard Qwen2.5-VL generation)
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ # Load the image(s) referenced in `messages`.
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
+ ).to(model.device)
+
+ generated_ids = model.generate(**inputs, max_new_tokens=2048)
+ # Drop the prompt tokens so that only the newly generated text is decoded.
+ generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
+ output_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
+ print(output_text[0])
+ ```
+
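Since the prompt above asks for the reasoning in `<think></think>` and the verdict in `<answer></answer>`, the decoded text usually needs a small post-processing step. The helper below is a hypothetical sketch, not part of the released code: the name `parse_tfq_output` and the assumption that the answer is the literal word "True" or "False" are illustrative only.

```python
import re
from typing import Optional, Tuple


def parse_tfq_output(text: str) -> Tuple[Optional[str], Optional[bool]]:
    """Return (reasoning, verdict); either is None if its tag is missing or unreadable."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else None
    verdict = None
    if answer:
        # Assumption: the model answers with the word "True" or "False".
        token = answer.group(1).strip().lower().rstrip(".")
        verdict = {"true": True, "false": False}.get(token)
    return reasoning, verdict


# A hand-written sample string standing in for a decoded model output.
sample = "<think>The plant droops beside a pile of paperwork.</think>\n<answer>True</answer>"
reasoning, verdict = parse_tfq_output(sample)
print(reasoning)
print(verdict)
```

Returning `None` for a missing or unrecognized answer, instead of guessing, keeps malformed generations visible to the caller so they can be counted or retried.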
+ ## 📜 Citation
+
+ ```bibtex
+ @article{metaphorstar2026,
+   title={MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning},
+   author={Chenhao Zhang and Yazhe Niu and Hongsheng Li},
+   journal={Anonymous},
+   year={2026}
+ }
+ ```