---
base_model: Qwen/Qwen2.5-VL
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
arxiv: 2602.10575
tags:
- vision-language-model
- reinforcement-learning
- grpo
- metaphor-understanding
- visual-reasoning
---
# MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual RL
[**Paper**](https://huggingface.co/papers/2602.10575) | [**Project Page**](https://metaphorstar.github.io) | [**GitHub**](https://github.com/MING-ZCH/MetaphorStar)
**MetaphorStar** is the first Multi-modal Large Language Model (MLLM) family trained via an **End-to-End Visual Reinforcement Learning (RL)** framework specifically designed to bridge the gap between literal perception ("seeing things as they are") and metaphorical understanding ("seeing things as we are").
Built upon the Qwen2.5-VL architecture, MetaphorStar achieves State-of-the-Art (SOTA) performance on image implication tasks and demonstrates robust generalization capabilities on complex visual reasoning benchmarks (e.g., MMMU, MathVerse).
## 🌟 Key Highlights
* **SOTA on Image Implication:** Significantly outperforms GPT-4o, Claude-3.5-Sonnet, and Gemini-2.5-Pro on True-False and Open-Style image implication questions.
* **End-to-End Visual RL (TFQ-GRPO):** Utilizes the **True-False Question (TFQ)** format as a dense reward signal for Group Relative Policy Optimization (GRPO), bypassing the limitations of traditional Supervised Fine-Tuning (SFT).
* **Overcoming the "SFT Curse":** Our research identifies that SFT warmup creates an "entropy bottleneck" that harms generalization. MetaphorStar is trained with pure RL to maintain high policy entropy, enabling creative and robust reasoning.
* **Generalization:** Training on metaphors enhances the model's general visual reasoning ability (e.g., +16.2 points on MMMU for the 32B model compared to base).
## 🧠 Methodology: TFQ-GRPO
Current MLLMs struggle with metaphors because they lack the sophisticated multi-hop reasoning and Theory of Mind (ToM) required. We introduce **TFQ-GRPO**, a framework that leverages:
1. **TFQ-Data:** A fine-grained dataset where each image is associated with multiple True/False propositions, probing both literal content and deep implications.
2. **GRPO (Group Relative Policy Optimization):** An on-policy RL algorithm that optimizes reasoning trajectories based on a combined reward of **Accuracy** (correct T/F judgment) and **Format** (structured thinking process); see the reward sketch after this list.
3. **Structured Reasoning:** The model is trained to explicitly output `<think>...</think>` traces before the final answer, allowing it to "find" the correct reasoning path through exploration.
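To make the reward design concrete, here is a minimal sketch of a TFQ-GRPO-style rollout reward and the standard GRPO group normalization. The tag-matching regexes, the equal weighting of the two reward terms, and the function names are illustrative assumptions, not the paper's exact implementation:

```python
import re

def tfq_reward(completion: str, label: str) -> float:
    """Illustrative per-rollout reward (equal weights are an assumption)."""
    # Format reward: reasoning must be wrapped in <think> tags and the
    # final judgment in <answer> tags, as described above.
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    format_reward = 1.0 if re.fullmatch(pattern, completion, re.DOTALL) else 0.0

    # Accuracy reward: the extracted answer must match the True/False label.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = m.group(1).strip().lower() if m else ""
    accuracy_reward = 1.0 if answer == label.strip().lower() else 0.0

    return accuracy_reward + format_reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO: advantages are the rewards of one rollout group,
    # normalized by the group's mean and standard deviation.
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (std + 1e-6) for r in rewards]
```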
## πŸ“Š Performance
Evaluation on **TFQ-Bench** and the **High-Level Image Implication Benchmark (EN)**, covering True-False Questions (TFQ), Multiple-Choice Questions (MCQ), and Open-Style Questions (OSQ):
| Model | TFQ (Acc) | MCQ (Acc) | OSQ (Score 0-5) |
| :--- | :---: | :---: | :---: |
| **MetaphorStar-32B** | **74%** | 78% | **3.94** |
| **MetaphorStar-7B** | 70% | 74% | 3.22 |
| **MetaphorStar-3B** | 62% | 64% | 3.06 |
| Gemini-2.5-Pro | 68% | **82%** | 3.38 |
| GPT-4o | 50% | 60% | 2.94 |
| Claude-3.5-Sonnet | 38% | 68% | 3.22 |
*Note: Best result per column in bold. MetaphorStar-32B achieves SOTA on TFQ and OSQ; on MCQ it outperforms GPT-4o and Claude-3.5-Sonnet, while Gemini-2.5-Pro remains ahead.*
## πŸš€ Quick Start
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info
import torch

model_id = "MING-ZCH/MetaphorStar-3B"

# Load the model in bfloat16 and shard it across available devices.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# TFQ-style prompt with the structured-reasoning instructions the model expects.
prompt = (
    "True-false questions: The wilted plant in the office implies a stressful "
    "working environment. First, describe the image, then analyze the image "
    "implication, and finally reason to get the answer. Output the thinking "
    "process in <think></think> and the final correct answer in "
    "<answer></answer> tags."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/metaphor_image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]

# Inference setup (standard Qwen2.5-VL generation)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
# Trim the prompt tokens so only the newly generated response is decoded.
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(output_text[0])
```
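The response follows the trained `<think>...</think><answer>...</answer>` format, so the final True/False judgment can be pulled out with a small post-processing step (a sketch, assuming the tags appear as prompted):

```python
import re

# Extract the content of the <answer> tag from the decoded response.
match = re.search(r"<answer>(.*?)</answer>", output_text[0], re.DOTALL)
final_answer = match.group(1).strip() if match else output_text[0]
print(final_answer)
```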
## πŸ“œ Citation
```bibtex
@article{metaphorstar2026,
title={MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning},
author={Zhang, Chenhao and Niu, Yazhe and Li, Hongsheng},
journal={arXiv preprint arXiv:2602.10575},
year={2026}
}
```