---
base_model: Qwen/Qwen2.5-VL
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- reinforcement-learning
- grpo
- metaphor-understanding
- visual-reasoning
---

# MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual RL

[[Paper](https://huggingface.co/papers/2602.10575)] [[Project Page](https://metaphorstar.github.io)] [[GitHub](https://github.com/MING-ZCH/MetaphorStar)]

**MetaphorStar** is the first Multi-modal Large Language Model (MLLM) family trained via an **End-to-End Visual Reinforcement Learning (RL)** framework specifically designed to bridge the gap between literal perception ("seeing things as they are") and metaphorical understanding ("seeing things as we are").

Built upon the Qwen2.5-VL architecture, MetaphorStar achieves State-of-the-Art (SOTA) performance on image implication tasks and demonstrates robust generalization capabilities on complex visual reasoning benchmarks (e.g., MMMU, MathVerse).

## 🌟 Key Highlights

* **SOTA on Image Implication:** MetaphorStar-32B outperforms GPT-4o, Claude-3.5-Sonnet, and Gemini-2.5-Pro on True-False (TFQ) and Open-Style (OSQ) image implication questions.
* **End-to-End Visual RL (TFQ-GRPO):** Utilizes the **True-False Question (TFQ)** format as a dense reward signal for Group Relative Policy Optimization (GRPO), bypassing the limitations of traditional Supervised Fine-Tuning (SFT).
* **Overcoming the "SFT Curse":** We find that an SFT warm-up creates an "entropy bottleneck" that harms generalization. MetaphorStar is trained with pure RL to maintain high policy entropy, enabling creative and robust reasoning.
* **Generalization:** Training on metaphors enhances the model's general visual reasoning ability (e.g., +16.2 points on MMMU for the 32B model compared to base).

## 🧠 Methodology: TFQ-GRPO

Current MLLMs struggle with metaphors because they lack the sophisticated multi-hop reasoning and Theory of Mind (ToM) required. We introduce **TFQ-GRPO**, a framework that leverages:

1.  **TFQ-Data:** A fine-grained dataset where each image is associated with multiple True/False propositions, probing both literal content and deep implications.
2.  **GRPO (Group Relative Policy Optimization):** An on-policy RL algorithm that optimizes reasoning trajectories based on a combined reward of **Accuracy** (correct T/F judgment) and **Format** (structured thinking process).
3.  **Structured Reasoning:** The model is trained to explicitly output `<think>...</think>` traces before the final answer, allowing it to "find" the correct reasoning path through exploration.
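
The combined reward and the group-relative step described above can be sketched roughly as follows. This is a minimal illustration, not the released training code: the reward weights, the regular expressions, and all function names are assumptions introduced here, and the advantage is the usual GRPO-style normalization of each reward against the mean and standard deviation of its sampled group.

```python
import re
from statistics import mean, pstdev

# Hypothetical weights: the model card does not state how the two rewards are combined.
ACCURACY_WEIGHT = 1.0
FORMAT_WEIGHT = 0.5

# Assumed output layout: a <think> block followed by an <answer> block, nothing else.
_FORMAT_RE = re.compile(
    r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL
)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed think/answer layout."""
    return 1.0 if _FORMAT_RE.fullmatch(completion) else 0.0


def accuracy_reward(completion: str, label: bool) -> float:
    """Return 1.0 if the text inside <answer> agrees with the True/False label."""
    match = _ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    answer = match.group(1).strip().lower()
    if answer not in ("true", "false"):
        return 0.0
    return 1.0 if (answer == "true") == label else 0.0


def total_reward(completion: str, label: bool) -> float:
    """Weighted sum of the accuracy and format rewards."""
    return (
        ACCURACY_WEIGHT * accuracy_reward(completion, label)
        + FORMAT_WEIGHT * format_reward(completion)
    )


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled completions for one (image, proposition) pair whose label is True.
    group = [
        "<think>The plant is wilted.</think><answer>True</answer>",
        "<think>The plant looks fine.</think><answer>False</answer>",
        "True",
        "<answer>True</answer>",
    ]
    rewards = [total_reward(c, True) for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the advantage is computed relative to the other samples for the same question, a group in which every completion earns the same reward yields zero advantage for all of them.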

## 📊 Performance

Evaluation on **TFQ-Bench** and the **High-Level Image Implication Benchmark (EN)**:

| Model | TFQ (Acc) | MCQ (Acc) | OSQ (Score 0-5) |
| :--- | :---: | :---: | :---: |
| **MetaphorStar-32B** | **74%** | **78%** | **3.94** |
| **MetaphorStar-7B** | **70%** | **74%** | 3.22 |
| **MetaphorStar-3B** | 62% | 64% | 3.06 |
| Gemini-2.5-Pro | 68% | 82% | 3.38 |
| GPT-4o | 50% | 60% | 2.94 |
| Claude-3.5-Sonnet | 38% | 68% | 3.22 |

*Note: MetaphorStar-32B achieves SOTA on TFQ and OSQ. On MCQ it outperforms GPT-4o and Claude-3.5-Sonnet, while Gemini-2.5-Pro scores highest (82% vs. 78%).*

## 🚀 Quick Start

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info
import torch

model_id = "MING-ZCH/MetaphorStar-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/metaphor_image.jpg"},
            {"type": "text", "text": (
                "True-false questions: The wilted plant in the office implies a stressful working environment.\n\n"
                "First, describe the image, then analyze the image implication, and finally reason to get the answer. "
                "Output the thinking process in <think></think> and the final correct answer in <answer></answer> tags."
            )},
        ]
    }
]

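# Inference setup (standard Qwen2.5-VL generation)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Load the image(s) referenced in `messages`
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens so that only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
print(output_text[0])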
```
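
The prompt asks for the reasoning in `<think></think>` and the verdict in `<answer></answer>`, so the decoded text usually needs a small parsing step. The helper below is one possible sketch, not part of the released code; `parse_response` and `to_bool` are hypothetical names, and the accepted spellings of the verdict are an assumption. It takes the last occurrence of each tag, which helps when the decoded string still contains the prompt, since the prompt itself mentions both tags.

```python
import re
from typing import Optional, Tuple


def parse_response(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (thinking, answer) from the last <think> and <answer> blocks, or None."""
    thinks = re.findall(r"<think>(.*?)</think>", text, re.DOTALL)
    answers = re.findall(r"<answer>(.*?)</answer>", text, re.DOTALL)
    thinking = thinks[-1].strip() if thinks else None
    answer = answers[-1].strip() if answers else None
    return thinking, answer


def to_bool(answer: Optional[str]) -> Optional[bool]:
    """Map a True/False style answer to a bool; None if it cannot be interpreted."""
    if answer is None:
        return None
    normalized = answer.strip().lower().rstrip(".")
    if normalized in {"true", "yes"}:
        return True
    if normalized in {"false", "no"}:
        return False
    return None


if __name__ == "__main__":
    sample = (
        "Output the thinking process in <think></think> and the final "
        "correct answer in <answer></answer> tags.\n"
        "<think>The plant is wilted and the desk is cluttered.</think>\n"
        "<answer>True</answer>"
    )
    thinking, answer = parse_response(sample)
    print(thinking)
    print(to_bool(answer))
```

With the Quick Start snippet, this would be called as `parse_response(output_text[0])`.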

## 📜 Citation

```bibtex
@article{metaphorstar2026,
  title={MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning},
  author={Zhang, Chenhao and Niu, Yazhe and Li, Hongsheng},
  journal={arXiv preprint arXiv:2602.10575},
  year={2026}
}
```