	---
	base_model: Qwen/Qwen2.5-VL
	library_name: transformers
	license: apache-2.0
	pipeline_tag: image-text-to-text
	arxiv: 2602.10575
	tags:
	- vision-language-model
	- reinforcement-learning
	- grpo
	- metaphor-understanding
	- visual-reasoning
	---

	# MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual RL

	[Paper](https://huggingface.co/papers/2602.10575) | [Project Page](https://metaphorstar.github.io) | [GitHub](https://github.com/MING-ZCH/MetaphorStar)

	MetaphorStar is the first Multi-modal Large Language Model (MLLM) family trained via an End-to-End Visual Reinforcement Learning (RL) framework specifically designed to bridge the gap between literal perception ("seeing things as they are") and metaphorical understanding ("seeing things as we are").

	Built upon the Qwen2.5-VL architecture, MetaphorStar achieves State-of-the-Art (SOTA) performance on image implication tasks and demonstrates robust generalization capabilities on complex visual reasoning benchmarks (e.g., MMMU, MathVerse).

	## 🌟 Key Highlights

	* SOTA on Image Implication: MetaphorStar-32B outperforms GPT-4o, Claude-3.5-Sonnet, and Gemini-2.5-Pro on True-False Question (TFQ) and Open-Style Question (OSQ) image implication tasks.
	* End-to-End Visual RL (TFQ-GRPO): Utilizes the True-False Question (TFQ) format as a dense reward signal for Group Relative Policy Optimization (GRPO), bypassing the limitations of traditional Supervised Fine-Tuning (SFT).
	* Overcoming the "SFT Curse": Our research identifies that SFT warmup creates an "entropy bottleneck" that harms generalization. MetaphorStar is trained with pure RL to maintain high policy entropy, enabling creative and robust reasoning.
	* Generalization: Training on metaphors enhances the model's general visual reasoning ability (e.g., +16.2 points on MMMU for the 32B model compared to base).

	## 🧠 Methodology: TFQ-GRPO

	Current MLLMs struggle with metaphors because they lack the sophisticated multi-hop reasoning and Theory of Mind (ToM) required. We introduce TFQ-GRPO, a framework that leverages:

	1. TFQ-Data: A fine-grained dataset where each image is associated with multiple True/False propositions, probing both literal content and deep implications.
	2. GRPO (Group Relative Policy Optimization): An on-policy RL algorithm that optimizes reasoning trajectories based on a combined reward of Accuracy (correct T/F judgment) and Format (structured thinking process).
	3. Structured Reasoning: The model is trained to explicitly output `<think>...</think>` traces before the final answer, allowing it to "find" the correct reasoning path through exploration.
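
The reward and advantage computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the reward weights, the strict `<think>...</think><answer>...</answer>` layout check, and the helper names are all assumptions, and the advantage uses the group mean and standard deviation normalization commonly associated with GRPO.

```python
import re
from statistics import mean, pstdev

# Hypothetical weights: the model card does not say how the two rewards are combined.
ACCURACY_WEIGHT = 1.0
FORMAT_WEIGHT = 0.5

# Assumed layout: a <think> block followed by an <answer> block, nothing else.
RESPONSE_PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed structured layout."""
    return 1.0 if RESPONSE_PATTERN.match(response) else 0.0


def accuracy_reward(response: str, label: bool) -> float:
    """Return 1.0 if the True/False answer inside <answer> matches the label."""
    match = RESPONSE_PATTERN.match(response)
    if match is None:
        return 0.0
    answer = match.group(2).strip().lower()
    if answer not in ("true", "false"):
        return 0.0
    return 1.0 if (answer == "true") == label else 0.0


def combined_reward(response: str, label: bool) -> float:
    """Weighted sum of the accuracy and format rewards."""
    return (
        ACCURACY_WEIGHT * accuracy_reward(response, label)
        + FORMAT_WEIGHT * format_reward(response)
    )


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Given several sampled responses to the same True/False proposition, `combined_reward` scores each one and `group_relative_advantages` turns those scores into per-sample advantages. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.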

	## 📊 Performance

	Evaluation on TFQ-Bench and the High-Level Image Implication Benchmark (EN):

	| Model | TFQ (Acc) | MCQ (Acc) | OSQ (Score 0-5) |
	| :--- | :---: | :---: | :---: |
	| MetaphorStar-32B | 74% | 78% | 3.94 |
	| MetaphorStar-7B | 70% | 74% | 3.22 |
	| MetaphorStar-3B | 62% | 64% | 3.06 |
	| Gemini-2.5-Pro | 68% | 82% | 3.38 |
	| GPT-4o | 50% | 60% | 2.94 |
	| Claude-3.5-Sonnet | 38% | 68% | 3.22 |

	Note: MetaphorStar-32B achieves the best TFQ and OSQ results in this table. On MCQ it outperforms GPT-4o and Claude-3.5-Sonnet but trails Gemini-2.5-Pro (78% vs. 82%).

	## 🚀 Quick Start

	```python
	from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
	from qwen_vl_utils import process_vision_info
	import torch

	model_id = "MING-ZCH/MetaphorStar-3B"

	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    model_id, torch_dtype=torch.bfloat16, device_map="auto"
	)
	processor = AutoProcessor.from_pretrained(model_id)

	# The prompt asks for reasoning in <think> tags and the verdict in <answer> tags.
	prompt = (
	    "True-false questions: The wilted plant in the office implies a stressful working environment.\n\n"
	    "First, describe the image, then analyze the image implication, and finally reason to get the answer. "
	    "Output the thinking process in <think></think> and the final correct answer in <answer></answer> tags."
	)

	messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "image": "path/to/metaphor_image.jpg"},
	            {"type": "text", "text": prompt},
	        ],
	    }
	]

	# Inference setup (standard Qwen2.5-VL generation)
	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	# Load the image(s) referenced in the messages.
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
	).to(model.device)

	generated_ids = model.generate(**inputs, max_new_tokens=2048)
	# Drop the prompt tokens so only the newly generated text is decoded.
	generated_ids_trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
	output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
	print(output_text[0])
	```
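
Because the prompt asks for reasoning in `<think>` tags and the verdict in `<answer>` tags, the decoded text usually needs a small post-processing step. The sketch below is one way to do it, assuming the model follows the requested tag format. The helper names `parse_response` and `to_bool` are hypothetical and not part of the MetaphorStar repository.

```python
import re


def parse_response(text: str) -> dict:
    """Extract the reasoning trace and final answer from a tagged response."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "thinking": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
    }


def to_bool(answer):
    """Map a True/False answer string to a bool, or None if it is unrecognized."""
    if answer is None:
        return None
    normalized = answer.strip().lower().rstrip(".")
    return {"true": True, "false": False}.get(normalized)
```

Calling `parse_response` on the decoded string from the Quick Start code returns a dictionary with `thinking` and `answer` keys. Either value is `None` when the corresponding tag is missing, so malformed generations can be detected rather than silently misread.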

	## 📜 Citation

	```bibtex
	@article{metaphorstar2026,
	  title={MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning},
	  author={Chenhao Zhang and Yazhe Niu and Hongsheng Li},
	  journal={arXiv preprint arXiv:2602.10575},
	  year={2026}
	}
	```