---
base_model: Qwen/Qwen2.5-VL
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
arxiv: 2602.10575
tags:
- vision-language-model
- reinforcement-learning
- grpo
- metaphor-understanding
- visual-reasoning
---

# MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual RL

[**Paper**](https://huggingface.co/papers/2602.10575) | [**Project Page**](https://metaphorstar.github.io) | [**GitHub**](https://github.com/MING-ZCH/MetaphorStar)

**MetaphorStar** is the first family of Multimodal Large Language Models (MLLMs) trained with an **End-to-End Visual Reinforcement Learning (RL)** framework designed to bridge the gap between literal perception ("seeing things as they are") and metaphorical understanding ("seeing things as we are").

Built on the Qwen2.5-VL architecture, MetaphorStar achieves state-of-the-art (SOTA) performance on image implication tasks and generalizes to complex visual reasoning benchmarks such as MMMU and MathVerse.

## Key Highlights

* **SOTA on Image Implication:** MetaphorStar-32B outperforms GPT-4o, Claude-3.5-Sonnet, and Gemini-2.5-Pro on True-False (TFQ) and Open-Style (OSQ) image implication questions.
* **End-to-End Visual RL (TFQ-GRPO):** Uses the **True-False Question (TFQ)** format as a dense reward signal for Group Relative Policy Optimization (GRPO), bypassing the limitations of traditional Supervised Fine-Tuning (SFT).
* **Overcoming the "SFT Curse":** Our research finds that an SFT warmup creates an "entropy bottleneck" that harms generalization. MetaphorStar is trained with pure RL to keep policy entropy high, enabling creative and robust reasoning.
* **Generalization:** Training on metaphors improves general visual reasoning (e.g., +16.2 points on MMMU for the 32B model over its base model).

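To make "policy entropy" concrete, the sketch below computes the Shannon entropy of a next-token distribution from raw logits. It is a plain-Python illustration only: the function name `token_entropy` and the toy logits are assumptions and are not taken from the MetaphorStar training code.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy logits (hypothetical): a peaked policy vs. a flat one over 4 tokens.
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])
flat = token_entropy([0.0, 0.0, 0.0, 0.0])
print(peaked < flat)
```

A policy that has collapsed onto a few tokens has entropy near zero, while a uniform policy over N tokens has entropy ln N; keeping this quantity from collapsing is what leaves room for exploration during RL.
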
## Methodology: TFQ-GRPO

Current MLLMs struggle with metaphors because they lack the multi-hop reasoning and Theory of Mind (ToM) that metaphor understanding requires. We introduce **TFQ-GRPO**, a framework built on three components:

1. **TFQ-Data:** A fine-grained dataset in which each image is paired with multiple True/False propositions that probe both literal content and deeper implications.
2. **GRPO (Group Relative Policy Optimization):** An on-policy RL algorithm that optimizes reasoning trajectories using a combined reward of **Accuracy** (correct T/F judgment) and **Format** (structured thinking process).
3. **Structured Reasoning:** The model is trained to output explicit `<think>...</think>` traces before the final answer, allowing it to discover the correct reasoning path through exploration.

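The combined reward and the group-relative advantage described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the regular expression, the 0.5 format weight, and the `eps` term are all assumptions. Only the `<think>`/`<answer>` tags and the Accuracy plus Format split come from this card; the normalization is the usual GRPO form, (reward - group mean) / group standard deviation.

```python
import re
import statistics

# Assumed output layout: <think>...</think> followed by <answer>...</answer>.
PATTERN = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the tag layout, else 0.0."""
    return 1.0 if PATTERN.match(completion) else 0.0

def accuracy_reward(completion: str, label: bool) -> float:
    """1.0 if the <answer> content matches the True/False label, else 0.0."""
    m = PATTERN.match(completion)
    if m is None:
        return 0.0
    answer = m.group(2).strip().lower()
    if answer not in ("true", "false"):
        return 0.0
    return 1.0 if (answer == "true") == label else 0.0

def total_reward(completion: str, label: bool, format_weight: float = 0.5) -> float:
    # format_weight is a hypothetical value, not taken from the paper.
    return accuracy_reward(completion, label) + format_weight * format_reward(completion)

def group_advantages(rewards, eps: float = 1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical completions sampled for one proposition whose label is True.
group = [
    "<think>The plant is wilted.</think><answer>True</answer>",
    "<think>Looks fine.</think><answer>False</answer>",
    "True",
    "<think>Stress cues.</think><answer>true</answer>",
]
rewards = [total_reward(c, label=True) for c in group]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

Completions that are both well-formed and correct receive positive advantages, while malformed or incorrect ones in the same group receive negative advantages, so the policy is pushed toward the better trajectories within each group.
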
## Performance

Evaluation on **TFQ-Bench** and the **High-Level Image Implication Benchmark (EN)**. TFQ, MCQ, and OSQ denote True-False, Multiple-Choice, and Open-Style questions; the best score in each column is in bold.

| Model | TFQ (Acc) | MCQ (Acc) | OSQ (Score 0-5) |
| :--- | :---: | :---: | :---: |
| **MetaphorStar-32B** | **74%** | 78% | **3.94** |
| **MetaphorStar-7B** | 70% | 74% | 3.22 |
| **MetaphorStar-3B** | 62% | 64% | 3.06 |
| Gemini-2.5-Pro | 68% | **82%** | 3.38 |
| GPT-4o | 50% | 60% | 2.94 |
| Claude-3.5-Sonnet | 38% | 68% | 3.22 |

*Note: MetaphorStar-32B achieves SOTA on TFQ and OSQ. On MCQ it outperforms GPT-4o and Claude-3.5-Sonnet but trails Gemini-2.5-Pro (78% vs. 82%).*

## Quick Start

```python
# Requires: pip install transformers qwen-vl-utils
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "MING-ZCH/MetaphorStar-3B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# The prompt is a True-False proposition followed by the reasoning instruction.
prompt = (
    "True-false questions: The wilted plant in the office implies a stressful working environment.\n\n"
    "First, describe the image, then analyze the image implication, and finally reason to get the answer. "
    "Output the thinking process in <think></think> and the final correct answer in <answer></answer> tags."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/metaphor_image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]

# Inference setup (standard Qwen2.5-VL generation)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens so that only the newly generated text is decoded.
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(trimmed_ids, skip_special_tokens=True)
print(output_text[0])
```

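Because TFQ-Data pairs each image with several True/False propositions, it can be convenient to build one chat request per proposition. The helper below is a hypothetical convenience wrapper: the names `INSTRUCTION` and `build_messages` and the second example proposition are assumptions. It only reuses the prompt wording from the Quick Start and makes no model calls.

```python
# Reasoning instruction copied from the Quick Start prompt.
INSTRUCTION = (
    "First, describe the image, then analyze the image implication, and finally "
    "reason to get the answer. Output the thinking process in <think></think> and "
    "the final correct answer in <answer></answer> tags."
)

def build_messages(image_path: str, proposition: str) -> list:
    """Return a single-turn chat message list for one True/False proposition."""
    text = f"True-false questions: {proposition}\n\n{INSTRUCTION}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": text},
            ],
        }
    ]

# Hypothetical propositions for a single image.
propositions = [
    "The wilted plant in the office implies a stressful working environment.",
    "The image shows a healthy plant in a garden.",
]
batch = [build_messages("path/to/metaphor_image.jpg", p) for p in propositions]
print(len(batch))
```

Each element of `batch` can then be passed through `processor.apply_chat_template` and `process_vision_info` in the same way as `messages` in the Quick Start.
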
## Citation

```bibtex
@article{metaphorstar2026,
  title={MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning},
  author={Chenhao Zhang and Yazhe Niu and Hongsheng Li},
  journal={arXiv preprint arXiv:2602.10575},
  year={2026}
}
```