Enhance model card with paper details, framework, results, and usage

#1
by nielsr HF Staff - opened
---
license: mit
pipeline_tag: image-to-text
library_name: transformers
---

# Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

This repository contains the official models and resources for the paper "[Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback](https://huggingface.co/papers/2507.20766)".

**Official GitHub Repository**: [https://github.com/syficy/RRVF](https://github.com/syficy/RRVF)

## Abstract

Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework, "Reasoning-Rendering-Visual-Feedback" (RRVF), that enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the "Asymmetry of Verification" principle, i.e., verifying the rendered output against the source image is substantially easier than performing deep visual reasoning to generate a faithful, structured representation such as code. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL), thereby reducing reliance on image-text supervision. RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform complex reasoning, including self-correction through multi-turn interactions. This process is optimized end-to-end using the GRPO algorithm. Extensive evaluations are conducted on image-to-code generation across two diverse domains: data charts and web interfaces. The RRVF-trained model not only outperforms existing similarly sized open-source MLLMs and supervised fine-tuning baselines but also exhibits superior generalization. Notably, the model outperforms the more advanced MLLM used to generate visual feedback during training.

## Framework Overview

RRVF is a training framework that enhances the visual reasoning capabilities of MLLMs using **purely visual signals**. Its core is a closed-loop system comprising three key components: an iterative visual reasoner, a visual feedback mechanism, and a final visual judge.

<div align="center">
<img src="https://github.com/syficy/RRVF/raw/main/assets/overview.png" width="900">
<p>Figure: The RRVF framework.</p>
</div>

### Iterative Visual Reasoning

The reasoning process is iterative: the model receives an image and produces a response containing its internal thoughts (in `<think>` tags) and a specific action (in a `<tool_call>` tag). After the tool executes the code, the visual feedback is appended to the conversation history to inform the model's next turn, until the model generates a final solution (in an `<answer>` tag).

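The tagged response format described above can be parsed with a few lines of Python. The sketch below is illustrative, not code from the RRVF repository; only the tag names (`<think>`, `<tool_call>`, `<answer>`) come from this card, while the function name and regexes are assumptions:

```python
import re

def parse_turn(response: str) -> dict:
    """Extract the tagged segments from a single model turn.

    Returns whichever of <think>, <tool_call>, <answer> are present;
    a turn without <answer> means the interaction loop continues
    with rendered visual feedback appended to the history.
    """
    segments = {}
    for tag in ("think", "tool_call", "answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        if match:
            segments[tag] = match.group(1).strip()
    return segments

# An intermediate turn: a tool call but no final answer yet.
turn = parse_turn(
    "<think>the bar colors look wrong</think>\n"
    "<tool_call>render_chart(code)</tool_call>"
)
```

Since `"answer" not in turn` here, a driver loop would execute the tool call, append the feedback, and query the model again.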
### Visual Feedback Mechanism

This mechanism is the key to guiding the model's learning:
1. **Rendering**: The model-generated code is executed by a domain-specific engine (e.g., Matplotlib for charts, Playwright for web pages) to render an image.
2. **Comparison & Feedback**: A more powerful "teacher" MLLM compares the rendered image to the original and articulates the visual discrepancies (e.g., color, layout, missing elements) in natural language. This descriptive feedback gives the model actionable guidance.

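A minimal sketch of how the feedback request to the teacher MLLM might be composed. The function name and prompt wording are assumptions; only the renderer choices (Matplotlib for charts, Playwright for web pages) and the idea of natural-language discrepancy feedback come from the description above:

```python
# Domain -> rendering engine, as described in the card.
RENDERERS = {"chart": "Matplotlib", "web": "a Playwright browser screenshot"}

def build_feedback_request(domain: str) -> str:
    """Compose the instruction sent to the teacher MLLM, alongside the
    reference image and the rendered image (hypothetical helper)."""
    return (
        f"The second image was rendered from model-generated code using "
        f"{RENDERERS[domain]}. Compare it with the first (reference) image "
        "and describe every visual discrepancy -- colors, layout, missing "
        "or extra elements -- in natural language."
    )
```

In the actual framework this text would accompany both images in a multimodal query to the teacher model, whose answer is appended to the trainee's conversation history.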
### Reinforcement Learning Optimization

The entire closed-loop process is formulated as a reinforcement learning task and optimized with the **GRPO** algorithm. We designed a hybrid reward function to guide the learning:
- **Visual Similarity Reward (R_vision)**: Provided by the visual judge, this quantifies the fidelity between the final rendered image and the original input.
- **Format Correctness Reward (R_format)**: Penalizes improper output formatting and non-executable code.
- **Tool-Use Reward (R_tool)**: Encourages exploration and iterative refinement by rewarding successful tool calls.

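The hybrid reward can be sketched as a weighted sum of the three components. The weights, the tool-call cap, and the exact shaping below are illustrative assumptions; only the three components themselves come from the card:

```python
def hybrid_reward(
    r_vision: float,      # visual-judge similarity score in [0, 1]
    format_ok: bool,      # output followed the tagged format
    code_ran: bool,       # generated code executed without error
    num_tool_calls: int,  # successful tool invocations this episode
    w_vision: float = 0.8,  # weights are illustrative, not from the paper
    w_format: float = 0.1,
    w_tool: float = 0.1,
    tool_cap: int = 3,      # cap keeps the tool bonus from dominating
) -> float:
    """Combine R_vision, R_format, and R_tool into a scalar reward."""
    r_format = 1.0 if (format_ok and code_ran) else 0.0
    r_tool = min(num_tool_calls, tool_cap) / tool_cap
    return w_vision * r_vision + w_format * r_format + w_tool * r_tool
```

GRPO then uses this scalar, computed per rollout, to form group-relative advantages during policy optimization.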
## Main Results

### Chart-to-Code Task

#### Results on the ChartMimic test set:
| **Model** | **Exec rate** | **Text** | **Layout** | **Type** | **Color** | **GPT-4o score** | **Overall** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ***Closed-Source MLLMs*** | | | | | | | |
| (2024/02) Gemini-1.0-Pro-Vision | 68.2* | 52.6* | 64.2* | 51.3* | 47.1* | 53.3* | 53.6* |
| (2024/11) GPT-4o-2024-11-20 | 90.00 | 66.55 | 79.31 | 71.83 | 60.84 | 82.50 | 76.06 |
| (2025/04) OpenAI o3 | 90.17 | 74.17 | 80.58 | 71.37 | 63.74 | 86.45 | 79.46 |
| (2025/05) Claude-4-Sonnet | 91.83 | 68.87 | 82.43 | 67.13 | 57.59 | 85.46 | 77.23 |
| (2025/06) Gemini-2.5-Pro | 93.33 | 84.95 | 83.37 | 75.05 | 66.90 | 90.58 | 84.07 |
| ***Open-Source MLLMs*** | | | | | | | |
| (2025/02) Qwen2.5-VL-72B-Instruct | 83.83 | 34.44 | 61.71 | 45.49 | 35.12 | 50.41 | 47.30 |
| (2024/03) DeepSeek-VL-7B | 41.3* | 15.3* | 26.6* | 19.7* | 14.5* | 20.4* | 19.7* |
| (2025/02) LLaVA-OneVision-7B | 17.28 | 7.97 | 13.55 | 9.15 | 7.36 | 10.01 | 9.76 |
| (2025/02) Qwen2.5-VL-7B-Instruct | 68.83 | 30.01 | 55.79 | 36.50 | 26.91 | 39.04 | 38.17 |
| (2025/04) InternVL3-8B | <u>71.67</u> | 45.03 | 57.89 | 45.87 | 38.88 | 54.91 | 50.91 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SFT [with text labels] | 69.00 | <u>56.97</u> | <u>63.60</u> | **60.53** | **51.89** | <u>62.09</u> | <u>60.17</u> |
| Δ (vs Qwen2.5-VL-7B-Instruct) | +0.17 | +26.96 | +7.81 | +24.03 | +24.98 | +23.05 | +22.00 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RRVF (Ours) [without text labels] | **97.83** | **62.47** | **80.97** | <u>53.56</u> | <u>46.41</u> | **67.87** | **64.36** |
| Δ (vs Qwen2.5-VL-7B-Instruct) | +29.00 | +32.46 | +25.18 | +17.06 | +19.50 | +28.83 | +26.19 |

<br>

**Note:**
Performance comparison on the ChartMimic benchmark. We report the metrics from the original ChartMimic benchmark. The best and second-best results among open-source models under 10B parameters are **bolded** and <u>underlined</u>, respectively. Results marked with * are reported by the original benchmark.

---

#### Results on Plot2Code (Zero-Shot):
| **Model** | **Exec Rate** | **Text** | **GPT-4o Score** | ***Text<sub>pass</sub>*** | ***GPT-4o Score<sub>pass</sub>*** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| ***Closed-Source MLLMs*** | | | | | |
| (2023/09) GPT-4V | 84.1* | 48.53* | 5.45* | *57.7\** | *6.48\** |
| (2024/02) Gemini-1.0-Pro-Vision | 68.2* | 36.56* | 3.45* | *53.6\** | *5.06\** |
| (2024/06) Claude-3-Sonnet | 75.8* | 35.40* | 4.08* | *46.7\** | *5.38\** |
| (2024/11) GPT-4o-2024-11-20 | 90.15 | 48.91 | 6.09 | *54.25* | *6.76* |
| (2025/04) OpenAI o3 | 87.12 | 57.65 | 6.70 | *66.17* | *7.69* |
| (2025/05) Claude-4-Sonnet | 92.42 | 56.86 | 6.16 | *61.52* | *6.76* |
| (2025/06) Gemini-2.5-Pro | 87.88 | 71.70 | 7.65 | *81.59* | *8.71* |
| ***Open-Source MLLMs*** | | | | | |
| (2025/02) Qwen2.5-VL-72B-Instruct | 83.33 | 56.74 | 5.79 | *68.09* | *6.95* |
| (2024/03) Mini-Gemini-8x7B-HD | 73.5* | 29.91* | 2.84* | *40.7\** | *3.87\** |
| (2025/02) LLaVA-OneVision-7B | <u>84.09</u> | 26.72 | 2.75 | *31.78* | *3.27* |
| (2025/02) Qwen2.5-VL-7B-Instruct | 70.46 | <u>35.80</u> | <u>3.40</u> | *50.81* | *4.82* |
| (2025/04) InternVL3-8B | 76.52 | 30.67 | 3.25 | *40.08* | *4.25* |
| --- | --- | --- | --- | --- | --- |
| SFT [with text labels, ChartMimic trained] | 49.24 | 21.63 | 2.47 | *43.93* | *5.02* |
| Δ (vs Qwen2.5-VL-7B-Instruct) | -21.22 | -14.17 | -0.93 | - | - |
| --- | --- | --- | --- | --- | --- |
| RRVF (Ours) [without text labels] | **96.21** | **39.89** | **4.44** | *41.46* | *4.61* |
| Δ (vs Qwen2.5-VL-7B-Instruct) | +25.75 | +4.09 | +1.04 | - | - |

<br>

**Note:**
Performance comparison on the Plot2Code benchmark. The best and second-best results on the primary metrics (Exec Rate, Text, GPT-4o Score) among open-source models under 10B parameters are **bolded** and <u>underlined</u>, respectively. Results marked with * are reported by the original benchmark.

---

### Web-to-Code Task
#### Results on the WebSight test set:
| **Model** | **CLIP Score** | **GPT Score** |
| :--- | :---: | :---: |
| ***Closed-Source MLLMs*** | | |
| GPT-4o-2024-11-20 | 88.94 | 94.55 |
| OpenAI o3 | 91.58 | 96.49 |
| Claude-4-Sonnet | 92.30 | 96.46 |
| Gemini-2.5-Pro | 77.83 | 75.88 |
| ***Open-Source MLLMs*** | | |
| LLaVA-OneVision-7B | 79.74 | 72.61 |
| Qwen2.5-VL-7B-Instruct | 83.50 | 84.17 |
| InternVL3-8B | 84.17 | 85.54 |
| --- | --- | --- |
| **RRVF (Ours)** | **88.29** | **91.50** |

<br>

**Note:**
Performance comparison on the WebSight benchmark for web interface generation. The best results among open-source models under 10B parameters are **bolded**.

---

## Usage

You can use the model with the `transformers` library. Below is a simplified example for inference:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

# Load model and processor (the base model is Qwen2.5-VL-7B-Instruct)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chenzju/rrvf_chartmimic",  # ChartMimic model; for WebSight use "chenzju/rrvf_websight"
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("chenzju/rrvf_chartmimic")

# Prepare image and text prompt
# Replace with your actual image path
image = Image.open("./path/to/your/chart_image.png").convert("RGB")
prompt = "Generate the code for this chart."

messages = [
    {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": prompt}]}
]

# Apply chat template and process inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[text], images=[image], return_tensors="pt")
inputs = inputs.to(model.device)

# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Keep only the newly generated tokens, then decode
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
generated_text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(generated_text)
```

## Citation

If you use our work in your research, please cite our paper:

```bibtex
@misc{chen2025learningimagesvisualreinforcement,
      title={Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback},
      author={Yang Chen and Yufan Shen and Wenxuan Huang and Sheng Zhou and Qunshu Lin and Xinyu Cai and Zhi Yu and Jiajun Bu and Botian Shi and Yu Qiao},
      year={2025},
      eprint={2507.20766},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.20766},
}
```

## Acknowledgements
- We thank the verl and DeepEyes frameworks.
- We thank the creators of the ChartMimic, Plot2Code, and WebSight datasets.
- We thank the VLMEvalKit team.