---
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
tags:
- multimodal
- vision-language
- visual-reasoning
- reinforcement-learning
- qwen2.5-vl
- math
- reasoning
datasets:
- OpenMMReasoner-Data
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
---
# Frankenstein-IN
**Frankenstein-IN** is the cold-start instruction-tuned model (the initialization used before reinforcement training) from the paper:
> **[What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis](https://arxiv.org/abs/2602.12395)**
>
> Xirui Li\*, Ming Li\*, Tianyi Zhou
>
> University of Maryland | Mohamed bin Zayed University of Artificial Intelligence
>
> *(\* Co-first Authors)*
This model serves as the **IN (Instruction-tuned) checkpoint** before reinforcement learning, built on the [OpenMMReasoner](https://arxiv.org/abs/2511.16334) training recipe with [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) as the base model.
## Overview
Our paper introduces a **Frankenstein-style analysis framework** to understand *what* reinforcement learning (RL) actually improves in vision-language models (VLMs) for visual reasoning. Rather than relying on end-to-end benchmark scores, we decompose VLMs at the granularity of transformer layers and probe their functional roles through:
1. **Functional Localization via Causal Probing** — localizing vision- and reasoning-related computations along transformer depth
2. **Update Characterization via Parameter Comparison** — showing that IN and RL differ systematically in update magnitude and geometry
3. **Transferability Test via Model Merging** — transplanting RL-refined regions into IN models to test causal contributions
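The merging test (step 3) can be pictured as a layer-wise state-dict transplant: start from the IN checkpoint and swap in the RL checkpoint's weights for a chosen depth range. The sketch below is a minimal illustration, not the paper's released code; the `model.layers.<i>.` key pattern is an assumption based on Qwen2.5-VL parameter naming.

```python
import re

def transplant_layers(in_state, rl_state, layer_range):
    """Build a Frankenstein state dict: IN weights everywhere, except the
    decoder layers in layer_range (inclusive), which come from RL.

    Assumes keys follow the Qwen2.5-VL pattern "model.layers.<i>.<...>";
    adjust the regex if your checkpoint uses a different prefix.
    """
    lo, hi = layer_range
    merged = dict(in_state)  # copy of the IN checkpoint
    pattern = re.compile(r"model\.layers\.(\d+)\.")
    for key, value in rl_state.items():
        m = pattern.match(key)
        if m and lo <= int(m.group(1)) <= hi:
            merged[key] = value  # take the RL-refined weights for this layer
    return merged
```

With real checkpoints, `in_state` and `rl_state` would come from `model.state_dict()` of the two loaded models, and the merged dict would be loaded back with `load_state_dict`.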
### Key Findings
- RL does **not** uniformly improve visual perception or standalone reasoning
- RL induces **structured refinements concentrated in mid-to-late layers**, improving vision-to-reasoning alignment
- These mid-to-late refinements are both **transferable** (via merging) and **necessary** (via freezing) for RL gains
- Freezing **late layers** during RL training leads to a pronounced drop in reasoning performance
## Evaluation Results
### Fine-grained and Benchmark Metrics
| Model | Vision (M_vis) | Vision-to-Reasoning (M_v2r) | Reasoning (M_rea) | MathVista | MathVision | MathVerse |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| **Frankenstein-IN** (this model) | 34.0 | 21.0 | 26.0 | 46.5 | 18.4 | 37.0 |
| Frankenstein-RL | 33.0 | 29.0 | 34.0 | 48.1 | 14.1 | 37.8 |
### Parameter Freezing Analysis (RL Training)
| Model | Vision (M_vis) | Vision-to-Reasoning (M_v2r) | Reasoning (M_rea) | MathVista | MathVision | MathVerse |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| RL - Frozen **Early** Block | **35.0** | **31.0** | 36.0 | **48.2** | **21.0** | 34.5 |
| RL - Frozen **Mid** Block | 25.0 | 29.0 | **38.0** | 46.5 | 15.5 | **35.7** |
| RL - Frozen **Late** Block | 30.0 | 27.0 | 34.0 | 47.9 | 16.8 | 35.0 |
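As a rough illustration of the freezing setup above, one can select decoder-layer parameters by depth block before RL training. This is a sketch only: the 36-layer depth and the even three-way split are assumptions, not the paper's exact configuration.

```python
import re

def frozen_param_names(param_names, block, num_layers=36):
    """Return the decoder-layer parameter names belonging to a depth block.

    Splits the layer stack into thirds ("early"/"mid"/"late") — an assumed
    partition; num_layers=36 matches Qwen2.5-VL-3B's decoder depth.
    """
    third = num_layers // 3
    blocks = {
        "early": range(0, third),
        "mid": range(third, 2 * third),
        "late": range(2 * third, num_layers),
    }
    keep = blocks[block]
    pattern = re.compile(r"model\.layers\.(\d+)\.")
    return [n for n in param_names
            if (m := pattern.match(n)) and int(m.group(1)) in keep]
```

In training code, you would then set `param.requires_grad = False` for every parameter whose name is returned, leaving the rest of the model trainable.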
## Quick Start
### Installation
```bash
pip install transformers accelerate
pip install qwen-vl-utils[decord]==0.0.8
```
### Inference
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "AIcell/Frankenstein-IN",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("AIcell/Frankenstein-IN")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://your-image-url.jpg"},
            {"type": "text", "text": "Please solve this math problem step by step."},
        ],
    }
]

# Build the chat prompt and extract vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
## Related Resources
| Resource | Link |
|:---|:---|
| Paper | [arXiv:2602.12395](https://arxiv.org/abs/2602.12395) |
| Frankenstein-RL Model | [AIcell/Frankenstein-RL](https://huggingface.co/AIcell/Frankenstein-RL) |
| Base Model | [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) |
| OpenMMReasoner | [arXiv:2511.16334](https://arxiv.org/abs/2511.16334) |
## Citation
```bibtex
@article{li2026frankenstein,
title={What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis},
author={Li, Xirui and Li, Ming and Zhou, Tianyi},
journal={arXiv preprint arXiv:2602.12395},
year={2026}
}
```
## License
This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).