---
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
tags:
  - multimodal
  - vision-language
  - visual-reasoning
  - reinforcement-learning
  - qwen2.5-vl
  - math
  - reasoning
datasets:
  - OpenMMReasoner-Data
language:
  - en
pipeline_tag: image-text-to-text
library_name: transformers
---

# Frankenstein-IN

**Frankenstein-IN** is the cold-start initialized (instruction-tuned, pre-reinforcement-training) model from the paper:

> **[What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis](https://arxiv.org/abs/2602.12395)**
>
> Xirui Li\*, Ming Li\*, Tianyi Zhou
>
> University of Maryland  |  Mohamed bin Zayed University of Artificial Intelligence
>
> *(\* Co-first Authors)*

This model serves as the **IN (Instruction-tuned) checkpoint** before reinforcement learning, built on the [OpenMMReasoner](https://arxiv.org/abs/2511.16334) training recipe with [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) as the base model.

## Overview

Our paper introduces a **Frankenstein-style analysis framework** to understand *what* reinforcement learning (RL) actually improves in vision-language models (VLMs) for visual reasoning. Rather than relying on end-to-end benchmark scores, we decompose VLMs at the granularity of transformer layers and probe their functional roles through:

1. **Functional Localization via Causal Probing** — localizing vision- and reasoning-related computations along transformer depth
2. **Update Characterization via Parameter Comparison** — showing that IN and RL differ systematically in update magnitude and geometry
3. **Transferability Test via Model Merging** — transplanting RL-refined regions into IN models to test causal contributions

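The parameter-comparison step (2) can be sketched as a per-layer update-magnitude computation. The snippet below is an illustrative sketch, not the paper's exact procedure: the `model.layers.` key prefix and the scalar stand-in weights are assumptions; real checkpoints hold tensors in a `state_dict`.

```python
# Sketch: per-layer update magnitude between a base and a tuned checkpoint.
# The key prefix "model.layers." is an assumed naming convention; scalars
# stand in for weight tensors to keep the example self-contained.
import math
import re
from collections import defaultdict

def per_layer_update_norm(base_sd, tuned_sd, prefix="model.layers."):
    """L2 norm of the weight delta, aggregated per transformer layer."""
    pat = re.compile(re.escape(prefix) + r"(\d+)\.")
    sq = defaultdict(float)
    for key, base_w in base_sd.items():
        m = pat.match(key)
        if m:
            delta = tuned_sd[key] - base_w
            sq[int(m.group(1))] += delta * delta
    return {layer: math.sqrt(s) for layer, s in sq.items()}

# Toy demo: a 4-layer "model" whose updates grow with depth.
base = {f"model.layers.{i}.mlp.weight": 0.0 for i in range(4)}
tuned = {f"model.layers.{i}.mlp.weight": float(i) for i in range(4)}
print(per_layer_update_norm(base, tuned))  # {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}
```
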
### Key Findings

- RL does **not** uniformly improve visual perception or standalone reasoning
- RL induces **structured refinements concentrated in mid-to-late layers**, improving vision-to-reasoning alignment
- These mid-to-late refinements are both **transferable** (via merging) and **necessary** (via freezing) for RL gains
- Freezing **late layers** during RL training leads to a pronounced drop in reasoning performance

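The transferability test (transplanting RL-refined layer blocks into the IN model) can be sketched as a key-wise merge of two state dicts. This is a minimal illustration under assumed naming (`model.layers.` prefix, scalar stand-ins for tensors), not the paper's exact merging code.

```python
# Sketch: Frankenstein-style layer transplant. Layers [start, end) come from
# the RL checkpoint's state_dict; everything else stays from the IN checkpoint.
# The "model.layers." key prefix is an assumption for illustration.
import re

def transplant_layers(in_sd, rl_sd, start, end, prefix="model.layers."):
    """Return a merged state_dict with a contiguous block swapped in from rl_sd."""
    pat = re.compile(re.escape(prefix) + r"(\d+)\.")
    merged = dict(in_sd)
    for key, value in rl_sd.items():
        m = pat.match(key)
        if m and start <= int(m.group(1)) < end:
            merged[key] = value
    return merged

# Toy demo: transplant the "late" half of a 4-layer model.
in_sd = {f"model.layers.{i}.mlp.weight": 0.0 for i in range(4)}
rl_sd = {f"model.layers.{i}.mlp.weight": 1.0 for i in range(4)}
merged = transplant_layers(in_sd, rl_sd, start=2, end=4)
print([merged[f"model.layers.{i}.mlp.weight"] for i in range(4)])  # [0.0, 0.0, 1.0, 1.0]
```
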
## Evaluation Results

### Fine-grained and Benchmark Metrics

| Model | Vision (M_vis) | Vision-to-Reasoning (M_v2r) | Reasoning (M_rea) | MathVista | MathVision | MathVerse |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| **Frankenstein-IN** (this model) | 34.0 | 21.0 | 26.0 | 46.5 | 18.4 | 37.0 |
| Frankenstein-RL | 33.0 | 29.0 | 34.0 | 48.1 | 14.1 | 37.8 |

### Parameter Freezing Analysis (RL Training)

| Model | Vision (M_vis) | Vision-to-Reasoning (M_v2r) | Reasoning (M_rea) | MathVista | MathVision | MathVerse |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| RL - Frozen **Early** Block | **35.0** | **31.0** | 36.0 | **48.2** | **21.0** | 34.5 |
| RL - Frozen **Mid** Block | 25.0 | 29.0 | **38.0** | 46.5 | 15.5 | **35.7** |
| RL - Frozen **Late** Block | 30.0 | 27.0 | 34.0 | 47.9 | 16.8 | 35.0 |

## Quick Start

### Installation

```bash
pip install transformers accelerate
pip install "qwen-vl-utils[decord]==0.0.8"
```

### Inference

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "AIcell/Frankenstein-IN",
    torch_dtype="auto",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("AIcell/Frankenstein-IN")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://your-image-url.jpg"},
            {"type": "text", "text": "Please solve this math problem step by step."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

## Related Resources

| Resource | Link |
|:---|:---|
| Paper | [arXiv:2602.12395](https://arxiv.org/abs/2602.12395) |
| Frankenstein-RL Model | [AIcell/Frankenstein-RL](https://huggingface.co/AIcell/Frankenstein-RL) |
| Base Model | [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) |
| OpenMMReasoner | [arXiv:2511.16334](https://arxiv.org/abs/2511.16334) |

## Citation

```bibtex
@article{li2026frankenstein,
  title={What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis},
  author={Li, Xirui and Li, Ming and Zhou, Tianyi},
  journal={arXiv preprint arXiv:2602.12395},
  year={2026}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).