---
license: apache-2.0
tags:
- multimodal
- vision-language
- latent-reasoning
pipeline_tag: visual-question-answering
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---

# V-Reflection (7B)

**V-Reflection** is a Qwen2.5-VL-based multimodal model that turns the MLLM into an **active interrogator** of the image through a *think-then-look* **visual reflection** mechanism: it runs a fixed-length sequence of latent visual reasoning steps before producing its answer.

| Resource | Link |
|----------|------|
| Paper | [arxiv.org/abs/2604.03307](https://arxiv.org/abs/2604.03307) |
| Project page | [idea-research.github.io/V-Reflection](https://idea-research.github.io/V-Reflection/) |
| Code & scripts | [github.com/IDEA-Research/V-Reflection](https://github.com/IDEA-Research/V-Reflection) |

## Model summary

- **Architecture:** Built on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) with a **Box-Guided Compression Module (BCM)** and **Dynamic Autoregressive Compression (DAC)** for latent visual probing.
- **Stage 1 (BCM):** Box-guided compression produces stable pixel-to-latent targets; *stochastic decoupled alignment* trains the resampler and LLM jointly.
- **Stage 2 (DAC):** Student maps LLM hidden states into dynamic probes over the global visual map, with MSE distillation from a frozen BCM teacher.
- **Inference:** BCM/DAC are inactive at inference; decoding is end-to-end autoregressive in latent space (8-step latent reasoning by default; see the conceptual sketch below).

<p align="center">
  <img src="https://raw.githubusercontent.com/IDEA-Research/V-Reflection/main/images/Framework.png" width="85%" alt="V-Reflection framework">
</p>
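
For intuition only, here is a minimal, self-contained sketch of what a fixed-length *think-then-look* loop can look like: a hidden state repeatedly probes a global visual feature map for a fixed number of latent steps before the answer is decoded. All shapes, module names, and the probing operator below are invented for illustration and do **not** reflect the actual `QwenWithLVR` implementation.

```python
# Toy illustration of fixed-length "think-then-look" latent reasoning.
# NOT the V-Reflection implementation; every name and shape here is hypothetical.
import torch
import torch.nn.functional as F

hidden_dim, num_latent_steps = 512, 8           # 8 latent steps, as in the model card
visual_map = torch.randn(1, 1024, hidden_dim)   # stand-in for the global visual feature map
state = torch.randn(1, 1, hidden_dim)           # stand-in for the last LLM hidden state
probe_proj = torch.nn.Linear(hidden_dim, hidden_dim)  # hypothetical probe projection

for _ in range(num_latent_steps):
    query = probe_proj(state)                                       # "think": form a visual probe
    scores = query @ visual_map.transpose(1, 2) / hidden_dim ** 0.5
    attn = F.softmax(scores, dim=-1)                                 # "look": attend over the visual map
    looked = attn @ visual_map
    state = state + looked                                           # fold visual evidence back into the state

# after the fixed-length latent loop, `state` would condition the usual
# autoregressive decoding of the visible answer tokens
print(state.shape)  # torch.Size([1, 1, 512])
```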

## Repository contents

This Hub repo includes the **model weights** (sharded Safetensors) and the **pre-formatted LVR training annotations** used by the project:

| File(s) | Role |
|---------|------|
| `*.safetensors`, `config.json`, tokenizer assets | Fine-tuned **V-Reflection** checkpoint |
| `meta_data_lvr_sft_stage1.json` | Meta config for default SROIE + DUDE mix |
| `viscot_sroie_dude_lvr_formatted.json` | SROIE + DUDE subset |
| `viscot_363k_lvr_formatted.json` | Full Visual CoT–style 363K split |

Download the images for Visual CoT and the other listed datasets from their official sources (see the [code repo data section](https://github.com/IDEA-Research/V-Reflection#data-preparation)).
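
If you only need the annotation files, they can be fetched individually with `huggingface_hub` (a sketch; `<this-repo-id>` is a placeholder for this model card's Hub ID):

```python
# Sketch: fetch the pre-formatted LVR annotation JSONs from this repo.
from huggingface_hub import hf_hub_download

REPO_ID = "<this-repo-id>"  # placeholder: replace with this model card's repo ID
for fname in (
    "meta_data_lvr_sft_stage1.json",
    "viscot_sroie_dude_lvr_formatted.json",
    "viscot_363k_lvr_formatted.json",
):
    local_path = hf_hub_download(repo_id=REPO_ID, filename=fname)
    print(local_path)
```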

## Intended use

- Research on **visual reasoning**, **high-resolution understanding**, and **latent multimodal reasoning**.
- **Not** a drop-in replacement for stock Qwen2.5-VL in generic chat UIs without the official **custom model class** and generation path.

## How to run

Loading and evaluation rely on the **`QwenWithLVR`** implementation and scripts in the GitHub repository (training, packing, and benchmarks are documented there).

1. Clone **[IDEA-Research/V-Reflection](https://github.com/IDEA-Research/V-Reflection)** and follow `README.md` (environment, data layout).
2. Download this checkpoint into a local directory (or rely on `hf_hub_download` as used by their evaluation scripts).
3. Point `EVAL_CHECKPOINT_PATH` (or the training script checkpoint args) to the unpacked folder.

Example evaluation entrypoint (from upstream docs):

```bash
bash scripts_release/evaluation/evaluation_7b_stage2.sh
```
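
As a sketch of steps 2–3, the checkpoint can also be fetched programmatically and the resulting path exported before running the evaluation script (`<this-repo-id>` is a placeholder; confirm the exact environment variable and script names against the upstream README):

```python
# Sketch: download the full checkpoint and emit the export line for the eval script.
from huggingface_hub import snapshot_download

# placeholder repo ID; replace with this model card's Hub ID
local_dir = snapshot_download(repo_id="<this-repo-id>")
# the upstream evaluation script is assumed to read EVAL_CHECKPOINT_PATH from the environment
print(f"export EVAL_CHECKPOINT_PATH={local_dir}")
```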

## Results (reported in project materials)

| Benchmark | V-Reflection | Qwen2.5-VL-7B |
|-----------|:------------:|:-------------:|
| MMVP | **72.3** | 66.7 |
| BLINK | **56.4** | 54.5 |
| V* | **81.7** | 78.5 |
| HRBench-4K | **72.6** | 68.0 |
| HRBench-8K | **66.3** | 63.8 |
| MME-RealWorld-Lite | **53.9** | 45.8 |

## Limitations

- Requires the **official codebase** for correct loading and latent reasoning behavior.
- Performance and safety have **not** been audited for unrestricted production deployment; exercise caution before using it in real-world products.
- Biases may mirror base model and training data distributions.

## License

Apache-2.0 (same as the codebase). See the [license file](https://github.com/IDEA-Research/V-Reflection/blob/main/LICENSE) in the GitHub repository.

## Acknowledgements

[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [LVR](https://github.com/VincentLeebang/lvr), [Visual-CoT](https://github.com/deepcs233/Visual-CoT), [InternVL](https://github.com/OpenGVLab/InternVL).