---
license: apache-2.0
language:
- en
tags:
- multimodal
- vision-language
- visual-reasoning
- latent-reasoning
- qwen2_5_vl
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---

# V-Reflection (7B)

**V-Reflection** is a Qwen2.5-VL-based multimodal model that turns the MLLM into an **active interrogator** via a *think-then-look* **visual reflection** mechanism: fixed-length latent visual reasoning performed before answering.

| Resource | Link |
|----------|------|
| Project page | [idea-research.github.io/V-Reflection](https://idea-research.github.io/V-Reflection/) |
| Code & scripts | [github.com/IDEA-Research/V-Reflection](https://github.com/IDEA-Research/V-Reflection) |

## Model summary

- **Architecture:** Built on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) with **Box-Guided Compression (BCM)** and **Dynamic Autoregressive Compression (DAC)** for latent visual probing.
- **Stage 1 (BCM):** Box-guided compression produces stable pixel-to-latent targets; *stochastic decoupled alignment* trains the resampler and LLM jointly.
- **Stage 2 (DAC):** The student maps LLM hidden states into dynamic probes over the global visual map, with MSE distillation from a frozen BCM teacher (see the sketch after the figure below).
- **Inference:** BCM/DAC are inactive at inference; decoding is end-to-end autoregressive in latent space (8-step latent reasoning by default).

<p align="center">
<img src="https://raw.githubusercontent.com/IDEA-Research/V-Reflection/main/images/Framework.png" width="85%" alt="V-Reflection framework">
</p>
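
For intuition, the Stage 2 objective can be pictured as plain MSE distillation from a frozen teacher to a trainable student head. The sketch below is schematic only: the `nn.Linear` stand-ins, dimensions, and variable names are assumptions for illustration, not the released implementation.

```python
# Schematic Stage 2 sketch: a student head maps LLM hidden states to probe
# vectors and is trained to match a frozen teacher with an MSE loss. All
# modules, shapes, and names here are illustrative assumptions.
import torch
import torch.nn as nn

hidden_dim, probe_dim, num_probes = 3584, 1024, 8  # assumed sizes

student = nn.Linear(hidden_dim, num_probes * probe_dim)  # stands in for the DAC student
teacher = nn.Linear(hidden_dim, num_probes * probe_dim)  # stands in for the frozen BCM teacher
teacher.requires_grad_(False)

llm_hidden = torch.randn(2, hidden_dim)  # e.g. hidden state per latent reasoning step

with torch.no_grad():
    target = teacher(llm_hidden)  # stable targets from the frozen teacher
pred = student(llm_hidden)        # dynamic probes predicted by the student

loss = nn.functional.mse_loss(pred, target)
loss.backward()  # gradients flow only into the student head
```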

## Repository contents

This Hub repo includes the **model weights** (sharded Safetensors) and the **pre-formatted LVR training annotations** used in the paper project:

| File(s) | Role |
|---------|------|
| `*.safetensors`, `config.json`, tokenizer assets | Fine-tuned **V-Reflection** checkpoint |
| `meta_data_lvr_sft_stage1.json` | Meta config for the default SROIE + DUDE mix |
| `viscot_sroie_dude_lvr_formatted.json` | SROIE + DUDE subset |
| `viscot_363k_lvr_formatted.json` | Full Visual CoT–style 363K split |

Download the images for Visual CoT and the other listed datasets from their official sources (see the [code repo data section](https://github.com/IDEA-Research/V-Reflection#data-preparation)).
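
The annotation files are plain JSON; their exact schema is documented in the code repo. A quick structural peek (assuming the file has been downloaded locally) might look like:

```python
# Inspect one LVR annotation file. Only assumes valid JSON with a list or
# dict at the top level; the record fields are defined upstream.
import json

with open("viscot_sroie_dude_lvr_formatted.json") as f:
    data = json.load(f)

print(type(data).__name__, len(data))
sample = data[0] if isinstance(data, list) else next(iter(data.values()))
print(sample)  # one record, to see which fields the scripts expect
```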
| 48 |
+
|
| 49 |
+
## Intended use
|
| 50 |
+
|
| 51 |
+
- Research on **visual reasoning**, **high-resolution understanding**, and **latent multimodal reasoning**.
|
| 52 |
+
- **Not** a drop-in replacement for stock Qwen2.5-VL in generic chat UIs without the official **custom model class** and generation path.
|
| 53 |
+
|
| 54 |
+
## How to run
|
| 55 |
+
|
| 56 |
+
Loading and evaluation rely on the **`QwenWithLVR`** implementation and scripts in the GitHub repository (training, packing, and benchmarks are documented there).
|
| 57 |
+
|
| 58 |
+
1. Clone **[IDEA-Research/V-Reflection](https://github.com/IDEA-Research/V-Reflection)** and follow `README.md` (environment, data layout).
|
| 59 |
+
2. Download this checkpoint into a local directory (or rely on `hf_hub_download` from their evaluation scripts).
|
| 60 |
+
3. Point `EVAL_CHECKPOINT_PATH` (or the training script checkpoint args) to the unpacked folder.
|
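
A minimal sketch of steps 2 and 3, assuming `huggingface_hub` is installed. The repo id below is a placeholder, and whether the evaluation script reads `EVAL_CHECKPOINT_PATH` from the environment should be confirmed against the upstream scripts:

```python
# Fetch the sharded checkpoint and print an export line for the eval scripts.
# REPO_ID is a placeholder for this model's actual Hub id.
from huggingface_hub import snapshot_download

REPO_ID = "IDEA-Research/V-Reflection-7B"  # placeholder: use the real repo id
local_dir = snapshot_download(repo_id=REPO_ID)

print(f"export EVAL_CHECKPOINT_PATH={local_dir}")
```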

Example evaluation entrypoint (from the upstream docs):

```bash
bash scripts_release/evaluation/evaluation_7b_stage2.sh
```

## Results (reported in project materials)

| Benchmark | V-Reflection | Qwen2.5-VL-7B |
|-----------|:------------:|:-------------:|
| MMVP | **72.3** | 66.7 |
| BLINK | **56.4** | 54.5 |
| V* | **81.7** | 78.5 |
| HRBench-4K | **72.6** | 68.0 |
| HRBench-8K | **66.3** | 63.8 |
| MME-RealWorld-Lite | **53.9** | 45.8 |

## Limitations

- Requires the **official codebase** for correct loading and latent-reasoning behavior.
- Performance and safety have **not** been audited for unrestricted production deployment; exercise judgment before using it in real-world products.
- Biases may mirror the base model and training-data distributions.

## License

Apache-2.0 (same as the codebase). See the [license file](https://github.com/IDEA-Research/V-Reflection/blob/main/LICENSE) in the GitHub repository.

## Acknowledgements

[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [LVR](https://github.com/VincentLeebang/lvr), [Visual-CoT](https://github.com/deepcs233/Visual-CoT), [InternVL](https://github.com/OpenGVLab/InternVL).