---
license: apache-2.0
language:
- en
tags:
- multimodal
- vision-language
- visual-reasoning
- latent-reasoning
- qwen2_5_vl
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---

# V-Reflection (7B)

**V-Reflection** is a Qwen2.5-VL-based multimodal model that turns the MLLM into an **active interrogator** via a *think-then-look* **visual reflection** mechanism: fixed-length latent visual reasoning before answering.

| Resource | Link |
|----------|------|
| Project page | [idea-research.github.io/V-Reflection](https://idea-research.github.io/V-Reflection/) |
| Code & scripts | [github.com/IDEA-Research/V-Reflection](https://github.com/IDEA-Research/V-Reflection) |

## Model summary

- **Architecture:** Built on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) with **Box-Guided Compression (BCM)** and **Dynamic Autoregressive Compression (DAC)** for latent visual probing.
- **Stage 1 (BCM):** Box-guided compression produces stable pixel-to-latent targets; *stochastic decoupled alignment* trains the resampler and LLM jointly.
- **Stage 2 (DAC):** Student maps LLM hidden states into dynamic probes over the global visual map, with MSE distillation from a frozen BCM teacher.
- **Inference:** BCM/DAC are inactive at inference; decoding is end-to-end autoregressive in latent space (8-step latent reasoning by default).
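
The Stage 2 objective can be illustrated with a toy sketch. This is not the repository's implementation: it assumes, for illustration only, that a "dynamic probe" is a dot-product attention pooling over visual tokens driven by a query derived from an LLM hidden state, and that the frozen teacher supplies a fixed target latent. The function names `probe_visual_map` and `mse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def probe_visual_map(query, visual_tokens):
    """Pool visual tokens with dot-product attention driven by one query.

    ASSUMPTION: attention pooling stands in for the student's probing step;
    the real DAC module may be structured differently.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, tok)) / scale for tok in visual_tokens]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, visual_tokens)) for i in range(len(query))]


def mse(student, teacher):
    """Mean squared error between a student latent and a fixed teacher latent."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Toy data: two visual tokens and a teacher latent that is treated as a constant.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
teacher_latent = [1.0, 0.0]
student_latent = probe_visual_map([2.0, 0.0], visual_tokens)
loss = mse(student_latent, teacher_latent)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the teacher output would be computed without gradients and only the student parameters would be updated; the sketch shows just the loss computation.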

<p align="center">
<img src="https://raw.githubusercontent.com/IDEA-Research/V-Reflection/main/images/Framework.png" width="85%" alt="V-Reflection framework">
</p>

## Repository contents

This Hub repo includes **model weights** (sharded Safetensors) and the **pre-formatted LVR training annotations** used in the project:

| File(s) | Role |
|---------|------|
| `*.safetensors`, `config.json`, tokenizer assets | Fine-tuned **V-Reflection** checkpoint |
| `meta_data_lvr_sft_stage1.json` | Meta config for default SROIE + DUDE mix |
| `viscot_sroie_dude_lvr_formatted.json` | SROIE + DUDE subset |
| `viscot_363k_lvr_formatted.json` | Full Visual CoT–style 363K split |

The images are not included. Download images for Visual CoT and the other listed datasets from their official sources (see the [code repo data section](https://github.com/IDEA-Research/V-Reflection#data-preparation)).
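
Before wiring an annotation file into training, it can help to look at its shape. The sketch below makes no assumption about the record schema, which this card does not document: it only reports the top-level JSON type, the number of entries, and the keys of the first record when the file is a list of objects. The helper name `summarize_annotations` is hypothetical.

```python
import json
from pathlib import Path


def summarize_annotations(path):
    """Return the top-level type, entry count, and first-record keys of a JSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    count = len(data) if isinstance(data, (list, dict)) else None
    summary = {"type": type(data).__name__, "count": count}
    # Peek at the first record only when the file is a non-empty list of objects.
    if isinstance(data, list) and data and isinstance(data[0], dict):
        summary["first_record_keys"] = sorted(data[0].keys())
    return summary


# Usage (assumes the file was downloaded to the working directory):
# print(summarize_annotations("viscot_sroie_dude_lvr_formatted.json"))
```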

## Intended use

- Research on **visual reasoning**, **high-resolution understanding**, and **latent multimodal reasoning**.
- **Not** a drop-in replacement for stock Qwen2.5-VL in generic chat UIs without the official **custom model class** and generation path.

## How to run

Loading and evaluation rely on the **`QwenWithLVR`** implementation and scripts in the GitHub repository (training, packing, and benchmarks are documented there).

1. Clone **[IDEA-Research/V-Reflection](https://github.com/IDEA-Research/V-Reflection)** and follow `README.md` (environment, data layout).
2. Download this checkpoint into a local directory (or rely on `hf_hub_download` from the repository's evaluation scripts).
3. Point `EVAL_CHECKPOINT_PATH` (or the training script checkpoint args) to the unpacked folder.
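
Steps 2 and 3 can be sketched as follows. `hf_hub_download` fetches one file per call, so this sketch uses `snapshot_download` from `huggingface_hub` to pull the whole repo instead. The repo ID and target directory in the usage comment are placeholders, and `missing_checkpoint_files` is a hypothetical helper that only checks for the file types listed under "Repository contents"; it does not validate the weights.

```python
from pathlib import Path


def missing_checkpoint_files(checkpoint_dir):
    """List expected checkpoint pieces that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


def download_checkpoint(repo_id, local_dir):
    """Fetch a full Hub repo snapshot; requires the huggingface_hub package."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (placeholders, replace with this repo's ID and your own path):
# folder = download_checkpoint("<hub-repo-id>", "./v-reflection-7b")
# print(missing_checkpoint_files(folder))  # an empty list means the basics are present
```

Running the check before setting `EVAL_CHECKPOINT_PATH` catches a partial download early, before the evaluation script tries to load the model.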

Example evaluation entrypoint (from upstream docs):

```bash
bash scripts_release/evaluation/evaluation_7b_stage2.sh
```

## Results (reported in project materials)

| Benchmark | V-Reflection | Qwen2.5-VL-7B |
|-----------|:------------:|:-------------:|
| MMVP | **72.3** | 66.7 |
| BLINK | **56.4** | 54.5 |
| V* | **81.7** | 78.5 |
| HRBench-4K | **72.6** | 68.0 |
| HRBench-8K | **66.3** | 63.8 |
| MME-RealWorld-Lite | **53.9** | 45.8 |
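
The per-benchmark gains over the base model follow directly from the table: they range from 1.9 points on BLINK to 8.1 points on MME-RealWorld-Lite. A short sketch for recomputing them, using only the scores reported above (the unweighted mean is a convenience summary, not a figure from the project materials):

```python
# Scores copied from the table above: (V-Reflection, Qwen2.5-VL-7B).
SCORES = {
    "MMVP": (72.3, 66.7),
    "BLINK": (56.4, 54.5),
    "V*": (81.7, 78.5),
    "HRBench-4K": (72.6, 68.0),
    "HRBench-8K": (66.3, 63.8),
    "MME-RealWorld-Lite": (53.9, 45.8),
}


def absolute_gains(scores):
    """Return the point difference (ours minus base) per benchmark, to one decimal."""
    return {name: round(ours - base, 1) for name, (ours, base) in scores.items()}


gains = absolute_gains(SCORES)
for name, gain in gains.items():
    print(f"{name}: +{gain}")

# Unweighted mean across benchmarks; benchmarks differ in size and difficulty.
mean_gain = sum(gains.values()) / len(gains)
print(f"mean gain: +{mean_gain:.2f}")
```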

## Limitations

- Requires the **official codebase** for correct loading and latent reasoning behavior.
- Performance and safety have **not** been audited for unrestricted production deployment; use judgment for real-world products.
- Biases may mirror base model and training data distributions.

## License

Apache-2.0 (same as the codebase). See the [license file](https://github.com/IDEA-Research/V-Reflection/blob/main/LICENSE) in the GitHub repository.

## Acknowledgements

[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [LVR](https://github.com/VincentLeebang/lvr), [Visual-CoT](https://github.com/deepcs233/Visual-CoT), [InternVL](https://github.com/OpenGVLab/InternVL).