	---
	license: apache-2.0
	language:
	- en
	tags:
	- multimodal
	- vision-language
	- visual-reasoning
	- latent-reasoning
	- qwen2_5_vl
	library_name: transformers
	pipeline_tag: image-text-to-text
	base_model: Qwen/Qwen2.5-VL-7B-Instruct
	---

	# V-Reflection (7B)

	V-Reflection is a multimodal model built on Qwen2.5-VL. It turns the MLLM into an active interrogator through a think-then-look visual reflection mechanism: the model runs a fixed number of latent visual reasoning steps before it answers.

	| Resource | Link |
	|----------|------|
	| Project page | [idea-research.github.io/V-Reflection](https://idea-research.github.io/V-Reflection/) |
	| Code & scripts | [github.com/IDEA-Research/V-Reflection](https://github.com/IDEA-Research/V-Reflection) |

	## Model summary

	- Architecture: Built on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) with Box-Guided Compression (BCM) and Dynamic Autoregressive Compression (DAC) for latent visual probing.
	- Stage 1 (BCM): Box-guided compression produces stable pixel-to-latent targets; stochastic decoupled alignment trains the resampler and LLM jointly.
	- Stage 2 (DAC): A student maps LLM hidden states into dynamic probes over the global visual map and is trained with MSE distillation from a frozen BCM teacher.
	- Inference: BCM/DAC are inactive at inference; decoding is end-to-end autoregressive in latent space (8-step latent reasoning by default).
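The Stage 2 objective above is described as MSE distillation from a frozen teacher. The following is a minimal plain-Python sketch of that loss on toy numbers, not the project's implementation: the function name, the nested-list layout, and the values are all hypothetical. In a real training loop the teacher outputs would be computed without gradients, so only the student is updated.

```python
# Toy illustration only: mean squared error between student probes and
# frozen teacher targets. Names, shapes, and values are hypothetical.

def mse_distillation_loss(student, teacher):
    """Mean squared error over all elements of two equally shaped
    [num_latents][dim] nested lists."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher need the same number of latents")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("latent dimensions differ")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# Teacher targets are treated as constants (the teacher is frozen).
teacher_latents = [[1.0, 0.0], [0.0, 1.0]]
student_latents = [[0.5, 0.0], [0.0, 0.5]]

loss = mse_distillation_loss(student_latents, teacher_latents)
print(loss)  # 0.125
```

The loss reaches zero only when every student latent matches its teacher target element by element.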

	<p align="center">
	<img src="https://raw.githubusercontent.com/IDEA-Research/V-Reflection/main/images/Framework.png" width="85%" alt="V-Reflection framework">
	</p>

	## Repository contents

	This Hub repo includes the model weights (sharded Safetensors) and the pre-formatted LVR training annotations used in the project:

	| File(s) | Role |
	|---------|------|
	| `.safetensors`, `config.json`, tokenizer assets | Fine-tuned V-Reflection checkpoint |
	| `meta_data_lvr_sft_stage1.json` | Meta config for default SROIE + DUDE mix |
	| `viscot_sroie_dude_lvr_formatted.json` | SROIE + DUDE subset |
	| `viscot_363k_lvr_formatted.json` | Full Visual CoT–style 363K split |

	Download images for Visual CoT / listed datasets from their official sources (see the [code repo data section](https://github.com/IDEA-Research/V-Reflection#data-preparation)).
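The annotation schema is not documented in this card, so it can help to inspect a file before wiring it into training. The helper below is a schema-agnostic sketch: `summarize_annotations` is a hypothetical name, and the code assumes only that the file is valid JSON.

```python
import json
from pathlib import Path


def summarize_annotations(path):
    """Report the top-level JSON type, the entry count, and, when the
    first list element is an object, its sorted keys."""
    with Path(path).open("r", encoding="utf-8") as f:
        data = json.load(f)

    summary = {
        "type": type(data).__name__,
        "entries": len(data) if isinstance(data, (list, dict)) else None,
    }
    # Only peek at the first record; no field names are assumed.
    if isinstance(data, list) and data and isinstance(data[0], dict):
        summary["first_record_keys"] = sorted(data[0])
    return summary


if __name__ == "__main__":
    target = Path("viscot_sroie_dude_lvr_formatted.json")
    if target.exists():  # skip quietly when the file has not been downloaded
        print(summarize_annotations(target))
```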

	## Intended use

	- Research on visual reasoning, high-resolution understanding, and latent multimodal reasoning.
	- Not a drop-in replacement for stock Qwen2.5-VL in generic chat UIs: it needs the custom model class and generation path from the official codebase.

	## How to run

	Loading and evaluation rely on the `QwenWithLVR` implementation and scripts in the GitHub repository (training, packing, and benchmarks are documented there).

	1. Clone [IDEA-Research/V-Reflection](https://github.com/IDEA-Research/V-Reflection) and follow `README.md` (environment, data layout).
	2. Download this checkpoint into a local directory (or rely on `hf_hub_download` from the repository's evaluation scripts).
	3. Point `EVAL_CHECKPOINT_PATH` (or the training script's checkpoint arguments) to the downloaded folder.

	Example evaluation entrypoint (from upstream docs):

	```bash
	bash scripts_release/evaluation/evaluation_7b_stage2.sh
	```
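Steps 2 and 3 can also be sketched in Python. This is a hedged example, not part of the upstream scripts: `download_checkpoint` and `eval_environment` are hypothetical helpers, the repo id is a placeholder to fill in yourself, and the code assumes the evaluation script reads `EVAL_CHECKPOINT_PATH` from the environment. If the variable is set inside the script instead, edit it there. `snapshot_download` is the `huggingface_hub` function for fetching a whole repository.

```python
import os


def download_checkpoint(repo_id, local_dir):
    """Fetch every file of a Hub repo into local_dir and return its path.
    Imported lazily so the rest of this module works without huggingface_hub."""
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def eval_environment(checkpoint_dir, base_env=None):
    """Copy the environment and add EVAL_CHECKPOINT_PATH (assumed to be
    read by the evaluation script) as an absolute path."""
    env = dict(os.environ if base_env is None else base_env)
    env["EVAL_CHECKPOINT_PATH"] = os.path.abspath(checkpoint_dir)
    return env


# Example wiring (placeholder repo id; run from the cloned code repo):
#   import subprocess
#   ckpt = download_checkpoint("<org>/<this-model-repo>", "./checkpoints/v-reflection-7b")
#   subprocess.run(
#       ["bash", "scripts_release/evaluation/evaluation_7b_stage2.sh"],
#       env=eval_environment(ckpt),
#       check=True,
#   )
```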

	## Results (reported in project materials)

	| Benchmark | V-Reflection | Qwen2.5-VL-7B |
	|-----------|:------------:|:-------------:|
	| MMVP | 72.3 | 66.7 |
	| BLINK | 56.4 | 54.5 |
	| V* | 81.7 | 78.5 |
	| HRBench-4K | 72.6 | 68.0 |
	| HRBench-8K | 66.3 | 63.8 |
	| MME-RealWorld-Lite | 53.9 | 45.8 |
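To make the size of each improvement explicit, the short script below recomputes the absolute gain over the base model from the scores in the table. It is plain arithmetic on the reported numbers and introduces no new measurements. The variable and function names are illustrative.

```python
# Scores copied from the table above: (V-Reflection, Qwen2.5-VL-7B).
results = {
    "MMVP": (72.3, 66.7),
    "BLINK": (56.4, 54.5),
    "V*": (81.7, 78.5),
    "HRBench-4K": (72.6, 68.0),
    "HRBench-8K": (66.3, 63.8),
    "MME-RealWorld-Lite": (53.9, 45.8),
}


def absolute_gains(table):
    """Point difference per benchmark, rounded to one decimal place
    to match the precision of the reported scores."""
    return {name: round(ours - base, 1) for name, (ours, base) in table.items()}


gains = absolute_gains(results)
best = max(gains.items(), key=lambda kv: kv[1])

for name, gain in gains.items():
    print(f"{name}: +{gain}")
print("largest gain:", best)  # ('MME-RealWorld-Lite', 8.1)
```

The gains range from 1.9 points on BLINK to 8.1 points on MME-RealWorld-Lite.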

	## Limitations

	- Requires the official codebase for correct loading and latent reasoning behavior.
	- Performance and safety have not been audited for unrestricted production deployment; evaluate the model on your own data before using it in a real-world product.
	- Biases may mirror base model and training data distributions.

	## License

	Apache-2.0 (same as the codebase). See the [license file](https://github.com/IDEA-Research/V-Reflection/blob/main/LICENSE) in the GitHub repository.

	## Acknowledgements

	[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [LVR](https://github.com/VincentLeebang/lvr), [Visual-CoT](https://github.com/deepcs233/Visual-CoT), [InternVL](https://github.com/OpenGVLab/InternVL).