garlandchou committed commit 29ff98f (verified · parent 9f466f6)

Expand model card (metadata, summary, usage, benchmarks)

Files changed (1): README.md (+91 −3)
---
license: apache-2.0
language:
- en
tags:
- multimodal
- vision-language
- visual-reasoning
- latent-reasoning
- qwen2_5_vl
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---

# V-Reflection (7B)

**V-Reflection** is a Qwen2.5-VL-based multimodal model that turns the MLLM into an **active interrogator** through a *think-then-look* **visual reflection** mechanism: a fixed number of latent visual reasoning steps performed before answering.

| Resource | Link |
|----------|------|
| Project page | [idea-research.github.io/V-Reflection](https://idea-research.github.io/V-Reflection/) |
| Code & scripts | [github.com/IDEA-Research/V-Reflection](https://github.com/IDEA-Research/V-Reflection) |

## Model summary

- **Architecture:** Built on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) with **Box-Guided Compression (BCM)** and **Dynamic Autoregressive Compression (DAC)** for latent visual probing.
- **Stage 1 (BCM):** Box-guided compression produces stable pixel-to-latent targets; *stochastic decoupled alignment* trains the resampler and the LLM jointly.
- **Stage 2 (DAC):** A student maps LLM hidden states into dynamic probes over the global visual map, distilled via MSE from a frozen BCM teacher.
- **Inference:** BCM/DAC are inactive at inference; decoding is end-to-end autoregressive in latent space (8 latent reasoning steps by default).

<p align="center">
  <img src="https://raw.githubusercontent.com/IDEA-Research/V-Reflection/main/images/Framework.png" width="85%" alt="V-Reflection framework">
</p>

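The fixed-length latent loop summarized above can be sketched conceptually. The toy below shows only the control flow: a NumPy stand-in, with illustrative names (`latent_step`, `reflect_then_answer`) and shapes chosen for the example; it is not the official `QwenWithLVR` implementation.

```python
# Conceptual sketch of "think-then-look": a fixed number of latent
# reflection steps refine the hidden state against a visual map before
# any answer token is decoded. All names/shapes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def latent_step(hidden, visual_map):
    # Stand-in for one latent probe: attend over the visual map with
    # weights derived from the current hidden state, then mix back in.
    scores = visual_map @ hidden                 # (num_patches,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax attention
    return 0.5 * hidden + 0.5 * (weights @ visual_map)

def reflect_then_answer(hidden, visual_map, num_latent_steps=8):
    # Fixed-length latent reasoning: exactly num_latent_steps probes,
    # autoregressively chained, before decoding would begin.
    for _ in range(num_latent_steps):
        hidden = latent_step(hidden, visual_map)
    return hidden  # decoding starts from this refreshed state

hidden = rng.standard_normal(16)        # toy LLM hidden state
visual_map = rng.standard_normal((64, 16))  # toy global visual map
out = reflect_then_answer(hidden, visual_map)
```

The point of the sketch is that the reflection budget is fixed (8 steps by default) and happens entirely in latent space, independent of the answer length.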
## Repository contents

This Hub repo includes the **model weights** (sharded Safetensors) and the **pre-formatted LVR training annotations** used in the paper project:

| File(s) | Role |
|---------|------|
| `*.safetensors`, `config.json`, tokenizer assets | Fine-tuned **V-Reflection** checkpoint |
| `meta_data_lvr_sft_stage1.json` | Meta config for the default SROIE + DUDE mix |
| `viscot_sroie_dude_lvr_formatted.json` | SROIE + DUDE subset |
| `viscot_363k_lvr_formatted.json` | Full Visual CoT–style 363K split |

Download images for Visual CoT and the listed datasets from their official sources (see the [code repo data section](https://github.com/IDEA-Research/V-Reflection#data-preparation)).

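A quick way to sanity-check a downloaded annotation file before wiring it into training: the record schema is not documented in this card, so the helper below (an illustrative sketch, not part of the official tooling) only reports the record count and the top-level keys of the first record.

```python
# Inspect a formatted LVR annotation file without assuming its schema.
import json
from pathlib import Path

def inspect_annotations(path):
    # The *_lvr_formatted.json files are JSON arrays of records; we
    # report how many records exist and what keys the first one has.
    records = json.loads(Path(path).read_text())
    keys = sorted(records[0]) if records else []
    return len(records), keys
```

For example, `inspect_annotations("viscot_sroie_dude_lvr_formatted.json")` after downloading the file from this repo.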
## Intended use

- Research on **visual reasoning**, **high-resolution understanding**, and **latent multimodal reasoning**.
- **Not** a drop-in replacement for stock Qwen2.5-VL in generic chat UIs: the official **custom model class** and generation path are required.

## How to run

Loading and evaluation rely on the **`QwenWithLVR`** implementation and scripts in the GitHub repository (training, packing, and benchmarks are documented there).

1. Clone **[IDEA-Research/V-Reflection](https://github.com/IDEA-Research/V-Reflection)** and follow its `README.md` (environment, data layout).
2. Download this checkpoint into a local directory (or rely on `hf_hub_download` in the evaluation scripts).
3. Point `EVAL_CHECKPOINT_PATH` (or the training-script checkpoint arguments) to the unpacked folder.

Example evaluation entrypoint (from the upstream docs):

```bash
bash scripts_release/evaluation/evaluation_7b_stage2.sh
```

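Steps 2 and 3 above can be scripted. The sketch below uses `huggingface_hub.snapshot_download` (the standard Hub download API); the repo id string is a placeholder, not this model's confirmed Hub id, and the local-directory shortcut is an assumption for convenience.

```python
# Resolve a local checkpoint folder for the evaluation scripts.
import os

def resolve_checkpoint(repo_id, local_dir=None):
    # Prefer an already-unpacked local folder; otherwise fall back to
    # huggingface_hub (imported lazily, since it needs network access).
    if local_dir and os.path.isdir(local_dir):
        return local_dir
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id, local_dir=local_dir)

# "<this-model-repo>" is a placeholder for this checkpoint's Hub id.
ckpt = resolve_checkpoint("<this-model-repo>", local_dir=".")
os.environ["EVAL_CHECKPOINT_PATH"] = ckpt
```

With `EVAL_CHECKPOINT_PATH` exported, the `scripts_release` evaluation entrypoint can pick up the unpacked folder.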
## Results (reported in project materials)

| Benchmark | V-Reflection | Qwen2.5-VL-7B |
|-----------|:------------:|:-------------:|
| MMVP | **72.3** | 66.7 |
| BLINK | **56.4** | 54.5 |
| V* | **81.7** | 78.5 |
| HRBench-4K | **72.6** | 68.0 |
| HRBench-8K | **66.3** | 63.8 |
| MME-RealWorld-Lite | **53.9** | 45.8 |

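For a quick read of the table, the absolute-point gaps over the base model can be recomputed directly from the numbers above (this is arithmetic on the reported figures only, not an additional result from the project):

```python
# Per-benchmark gaps (V-Reflection minus Qwen2.5-VL-7B), copied from
# the table above.
pairs = {
    "MMVP": (72.3, 66.7),
    "BLINK": (56.4, 54.5),
    "V*": (81.7, 78.5),
    "HRBench-4K": (72.6, 68.0),
    "HRBench-8K": (66.3, 63.8),
    "MME-RealWorld-Lite": (53.9, 45.8),
}
deltas = {name: round(ours - base, 1) for name, (ours, base) in pairs.items()}
mean_delta = round(sum(deltas.values()) / len(deltas), 1)
```

The largest reported gap is on MME-RealWorld-Lite (+8.1 points); the mean gap across the six benchmarks is about +4.3.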
## Limitations

- Requires the **official codebase** for correct loading and latent-reasoning behavior.
- Performance and safety have **not** been audited for unrestricted production deployment; exercise judgment before using it in real-world products.
- Biases may mirror those of the base model and the training data distributions.

## License

Apache-2.0 (same as the codebase). See the [license file](https://github.com/IDEA-Research/V-Reflection/blob/main/LICENSE) in the GitHub repository.

## Acknowledgements

[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [LVR](https://github.com/VincentLeebang/lvr), [Visual-CoT](https://github.com/deepcs233/Visual-CoT), [InternVL](https://github.com/OpenGVLab/InternVL).