Add model card for CapImagine-7B

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +57 -3
README.md CHANGED
@@ -1,3 +1,57 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ base_model: Qwen/Qwen2.5-VL-7B-Instruct
+ tags:
+ - multimodal
+ - visual-reasoning
+ - qwen2.5-vl
+ ---
+
+ # CapImagine-7B
+
+ [**Imagination Helps Visual Reasoning, But Not Yet in Latent Space**](https://huggingface.co/papers/2602.22766)
+
+ CapImagine-7B is a multimodal large language model fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). It is designed to enhance visual reasoning by teaching the model to explicitly "imagine" visual transformations using text-space reasoning chains (captions) rather than abstract latent tokens.
+
+ ## Resources
+ - **Paper:** [Imagination Helps Visual Reasoning, But Not Yet in Latent Space](https://arxiv.org/abs/2602.22766)
+ - **Repository:** [GitHub - AI9Stars/CapImagine](https://github.com/AI9Stars/CapImagine)
+ - **Dataset:** [Michael4933/CapImagine-Data](https://huggingface.co/datasets/Michael4933/CapImagine-Data)
+
+ ## Model Description
+ The paper investigates the validity of *latent visual reasoning*, a paradigm in which visual reasoning is "mediated" through hidden latent states rather than explicit text. Using Causal Mediation Analysis, the authors found that:
+ 1. **Input-Latent Disconnect**: Changes to the input produce negligible changes in the latent tokens.
+ 2. **Latent-Answer Disconnect**: Changes to the latent tokens have negligible impact on the final answer.
+
+ Consequently, the authors propose **CapImagine**, which replaces complex latent-space mediators with explicit textual descriptions of visual changes. This approach significantly outperforms latent-space baselines on vision-centric benchmarks.
+
+ ## Usage
+ Since CapImagine-7B is based on the Qwen2.5-VL architecture, inference can be implemented using the official code and chat templates from [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct).
+
+ ```python
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Model loading follows the standard Qwen2.5-VL protocol
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     "Michael4933/CapImagine-7B", torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained("Michael4933/CapImagine-7B")
+ ```
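+
+ A full single-image generation call then follows the standard Qwen2.5-VL chat flow. The snippet below is a minimal sketch continuing from the `model` and `processor` loaded above; the image path and prompt are placeholders, and decoding parameters should be adjusted to your setup.
+
+ ```python
+ # Hypothetical example: placeholder image path and prompt.
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image", "image": "path/to/image.jpg"},
+         {"type": "text", "text": "Imagine the object rotated 90 degrees. What would it look like?"},
+     ],
+ }]
+
+ # Render the chat template, collect vision inputs, and preprocess.
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
+                    padding=True, return_tensors="pt").to(model.device)
+
+ # Generate and strip the prompt tokens before decoding.
+ generated_ids = model.generate(**inputs, max_new_tokens=512)
+ trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
+ print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
+ ```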
+
+ ## Citation
+ If you find this work useful, please use the following BibTeX:
+
+ ```bibtex
+ @misc{li2026imaginationhelpsvisualreasoning,
+       title={Imagination Helps Visual Reasoning, But Not Yet in Latent Space},
+       author={You Li and Chi Chen and Yanghao Li and Fanhu Zeng and Kaiyu Huang and Jinan Xu and Maosong Sun},
+       year={2026},
+       eprint={2602.22766},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2602.22766},
+ }
+ ```