Add model card for CapImagine-7B
#1
by nielsr - opened

README.md CHANGED

---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- multimodal
- visual-reasoning
- qwen2.5-vl
---

# CapImagine-7B

[**Imagination Helps Visual Reasoning, But Not Yet in Latent Space**](https://huggingface.co/papers/2602.22766)

CapImagine-7B is a multimodal large language model fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). It is designed to enhance visual reasoning by teaching the model to explicitly "imagine" visual transformations as text-space reasoning chains (captions) rather than as abstract latent tokens.

## Resources

- **Paper:** [Imagination Helps Visual Reasoning, But Not Yet in Latent Space](https://arxiv.org/abs/2602.22766)
- **Repository:** [GitHub - AI9Stars/CapImagine](https://github.com/AI9Stars/CapImagine)
- **Dataset:** [Michael4933/CapImagine-Data](https://huggingface.co/datasets/Michael4933/CapImagine-Data)

## Model Description

The paper investigates the validity of *latent visual reasoning*, a paradigm in which hidden states are supposed to mediate the reasoning process. Using Causal Mediation Analysis, the authors find two disconnects (sketched after this list):

1. **Input-Latent Disconnect**: Changes to the input produce only negligible changes in the latent tokens.
2. **Latent-Answer Disconnect**: Changes to the latent tokens have only negligible impact on the final answer.
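
A minimal sketch of what these two interventions look like operationally, assuming a latent-reasoning baseline that exposes its latent tokens; the hooks `encode_latent_tokens` and `generate_answer` are hypothetical stand-ins for illustration, not an actual API from the paper or this repository.

```python
import torch

def input_latent_sensitivity(model, inputs, perturbed_inputs):
    """Finding 1: compare the latent tokens produced for an input against those
    produced for a semantically perturbed version of the same input. High
    similarity despite the perturbation means the latents ignore the input."""
    z = model.encode_latent_tokens(inputs)                  # hypothetical hook
    z_perturbed = model.encode_latent_tokens(perturbed_inputs)
    return torch.cosine_similarity(z, z_perturbed, dim=-1).mean()

def latent_answer_sensitivity(model, inputs, foreign_latents):
    """Finding 2: swap in latent tokens taken from an unrelated example and
    check whether the final answer changes. An unchanged answer means the
    answer does not causally depend on the latents."""
    original = model.generate_answer(inputs)                # hypothetical hook
    intervened = model.generate_answer(inputs, latents=foreign_latents)
    return original == intervened
```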

Consequently, the authors propose **CapImagine**, which replaces complex latent-space mediators with explicit textual descriptions of the visual changes being reasoned about. This approach significantly outperforms latent-space baselines on vision-centric benchmarks.
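
To make the text-space "imagination" concrete, here is a purely hypothetical trace of the intended behavior; the actual prompt and trace format come from the CapImagine-Data training set linked above, not from this illustration:

```text
Question:  If the red cube is pushed behind the blue sphere, what remains visible?
Imagining: After the push, the red cube sits directly behind the sphere; only its
           top edge protrudes above the sphere's silhouette.
Answer:    The blue sphere and the top edge of the red cube.
```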

## Usage

Since CapImagine-7B is based on the Qwen2.5-VL architecture, inference can be implemented with the official code and chat templates from [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct).

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper for preparing image/video inputs

# Model loading follows the standard Qwen2.5-VL protocol.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Michael4933/CapImagine-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Michael4933/CapImagine-7B")
```
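
Building on the snippet above, a minimal end-to-end inference sketch that follows the standard Qwen2.5-VL chat-template flow; the image URL and question below are placeholders, not anything CapImagine-specific:

```python
# Continues from the loading snippet above, following the standard
# Qwen2.5-VL inference flow. The image URL and question are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/scene.jpg"},
            {
                "type": "text",
                "text": "If the red cube is pushed behind the blue sphere, what remains visible?",
            },
        ],
    }
]

# Render the chat template and gather the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate; the model is trained to verbalize the imagined change before answering.
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```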

## Citation

If you find this work useful, please use the following BibTeX:

```bibtex
@misc{li2026imaginationhelpsvisualreasoning,
      title={Imagination Helps Visual Reasoning, But Not Yet in Latent Space},
      author={You Li and Chi Chen and Yanghao Li and Fanhu Zeng and Kaiyu Huang and Jinan Xu and Maosong Sun},
      year={2026},
      eprint={2602.22766},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.22766},
}
```