---
license: mit
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# Introduction

This is the pretrained model for the paper "**Monet: Reasoning in Latent Visual Space Beyond Images and Language**".

**Paper:** http://arxiv.org/abs/2511.21395

**Code:** https://github.com/NOVAglow646/Monet

**How to use this model:** we provide an [inference example](https://github.com/NOVAglow646/Monet/blob/main/inference/vllm_inference_example.py) in our GitHub repo.

# Citation

If you find this work useful, please cite it with the following BibTeX. Thank you for your support!

```bibtex
@misc{wang2025monetreasoninglatentvisual,
  title={Monet: Reasoning in Latent Visual Space Beyond Images and Language},
  author={Qixun Wang and Yang Shi and Yifei Wang and Yuanxing Zhang and Pengfei Wan and Kun Gai and Xianghua Ying and Yisen Wang},
  year={2025},
  eprint={2511.21395},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.21395},
}
```
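For orientation, here is a minimal vLLM inference sketch. This is an illustrative assumption, not the official script (see the linked `vllm_inference_example.py` for the authoritative version): the model repo id `NOVAglow646/Monet` and the input image path are hypothetical placeholders, and the prompt layout follows the standard Qwen2.5-VL chat template, which the base model uses.

```python
def build_prompt(question: str) -> str:
    """Build a Qwen2.5-VL-style chat prompt with one image placeholder.

    The special tokens follow the Qwen2.5-VL chat template; adjust if the
    official Monet inference script uses a different template.
    """
    return (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
        f"{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    # Heavy dependencies are only needed for the actual generation step.
    from PIL import Image
    from vllm import LLM, SamplingParams

    # Hypothetical model repo id and image path; replace with your own.
    llm = LLM(model="NOVAglow646/Monet", trust_remote_code=True)
    image = Image.open("example.png")
    params = SamplingParams(temperature=0.0, max_tokens=512)

    outputs = llm.generate(
        {
            "prompt": build_prompt("Describe the geometric relations in this figure."),
            "multi_modal_data": {"image": image},
        },
        sampling_params=params,
    )
    print(outputs[0].outputs[0].text)
```

This mirrors vLLM's multimodal generation interface, where the image is passed alongside the prompt via `multi_modal_data`; consult the repo's example for the exact sampling settings used in the paper.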