---
license: mit
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# Introduction
This is the pretrained model for the paper "**Monet: Reasoning in Latent Visual Space Beyond Images and Language**".

**Paper:** http://arxiv.org/abs/2511.21395

**Code:** https://github.com/NOVAglow646/Monet

**How to use this model:** We provide an [inference example](https://github.com/NOVAglow646/Monet/blob/main/inference/vllm_inference_example.py) in our GitHub repo.
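For a quick start without vLLM, the sketch below shows how the model might be loaded and queried through `transformers`, following the standard Qwen2.5-VL usage pattern of the base model. The model ID placeholder, image path, and question are assumptions for illustration; see the linked inference example for the authors' recommended setup.

```python
from typing import Any


def build_messages(image_path: str, question: str) -> list[dict[str, Any]]:
    """Build a Qwen2.5-VL-style chat message: one user turn with an image and a text query."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def run_inference(model_id: str, image_path: str, question: str) -> str:
    """Minimal transformers-based inference sketch (requires a GPU for the 7B model).

    `model_id` is a placeholder for this model's Hugging Face repo ID.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = build_messages(image_path, question)
    # Render the chat template, then tokenize text together with the image.
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[prompt], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens before decoding the answer.
    answer_ids = generated[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(answer_ids, skip_special_tokens=True)[0]
```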

# Citation
If you find this work useful, please cite it with the following BibTeX entry. Thank you for your support!

```bibtex
@misc{wang2025monetreasoninglatentvisual,
      title={Monet: Reasoning in Latent Visual Space Beyond Images and Language}, 
      author={Qixun Wang and Yang Shi and Yifei Wang and Yuanxing Zhang and Pengfei Wan and Kun Gai and Xianghua Ying and Yisen Wang},
      year={2025},
      eprint={2511.21395},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21395}, 
}
```