---
license: mit
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---
# Introduction
This is the pretrained model for the paper "**Monet: Reasoning in Latent Visual Space Beyond Images and Language**".

**Paper:** https://arxiv.org/abs/2511.21395

**Code:** https://github.com/NOVAglow646/Monet

**How to use this model:** We provide an [inference example](https://github.com/NOVAglow646/Monet/blob/main/inference/vllm_inference_example.py) in our GitHub repo.
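For a quick start without vLLM, the snippet below is a minimal sketch of standard Qwen2.5-VL inference with `transformers`, since this checkpoint inherits the base model's architecture. The `model_id` default is the base model as a placeholder; substitute this repo's id. The linked vLLM script in the GitHub repo remains the authoritative example.

```python
def build_messages(image_path: str, question: str) -> list:
    """Build a single-turn Qwen2.5-VL chat message with one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def run_inference(
    image_path: str,
    question: str,
    model_id: str = "Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder: point at this checkpoint's repo id
    max_new_tokens: int = 512,
) -> str:
    """Run one round of image-text-to-text generation with the Qwen2.5-VL stack.

    Imports are local so the message helper above stays dependency-free.
    """
    import torch
    from qwen_vl_utils import process_vision_info  # helper shipped with the Qwen2.5-VL examples
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    messages = build_messages(image_path, question)
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens before decoding so only the model's answer is returned.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```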
# Citation
If you find this work useful, please cite it with the following BibTeX entry. Thank you for your support!
```bibtex
@misc{wang2025monetreasoninglatentvisual,
  title={Monet: Reasoning in Latent Visual Space Beyond Images and Language},
  author={Qixun Wang and Yang Shi and Yifei Wang and Yuanxing Zhang and Pengfei Wan and Kun Gai and Xianghua Ying and Yisen Wang},
  year={2025},
  eprint={2511.21395},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.21395},
}
```