---
license: mit
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# Introduction
This is the pretrained model for the paper "**Monet: Reasoning in Latent Visual Space Beyond Images and Language**".

**Paper:** http://arxiv.org/abs/2511.21395

**Code:** https://github.com/NOVAglow646/Monet

**How to use this model:** We provide an [inference example](https://github.com/NOVAglow646/Monet/blob/main/inference/vllm_inference_example.py) in our GitHub repo.

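The repository's example script uses vLLM; as an illustration only, here is a minimal sketch of loading the checkpoint through the standard `transformers` Qwen2.5-VL classes instead. It assumes the Monet checkpoint loads with the stock Qwen2.5-VL processor and model classes, and the model ID, image URL, and generation settings are placeholders — see the linked script for the authoritative usage.

```python
def build_messages(image: str, question: str) -> list:
    """Assemble a Qwen2.5-VL-style chat turn with one image and one question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ],
        }
    ]


def run(model_id: str, image: str, question: str) -> str:
    """Load the checkpoint and answer one visual question (requires a GPU)."""
    # Heavy imports are kept local so build_messages works without them.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info  # extracts images from messages

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = build_messages(image, question)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    images, _ = process_vision_info(messages)
    inputs = processor(text=[text], images=images, return_tensors="pt").to(model.device)

    out = model.generate(**inputs, max_new_tokens=256)
    out = out[:, inputs.input_ids.shape[1]:]  # keep only newly generated tokens
    return processor.batch_decode(out, skip_special_tokens=True)[0]
```

This mirrors the usual Qwen2.5-VL inference flow (chat template → vision preprocessing → `generate`); for batched or high-throughput inference, prefer the vLLM script linked above.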
# Citation
If you find this work useful, please cite it with the following BibTeX entry. Thank you for your support!

```bibtex
@misc{wang2025monetreasoninglatentvisual,
      title={Monet: Reasoning in Latent Visual Space Beyond Images and Language},
      author={Qixun Wang and Yang Shi and Yifei Wang and Yuanxing Zhang and Pengfei Wan and Kun Gai and Xianghua Ying and Yisen Wang},
      year={2025},
      eprint={2511.21395},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21395},
}
```