Add comprehensive model card for Monet-7B
This PR adds a comprehensive model card for the Monet-7B model, linking it to the paper [Monet: Reasoning in Latent Visual Space Beyond Images and Language](https://huggingface.co/papers/2511.21395).
It includes:
- Essential metadata: `license: cc-by-nc-4.0`, `library_name: transformers`, and `pipeline_tag: image-text-to-text`.
- Links to the official paper and the GitHub repository.
- An overview image from the GitHub repository.
- A sample usage code snippet demonstrating how to use the model with the `transformers` library, noting that `trust_remote_code=True` is required due to custom components.
- The BibTeX citation for the paper.
This update should make the model easier to discover and use for researchers and developers on the Hugging Face Hub.
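The metadata listed above sits in a YAML front-matter block at the top of the model card. As a minimal sketch (standard library only; the helper name `parse_front_matter` is hypothetical, and it handles only flat `key: value` pairs rather than general YAML), the three fields could be read back and checked like this:

```python
# Illustrative sketch: read flat `key: value` front matter from a model card.
# `parse_front_matter` is a hypothetical helper, not part of any library.

CARD = """---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Monet: Reasoning in Latent Visual Space Beyond Images and Language
"""


def parse_front_matter(text: str) -> dict[str, str]:
    """Return the flat key/value pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return metadata  # closing fence reached
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return {}  # no closing fence: treat the card as having no front matter


metadata = parse_front_matter(CARD)
# Fields this PR says it adds; anything left in `missing` would be absent.
missing = {"license", "library_name", "pipeline_tag"} - set(metadata)
print(metadata)
print("missing fields:", missing)
```

A real check would more likely use a full YAML parser; the flat parser here only keeps the sketch dependency-free.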
@@ -0,0 +1,91 @@
+---
+license: cc-by-nc-4.0
+library_name: transformers
+pipeline_tag: image-text-to-text
+---
+
+# Monet: Reasoning in Latent Visual Space Beyond Images and Language
+
+**Monet** is a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. It aims to achieve human-like abstract visual thinking, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps.
+
+This model is introduced in the paper:
+[**Monet: Reasoning in Latent Visual Space Beyond Images and Language**](https://huggingface.co/papers/2511.21395)
+
+<p align="center">
+<img src="https://github.com/NOVAglow646/Monet/raw/main/images/overview.png" alt="Monet Overview" width="700">
+</p>
+
+## Installation and Code
+The official implementation, training scripts, and further details can be found on the project's GitHub repository:
+[https://github.com/NOVAglow646/Monet](https://github.com/NOVAglow646/Monet)
+
+To set up the environment, please refer to the installation instructions in the GitHub repository. Note that the model uses customized `Qwen2.5-VL-7B` components, requiring specific modifications as detailed in the repository.
+
+## Usage
+The Monet-7B model can be loaded and used with the Hugging Face `transformers` library. Due to custom model components, `trust_remote_code=True` is required.
+
+```python
+import torch
+from PIL import Image
+from transformers import AutoProcessor, AutoModelForImageTextToText
+
+# Load the model and processor
+model_id = "NOVAglow646/Monet-7B"  # Replace with the actual model repository name if different
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForImageTextToText.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+
+# Example: Image Understanding (Image-to-Text)
+# Replace "path/to/your/image.png" with the path to your own image file,
+# for example: image = Image.open("your_image.png").
+# Use a full path if the image is not in the current working directory.
+try:
+    image = Image.open("path/to/your/image.png")  # Placeholder path
+except FileNotFoundError:
+    print("Please replace 'path/to/your/image.png' with a valid image file path.")
+    raise SystemExit(1)
+
+# Prepare the chat messages for image understanding
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": image},
+            {"type": "text", "text": "Describe the image in detail."},
+        ],
+    }
+]
+
+# Apply the chat template and process inputs
+prompt_text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+inputs = processor(text=prompt_text, images=image, return_tensors="pt").to(model.device)
+
+# Generate output
+output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
+
+# Decode only the newly generated tokens (the output also contains the prompt) and print the response
+response = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
+print(response)
+
+# This snippet covers image understanding only.
+# Refer to the GitHub repository for more advanced usage examples and demos,
+# especially regarding latent reasoning with `<abs_vis_token>`.
+```
+
+## Citation
+If you find our work helpful or inspiring, please feel free to cite it:
+```bibtex
+@misc{wang2025monetreasoninglatentvisual,
+      title={Monet: Reasoning in Latent Visual Space Beyond Images and Language},
+      author={Qixun Wang and Yang Shi and Yifei Wang and Yuanxing Zhang and Pengfei Wan and Kun Gai and Xianghua Ying and Yisen Wang},
+      year={2025},
+      eprint={2511.21395},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2511.21395},
+}
+```
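A note on the decoding step in the usage snippet: for decoder-only models, `generate` returns each sequence as the prompt tokens followed by the newly generated ones, so isolating the response means slicing off the prompt length before decoding. The toy sketch below shows the idea with plain Python lists; the token IDs and the helper name `strip_prompt` are made up for illustration, and real code would apply the same slice to the output tensor.

```python
# Toy illustration of removing the prompt prefix from generated sequences.
# `strip_prompt` and all token IDs here are hypothetical, for illustration only.

def strip_prompt(output_ids: list[list[int]], prompt_len: int) -> list[list[int]]:
    """Drop the first `prompt_len` tokens from every sequence in the batch."""
    return [sequence[prompt_len:] for sequence in output_ids]


prompt_ids = [101, 7, 8, 9]                 # stand-in for the tokenized prompt
generated = [[101, 7, 8, 9, 42, 43, 2]]     # prompt tokens followed by new tokens
new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```

Decoding `new_tokens` instead of the full sequences keeps the chat-template text out of the printed response.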