---
library_name: transformers
pipeline_tag: image-text-to-text
license: other
---

**Moondream 3 (Preview)** is a vision language model with a mixture-of-experts architecture (9B total parameters, 2B active). It makes no compromises, delivering state-of-the-art visual reasoning while retaining our efficient, deployment-friendly ethos.

[✨ Demo](https://moondream.ai/c/playground)   ·   [☁️ Cloud API](https://moondream.ai/c/docs/quickstart)   ·   [📝 Release notes](https://moondream.ai/blog/moondream-3-preview)

![](https://huggingface.co/moondream/moondream3-preview/resolve/main/open_vocab_detect.png)
![](https://huggingface.co/moondream/moondream3-preview/resolve/main/visual_reasoning.png)
![](https://huggingface.co/moondream/moondream3-preview/resolve/main/point_count.png)
![](https://huggingface.co/moondream/moondream3-preview/resolve/main/structured_outputs.png)

## Architecture

1. 24 layers; the first four are dense, and the rest use MoE FFNs with 64 experts, 8 activated per token
2. The MoE FFNs use a GeGLU architecture with an inner/gate dim of 1024; the model's hidden dim is 2048
3. Usable context length increased to 32K, with [a custom efficient SuperBPE tokenizer](https://huggingface.co/moondream/starmie-v1)
4. Multi-headed attention with learned position- and data-dependent temperature scaling
5. SigLIP-based vision encoder with multi-crop channel concatenation for token-efficient high-resolution image processing
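To make the "64 experts, 8 activated per token" routing concrete, here is a minimal, illustrative sketch of top-k expert routing as commonly implemented in MoE layers. This is not Moondream's actual code; in particular, normalizing the softmax over only the selected experts is an assumption for illustration.

```python
import math

NUM_EXPERTS = 64  # total experts per MoE layer
TOP_K = 8         # experts activated per token

def route(router_logits):
    """Pick the top-k experts for one token.

    `router_logits` is a list of NUM_EXPERTS router scores; returns
    (expert_index, weight) pairs, with weights softmax-normalized
    over the selected experts only (an assumption, see above).
    """
    topk = sorted(range(NUM_EXPERTS),
                  key=lambda i: router_logits[i], reverse=True)[:TOP_K]
    m = max(router_logits[i] for i in topk)  # subtract max for stability
    exps = {i: math.exp(router_logits[i] - m) for i in topk}
    z = sum(exps.values())
    return [(i, exps[i] / z) for i in topk]

# The token's FFN output is then the weighted sum of its selected experts:
#   y = sum(w * expert[i](x) for i, w in route(logits))
```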

For more details, please refer to the [release notes](https://moondream.ai/blog/moondream-3-preview), or try the model out in our [playground demo](https://moondream.ai/c/playground).

The following instructions demonstrate how to run the model locally using Transformers. We also offer a [cloud API](https://moondream.ai/c/docs/quickstart) with a generous free tier that can help you get started quicker!

## Usage

Load the model and prepare it for inference. We use [FlexAttention for inference](https://pytorch.org/blog/flexattention-for-inference/), so calling `.compile()` is critical for fast decoding. Our `compile` implementation also handles warmup, so you can start making requests directly once it returns.

```python
import torch
from transformers import AutoModelForCausalLM

moondream = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()
```

The model comes with four skills, tailored towards different visual understanding tasks.

### Query

The `query` skill can be used to ask open-ended questions about images.

```python
from PIL import Image

# Simple VQA
image = Image.open("photo.jpg")
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])
```

By default, `query` runs in reasoning mode, allowing the model to "think" about the question before generating an answer. This is helpful for more complicated tasks, but sometimes the task you're running is simple and doesn't benefit from reasoning. To save on inference cost when this is the case, you can disable reasoning:

```python
# Without reasoning for simple questions
result = moondream.query(
    image=image, 
    question="What color is the sky?",
    reasoning=False
)
print(result["answer"])
```

If you want to stream outputs, pass in `stream=True`. You can control the temperature, top-p, and maximum number of tokens generated by passing in optional settings.

```python
# Streaming with custom settings
settings = {
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 512
}

result = moondream.query(
    image=image,
    question="Describe what's happening in detail",
    stream=True,
    settings=settings
)

# Stream the answer
for chunk in result["answer"]:
    print(chunk, end="", flush=True)
```

Note that this isn't just for images; Moondream is also a strong general-purpose text model.

```python
# Text-only example (no image)
result = moondream.query(
    question="Explain the concept of machine learning in simple terms"
)
print(result["answer"])
```

### Caption

Whether you want short, normal, or long descriptions of images, the `caption` skill has you covered.

```python
# Different caption lengths
image = Image.open("landscape.jpg")

# Short caption
short = moondream.caption(image, length="short")
print(f"Short: {short['caption']}")

# Normal caption (default)
normal = moondream.caption(image, length="normal")
print(f"Normal: {normal['caption']}")

# Long caption
long = moondream.caption(image, length="long")
print(f"Long: {long['caption']}")
```

It accepts the same streaming and sampling settings (`stream`, `temperature`, `top_p`, `max_tokens`) as the `query` skill.

```python
# Streaming caption with custom settings
result = moondream.caption(
    image,
    length="long",
    stream=True,
    settings={"temperature": 0.3}
)

for chunk in result["caption"]:
    print(chunk, end="", flush=True)
```

### Point

The `point` skill identifies specific points (x, y coordinates) for objects in an image.

```python
# Find points for specific objects
image = Image.open("crowd.jpg")
result = moondream.point(image, "person wearing a red shirt")

# Points are normalized coordinates (0-1)
for i, point in enumerate(result["points"]):
    print(f"Point {i+1}: x={point['x']:.3f}, y={point['y']:.3f}")
```
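Since the coordinates are normalized, you'll typically want to scale them by the image dimensions before drawing or cropping. A small helper sketch (`to_pixels` is a hypothetical name, not part of the model's API):

```python
def to_pixels(point, width, height):
    """Convert a normalized point dict {'x': ..., 'y': ...} to pixel coords."""
    return round(point["x"] * width), round(point["y"] * height)

# e.g. on a 1920x1080 image:
# px, py = to_pixels({"x": 0.5, "y": 0.25}, 1920, 1080)  # -> (960, 270)
```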

### Detect

The `detect` skill provides bounding boxes for objects in an image.

```python
# Detect objects with bounding boxes
image = Image.open("street_scene.jpg")
result = moondream.detect(image, "car")

# Bounding boxes are normalized coordinates (0-1)
for i, obj in enumerate(result["objects"]):
    print(f"Object {i+1}: "
          f"x_min={obj['x_min']:.3f}, y_min={obj['y_min']:.3f}, "
          f"x_max={obj['x_max']:.3f}, y_max={obj['y_max']:.3f}")

# Control maximum number of objects
settings = {"max_objects": 10}
result = moondream.detect(image, "person", settings=settings)
```
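A common next step is visualizing the detections. Here's a small sketch that scales the normalized boxes to pixel coordinates and draws them with Pillow (`draw_boxes` is a hypothetical helper, not part of the model's API):

```python
from PIL import Image, ImageDraw

def draw_boxes(image, objects, color="red", width=3):
    """Draw normalized bounding boxes onto a copy of the image."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for obj in objects:
        box = (obj["x_min"] * w, obj["y_min"] * h,
               obj["x_max"] * w, obj["y_max"] * h)
        draw.rectangle(box, outline=color, width=width)
    return out

# annotated = draw_boxes(image, result["objects"])
# annotated.save("annotated.jpg")
```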

### Caching image encodings (advanced)

If you're planning to run multiple inferences on the same image, you can pre-encode it once and reuse the encoding for better performance.

```python
# Encode image once
image = Image.open("complex_scene.jpg")
encoded = moondream.encode_image(image)

# Reuse the encoding for multiple queries
questions = [
    "How many people are in this image?",
    "What time of day was this taken?",
    "What's the weather like?"
]

for q in questions:
    result = moondream.query(image=encoded, question=q, reasoning=False)
    print(f"Q: {q}")
    print(f"A: {result['answer']}\n")

# Also works with other skills
caption = moondream.caption(encoded, length="normal")
objects = moondream.detect(encoded, "vehicle")
```

---

Copyright (c) 2025 M87 Labs, Inc.

This distribution includes Model Weights licensed under the [Business Source License 1.1 with an Additional Use Grant (No Third-Party Service)](https://huggingface.co/moondream/moondream3-preview/blob/main/LICENSE.md).

TL;DR — You can use Moondream 3 (Preview) freely for personal, research, and most commercial uses. What’s NOT allowed without a separate deal is offering a paid product that competes with M87 Labs’ paid versions (e.g., selling hosted or embedded access to the model’s capabilities to third parties).

What’s allowed (no special agreement needed):

* Internal use at your company, including production use (affiliates count as the same org).
* Personal projects, research, benchmarks, fine-tunes, merges, quantizations, weight deltas.
* Using the model inside your product when it does not substantially overlap with M87’s paid offerings.
* Free/zero-price services (free community demos, noncommercial tools).

What requires an agreement (because it competes with M87’s paid offerings):

* Selling a hosted API that exposes similar vision or general AI capabilities.
* Managed hosting or “Moondream-as-a-service” for customers.
* Embedding the weights/code in a paid SDK or appliance that delivers comparable capabilities.
* B2B offerings for computer vision, data labeling, or generic AI APIs that meaningfully overlap with M87’s paid versions.

**Examples**

Allowed:
* “Run it on our servers for employees across our company.” ✔
* “Use it in our consumer photo app to auto-tag images.” ✔ (not a competing B2B offering)
* “Publish a free demo site for the community.” ✔

Requires an agreement:
* “Sell a computer-vision API to enterprise customers.” ✖
* “Offer managed hosting of this model for other companies.” ✖
* “Ship a paid SDK that bundles these weights for third-party apps.” ✖

This summary is for convenience only. The Business Source License 1.1 and the Additional Use Grant in the repository control. Questions or commercial licensing: contact@m87.ai.