	---
	library_name: transformers
	pipeline_tag: image-text-to-text
	license: other
	---

	Moondream 3 (Preview) is a vision language model with a mixture-of-experts architecture (9B total parameters, 2B active). This model makes no compromises, delivering state-of-the-art visual reasoning while still retaining our efficient and deployment-friendly ethos.

	[✨ Demo](https://moondream.ai/c/playground)   ·   [☁️ Cloud API](https://moondream.ai/c/docs/quickstart)   ·   [📝 Release notes](https://moondream.ai/blog/moondream-3-preview)

	![](https://huggingface.co/moondream/moondream3-preview/resolve/main/open_vocab_detect.png)
	![](https://huggingface.co/moondream/moondream3-preview/resolve/main/visual_reasoning.png)
	![](https://huggingface.co/moondream/moondream3-preview/resolve/main/point_count.png)
	![](https://huggingface.co/moondream/moondream3-preview/resolve/main/structured_outputs.png)

	## Architecture

	1. 24 layers; the first four are dense, the rest have MoE FFNs with 64 experts, 8 activated per token
	2. MoE FFNs have GeGLU architecture, with inner/gate dim of 1024. The model's hidden dim is 2048.
	3. Usable context length increased to 32K, with [a custom efficient SuperBPE tokenizer](https://huggingface.co/moondream/starmie-v1)
	4. Multi-headed attention with learned position- and data-dependent temperature scaling
	5. SigLIP-based vision encoder, with multi-crop channel concatenation for token-efficient high resolution image processing
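As a rough sanity check on the figures above, the expert FFN weights can be counted directly. This is a back-of-the-envelope sketch, not the model's actual accounting: it assumes each GeGLU expert has three bias-free weight matrices (gate, up, and down projections), each of size hidden dim × inner dim, and it ignores attention, embeddings, the router, the four dense layers, and the vision encoder.

```python
# Back-of-the-envelope count of MoE expert weights, using the numbers listed above.
# Assumption (not stated in the source): each GeGLU expert has three bias-free
# matrices (gate, up, down), each of size HIDDEN_DIM x INNER_DIM.
HIDDEN_DIM = 2048
INNER_DIM = 1024
NUM_LAYERS = 24
NUM_DENSE_LAYERS = 4
NUM_EXPERTS = 64
ACTIVE_EXPERTS = 8

params_per_expert = 3 * HIDDEN_DIM * INNER_DIM
moe_layers = NUM_LAYERS - NUM_DENSE_LAYERS

# All experts contribute to the total; only the routed ones are used per token.
total_expert_params = moe_layers * NUM_EXPERTS * params_per_expert
active_expert_params = moe_layers * ACTIVE_EXPERTS * params_per_expert

print(f"Expert weights, total:  {total_expert_params / 1e9:.2f}B")
print(f"Expert weights, active: {active_expert_params / 1e9:.2f}B")
```

Under these assumptions, the experts account for roughly 8B of the 9B total parameters and about 1B of the 2B active per token; the remainder would come from the components left out of the count.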

	For more details, please refer to the [release notes](https://moondream.ai/blog/moondream-3-preview), or try the model out in our [playground demo](https://moondream.ai/c/playground).

	The following instructions demonstrate how to run the model locally using Transformers. We also offer a [cloud API](https://moondream.ai/c/docs/quickstart) with a generous free tier that can help you get started quicker!

	## Usage

	Load the model and prepare it for inference. We use [FlexAttention for inference](https://pytorch.org/blog/flexattention-for-inference/), so calling `.compile()` is critical for fast decoding. Our `compile` implementation also handles warmup, so you can start making requests directly once it returns.

	```python
	import torch
	from transformers import AutoModelForCausalLM

	moondream = AutoModelForCausalLM.from_pretrained(
	    "moondream/moondream3-preview",
	    trust_remote_code=True,
	    dtype=torch.bfloat16,
	    device_map={"": "cuda"},
	)
	moondream.compile()
	```

	The model comes with four skills, tailored towards different visual understanding tasks.

	### Query

	The `query` skill can be used to ask open-ended questions about images.

	```python
	from PIL import Image

	# Simple VQA
	image = Image.open("photo.jpg")
	result = moondream.query(image=image, question="What's in this image?")
	print(result["answer"])
	```

	By default, `query` runs in reasoning mode, allowing the model to "think" about the question before generating an answer. This is helpful for more complicated tasks, but sometimes the task you're running is simple and doesn't benefit from reasoning. To save on inference cost when this is the case, you can disable reasoning:

	```python
	# Without reasoning for simple questions
	result = moondream.query(
	    image=image,
	    question="What color is the sky?",
	    reasoning=False,
	)
	print(result["answer"])
	```

	If you want to stream outputs, pass in `stream=True`. You can control the temperature, top-p, and maximum number of tokens generated by passing in optional settings.

	```python
	# Streaming with custom settings
	settings = {
	    "temperature": 0.7,
	    "top_p": 0.95,
	    "max_tokens": 512,
	}

	result = moondream.query(
	    image=image,
	    question="Describe what's happening in detail",
	    stream=True,
	    settings=settings,
	)

	# Stream the answer
	for chunk in result["answer"]:
	    print(chunk, end="", flush=True)
	```
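If you also need the full text once streaming finishes, the chunks can be accumulated as they are printed. A minimal sketch, assuming the streamed chunks are plain strings, as the loop above suggests; `collect_stream` is a hypothetical helper and not part of the Moondream API.

```python
from typing import Iterable


def collect_stream(chunks: Iterable[str], echo: bool = True) -> str:
    """Consume text chunks, optionally printing each one, and return the full text."""
    parts = []
    for chunk in chunks:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in generator for demonstration; with the model you would pass result["answer"].
def fake_stream():
    yield from ["A dog ", "is running ", "on the beach."]


full_text = collect_stream(fake_stream())
```

The same helper should work for `caption` by passing `result["caption"]` instead.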

	The `query` skill isn't limited to images; Moondream is also a strong general-purpose text model, so you can omit the `image` argument.

	```python
	# Text-only example (no image)
	result = moondream.query(
	    question="Explain the concept of machine learning in simple terms"
	)
	print(result["answer"])
	```

	### Caption

	Whether you want short, normal-sized or long descriptions of images, the `caption` skill has you covered.

	```python
	# Different caption lengths
	image = Image.open("landscape.jpg")

	# Short caption
	short = moondream.caption(image, length="short")
	print(f"Short: {short['caption']}")

	# Normal caption (default)
	normal = moondream.caption(image, length="normal")
	print(f"Normal: {normal['caption']}")

	# Long caption
	long = moondream.caption(image, length="long")
	print(f"Long: {long['caption']}")
	```

	It accepts the same `stream` option and sampling `settings` (temperature, etc.) as the `query` skill.

	```python
	# Streaming caption with custom settings
	result = moondream.caption(
	    image,
	    length="long",
	    stream=True,
	    settings={"temperature": 0.3},
	)

	for chunk in result["caption"]:
	    print(chunk, end="", flush=True)
	```

	### Point

	The `point` skill identifies specific points (x, y coordinates) for objects in an image.

	```python
	# Find points for specific objects
	image = Image.open("crowd.jpg")
	result = moondream.point(image, "person wearing a red shirt")

	# Points are normalized coordinates (0-1)
	for i, point in enumerate(result["points"]):
	    print(f"Point {i+1}: x={point['x']:.3f}, y={point['y']:.3f}")
	```
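Because the coordinates are normalized, they need to be scaled by the image size before drawing or cropping. A minimal sketch, assuming each point is a dict with `x` and `y` keys as in the loop above; `to_pixel_points` is a hypothetical helper, and the sample values are made up for illustration.

```python
def to_pixel_points(points, width, height):
    """Scale normalized (0-1) points to integer pixel coordinates."""
    pixels = []
    for p in points:
        # Clamp to the last valid pixel index so x=1.0 does not land off the image.
        px = min(int(p["x"] * width), width - 1)
        py = min(int(p["y"] * height), height - 1)
        pixels.append((px, py))
    return pixels


# Illustrative values only; with the model, use result["points"] and image.size.
sample_points = [{"x": 0.25, "y": 0.5}, {"x": 1.0, "y": 0.0}]
pixel_points = to_pixel_points(sample_points, width=640, height=480)
print(pixel_points)
print(f"Count: {len(pixel_points)}")
```

With a PIL image, `width, height = image.size` supplies the dimensions. The length of the returned list doubles as a simple object count.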

	### Detect

	The `detect` skill provides bounding boxes for objects in an image.

	```python
	# Detect objects with bounding boxes
	image = Image.open("street_scene.jpg")
	result = moondream.detect(image, "car")

	# Bounding boxes are normalized coordinates (0-1)
	for i, obj in enumerate(result["objects"]):
	    print(f"Object {i+1}: "
	          f"x_min={obj['x_min']:.3f}, y_min={obj['y_min']:.3f}, "
	          f"x_max={obj['x_max']:.3f}, y_max={obj['y_max']:.3f}")

	# Control maximum number of objects
	settings = {"max_objects": 10}
	result = moondream.detect(image, "person", settings=settings)
	```
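To crop or draw the detections, the normalized boxes can be converted to pixel boxes. A minimal sketch, assuming each object is a dict with the `x_min`, `y_min`, `x_max`, and `y_max` keys shown above; `to_pixel_box` is a hypothetical helper, and the sample box is made up for illustration.

```python
def to_pixel_box(obj, width, height):
    """Convert a normalized (0-1) bounding box to (left, top, right, bottom) in pixels."""
    left = round(obj["x_min"] * width)
    top = round(obj["y_min"] * height)
    right = round(obj["x_max"] * width)
    bottom = round(obj["y_max"] * height)
    return (left, top, right, bottom)


# Illustrative values only; with the model, iterate over result["objects"].
sample_obj = {"x_min": 0.1, "y_min": 0.2, "x_max": 0.5, "y_max": 0.8}
box = to_pixel_box(sample_obj, width=1000, height=500)
box_width = box[2] - box[0]
box_height = box[3] - box[1]
print(box, box_width, box_height)
```

The tuple order matches the `(left, upper, right, lower)` box that Pillow's `Image.crop` expects, so `image.crop(box)` would extract the detected region.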

	### Caching image encodings (advanced)

	If you're planning to run multiple inferences on the same image, you can pre-encode it once and reuse the encoding for better performance.

	```python
	# Encode image once
	image = Image.open("complex_scene.jpg")
	encoded = moondream.encode_image(image)

	# Reuse the encoding for multiple queries
	questions = [
	    "How many people are in this image?",
	    "What time of day was this taken?",
	    "What's the weather like?",
	]

	for q in questions:
	    result = moondream.query(image=encoded, question=q, reasoning=False)
	    print(f"Q: {q}")
	    print(f"A: {result['answer']}\n")

	# Also works with other skills
	caption = moondream.caption(encoded, length="normal")
	objects = moondream.detect(encoded, "vehicle")
	```
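When the same images recur across calls, the encodings can be memoized so each image is encoded only once. A minimal sketch: `EncodingCache` is a hypothetical helper, not part of the Moondream API, and the encoder is passed in as a function so the example runs without the model. In practice it would wrap `moondream.encode_image`. Cached encodings may hold device memory for as long as they are retained, so an unbounded cache like this one suits small working sets.

```python
from typing import Any, Callable, Dict


class EncodingCache:
    """Memoize image encodings by a caller-chosen key (for example, the file path)."""

    def __init__(self, encode_fn: Callable[[Any], Any]):
        self._encode_fn = encode_fn
        self._cache: Dict[str, Any] = {}

    def get(self, key: str, image: Any) -> Any:
        # Encode only on the first request for this key; reuse the result afterwards.
        if key not in self._cache:
            self._cache[key] = self._encode_fn(image)
        return self._cache[key]


# Stand-in encoder that records how often it is called.
calls = []


def fake_encode(image):
    calls.append(image)
    return f"encoded:{image}"


cache = EncodingCache(fake_encode)
first = cache.get("scene.jpg", "scene-pixels")
second = cache.get("scene.jpg", "scene-pixels")
print(len(calls))
```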

	---

	Copyright (c) 2025 M87 Labs, Inc.

	This distribution includes Model Weights licensed under the [Business Source License 1.1 with an Additional Use Grant (No Third-Party Service)](https://huggingface.co/moondream/moondream3-preview/blob/main/LICENSE.md). Commercial hosting or rehosting requires an agreement with <contact@m87.ai>.