---
library_name: transformers
pipeline_tag: image-text-to-text
license: other
base_model:
- moondream/moondream3-preview
---

# Moondream 3 (Preview) 4-Bit

![Moondream Logo]()

**Moondream 3 (Preview) 4-Bit** is the INT4-quantized version of [Moondream3-Preview](https://huggingface.co/moondream/moondream3-preview). It reduces the model size from ~18 GB to ~6 GB (a ~66% reduction), allowing it to run in environments with less than 12 GB of VRAM while largely maintaining quality.

This is a vision language model with a mixture-of-experts architecture (9B total parameters, 2B active), now optimized for deployment with as little as 8 GB of VRAM.

## Features

- **66% smaller**: ~6 GB vs ~18 GB original
- **Lower memory**: runs on ~7 GB of VRAM (vs ~20 GB for FP16)
- **Same capabilities**: retains the original Moondream3 skills & API
- **Minimal quality loss**: ~2-5% degradation on benchmarks
- **HuggingFace compatible**: load with `AutoModelForCausalLM.from_pretrained()`

## VRAM & Time Savings

| Configuration   | Model Size | VRAM Usage | s/query* |
|-----------------|------------|------------|----------|
| FP16 (original) | 18.5 GB    | 19,594 MiB | 4.19     |
| INT4 (this one) | 6.18 GB    | 7,332 MiB  | 2.65     |
| Reduction       | **66%**    | **62%**    | **37%**  |

_(* averaged over the vision-ai-checkup & CountBenchQA benchmarks on an L40S GPU)_
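
For reference, here is a minimal sketch of how such per-query numbers can be measured; it assumes `moondream` and `image` are set up as in the Quick Start below, and is not the actual benchmark harness:

```python
import time
import torch

# Sketch only: time one query and read peak VRAM afterwards.
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
result = moondream.query(image=image, question="How many birds are in this photo?")
torch.cuda.synchronize()  # make sure all GPU work is finished before timing
print(f"s/query:   {time.perf_counter() - start:.2f}")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**20:,.0f} MiB")
```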

## Evaluation Results

| Test              | time (4-bit) | accuracy (4-bit) | time (base) | accuracy (base) |
|-------------------|--------------|------------------|-------------|-----------------|
| vision-ai-checkup | **156 s**    | 42.8%            | 223 s       | **47.2%**       |
| CountBenchQA      | **22.9 min** | 91.2%            | 36.6 min    | **93.2%**       |

![Evaluation time and results]()

## Architecture

**Quantized Components (INT4):**
- Text attention QKV/projection layers
- Dense MLP layers (layers 0-3)
- MoE expert weights (layers 4-23, 64 experts each)
- Region model encoder/decoder

**Preserved in FP16** (a quick way to verify the split is sketched after this list):
- Vision encoder (SigLIP)
- MoE routers (critical for expert selection)
- Temperature (tau) parameters
- LayerNorms, embeddings, LM head
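
A quick, non-authoritative way to inspect this split on a loaded model is to tally tensor dtypes; the preserved components stay floating-point, while packed INT4 weights show up as integer tensors. The name filters below are illustrative, since the exact module names depend on this repo's internals:

```python
from collections import Counter

# Tally dtypes across parameters and buffers (how the packed INT4 weights
# are registered depends on this repository's internal module layout).
tensors = list(moondream.named_parameters()) + list(moondream.named_buffers())
print(Counter(str(t.dtype) for _, t in tensors))

# Spot-check a few submodules; the substring filters are illustrative only.
for name, t in tensors:
    if "router" in name or "vision" in name.lower():
        print(name, tuple(t.shape), t.dtype)
```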

![Architecture]()

**Slow First-Time Compile and Inference**

_A note on first-time compilation time: due to the MoE architecture and the nature of INT4 quants, I had to do some voodoo to get input-invariant compilation graphs for both execution paths (T=1 and T>1, respectively). This results in a longer first-time compilation (1-3 minutes for me) compared to the original Moondream3-preview model (~30 seconds). Torch's end-to-end caching (also known as Mega-Cache) makes subsequent compilations on the same machine much faster, provided it is [correctly configured](https://docs.pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html). I'll remove this note once I find a faster solution, if one is possible (contributions always welcome, of course!); until then, caches are your friend :)_
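
As one example of such configuration (an assumption on my part; see the linked tutorial for the authoritative options), pointing Inductor's on-disk cache at a persistent location lets later runs reuse compiled artifacts:

```python
import os

# Persist torch.compile / Inductor cache artifacts across runs so only the
# first compile on a machine pays the 1-3 minute cost. Set these BEFORE
# importing torch; the path is just an example.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/var/cache/torchinductor"
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"  # enable FX graph caching

import torch  # noqa: E402  (imported after setting the cache env vars)
```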

## Quick Start (HuggingFace Style)

The easiest way to use Moondream3-4bit is via the HuggingFace Transformers API:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the quantized model (same API as the original Moondream3-preview)
moondream = AutoModelForCausalLM.from_pretrained(
    "alecccdd/moondream3-preview-4bit",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])
```

## Alternative: Manual Loading

If you prefer more control, you can load the model directly:

```python
import torch
from PIL import Image

# These modules are shipped in this model repository
from config import MoondreamConfig
from moondream import MoondreamModel
from weights import load_weights

# Load the quantized model
model = MoondreamModel(MoondreamConfig())
load_weights("./", model, device="cuda")
model.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = model.query(image=image, question="What's in this image?")
print(result["answer"])
```

## Skills

The API for all skills remains identical to the [original moondream3-preview model](https://huggingface.co/moondream/moondream3-preview).
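
For illustration, here is a minimal sketch of the other skills, assuming the method names and return keys documented on the original model card (`caption`, `detect`, `point`) carry over unchanged:

```python
# Sketch only: assumes the caption/detect/point APIs match the original
# moondream3-preview model card, with `moondream` and `image` loaded as above.

# Captioning
print(moondream.caption(image, length="long")["caption"])

# Open-vocabulary object detection
print(moondream.detect(image, "face")["objects"])

# Visual pointing
print(moondream.point(image, "person")["points"])
```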

## License

This is a derivative work of Moondream 3 (Preview), which was originally released under the Business Source License 1.1.

Original Copyright (c) M87 Labs, Inc.

Quantization and conversion code:
Copyright (c) 2025 Alicius Schröder