---
library_name: transformers
pipeline_tag: image-text-to-text
license: other
base_model:
- moondream/moondream3-preview
---
# Moondream 3 (Preview) 4-Bit
![4bit-efficiency-gains-and-performance-tradeoffs](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/_puZs7EqffYYMFaNpaxsu.jpeg)
**Moondream 3 (Preview) 4-Bit** is the INT4-quantized version of [Moondream3-Preview](https://huggingface.co/moondream/moondream3-preview). It reduces the model size from \~18 GB to \~6 GB (\~66% reduction), allowing it to run in <12 GB VRAM environments while mostly maintaining quality.
This is a vision-language model with a mixture-of-experts architecture (9B total parameters, 2B active), now optimized for deployment on GPUs with as little as 8 GB of VRAM.
## Features
- **66% smaller**: ~6GB vs ~18GB original
- **Lower memory**: Runs in ~7.3 GB VRAM (vs ~19.6 GB for FP16)
- **Same capabilities**: Retains original Moondream3 skills & API
- **Minimal quality loss**: ~2-5% degradation on benchmarks
- **HuggingFace compatible**: Load with `AutoModelForCausalLM.from_pretrained()`
## VRAM & Time Savings
| Configuration | Model size | VRAM usage | s/query* |
|-----------------|------------|------------|----------|
| FP16 (original) | 18.5 GB | 19,594 MiB | 4.19 |
| INT4 (this one) | 6.18 GB | 7,332 MiB | 2.65 |
| Reduction | **66%** | **62%** | **37%** |

_(* averaged over the vision-ai-checkup & CountBenchQA benchmarks on an L40S GPU)_
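These figures come from a separate benchmark harness, but the shape of the measurement is simple. Below is a minimal sketch for collecting per-query latency and peak VRAM on CUDA (the `measure` helper and its parameters are illustrative, not part of this repo):

```python
import time

import torch

def measure(model, image, question, warmup=1, runs=5):
    """Rough seconds-per-query and peak VRAM (MiB) for a loaded model."""
    for _ in range(warmup):
        # Warmup absorbs torch.compile / autotuning cost before timing.
        model.query(image=image, question=question)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model.query(image=image, question=question)
    torch.cuda.synchronize()
    s_per_query = (time.perf_counter() - start) / runs
    peak_mib = torch.cuda.max_memory_allocated() / 1024**2
    return s_per_query, peak_mib
```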
## Evaluation Results
| Test | Time (4-bit) | Accuracy (4-bit) | Time (base) | Accuracy (base) |
|-------------------|--------------|------------------|-------------|-----------------|
| vision-ai-checkup | **156 s** | 42.8% | 223 s | **47.2%** |
| CountBenchQA | **22.9 min** | 91.2% | 36.6 min | **93.2%** |
![image (9)](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/s3m_TiW0ASZ6jVGHSivdL.webp)
## Architecture
**Quantized Components (INT4):**
- Text attention QKV/projection layers
- Dense MLP layers (layers 0-3)
- MoE expert weights (layers 4-23, 64 experts each)
- Region model encoder/decoder
**Preserved in FP16:**
- Vision encoder (SigLIP)
- MoE routers (critical for expert selection)
- Temperature (tau) parameters
- LayerNorms, embeddings, LM head
![moondream3-preview-4bit-visualization](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/8hYfJv76Q605JhTA6-qOg.jpeg)
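As a rough illustration of the split above, a selective quantization pass walks the module tree and converts only the large `Linear` weights whose names are not on a keep-list. This is a simplified sketch; the name patterns and the `replace_with_int4` helper are hypothetical stand-ins for the actual conversion code:

```python
import torch

# Module-name substrings kept in FP16/BF16: vision encoder, MoE routers,
# tau parameters, norms, embeddings, and the LM head (illustrative patterns).
KEEP_FP16 = ("vision", "router", "tau", "norm", "embed", "lm_head")

def should_quantize(name: str, module: torch.nn.Module) -> bool:
    """INT4-quantize only Linear layers that are not on the keep-list."""
    if not isinstance(module, torch.nn.Linear):
        return False
    return not any(pattern in name for pattern in KEEP_FP16)

# Applying it would look roughly like:
# for name, module in model.named_modules():
#     if should_quantize(name, module):
#         replace_with_int4(model, name)  # hypothetical INT4 swap helper
```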
**Slow First-Time Compile and Inference**
_A note on first-time compilation: due to the MoE architecture and the nature of INT4 quants, I had to do some voodoo to get input-invariant compilation graphs for both execution paths (T=1 and T>1 respectively). This results in a longer first-time compilation (1-3 minutes for me) compared to the original Moondream3-preview model (~30 seconds). Torch's end-to-end caching (also known as Mega-Cache) makes subsequent compilations on the same machine much faster, provided it's [correctly configured](https://docs.pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html). I'll remove this note once I find a faster solution (contributions always welcome, of course!), if one is possible; until then, caches are your friend :\)_
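If the first compile is a recurring cost (e.g. in ephemeral containers), pointing Inductor's on-disk cache at persistent storage is the easiest win. A minimal setup, assuming a recent PyTorch; both variables must be set before the first `compile()` call:

```python
import os

# Persist torch.compile artifacts across runs; the directory must be
# writable and survive restarts (e.g. a mounted volume).
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/path/to/inductor-cache"
# FX graph caching is on by default in recent PyTorch; set explicitly to be safe.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
```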
## Quick Start (HuggingFace Style)
The easiest way to use Moondream3-4bit is via the HuggingFace Transformers API:
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM
# Load quantized model (same API as original Moondream3-preview)
moondream = AutoModelForCausalLM.from_pretrained(
"alecccdd/moondream3-preview-4bit",
trust_remote_code=True,
dtype=torch.bfloat16,
device_map={"": "cuda"},
)
moondream.compile() # Critical for fast inference
# Load an image
image = Image.open("photo.jpg")
# Ask a question
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])
```
## Alternative: Manual Loading
If you prefer more control, you can load the model directly:
```python
import torch
from PIL import Image
from config import MoondreamConfig
from moondream import MoondreamModel
from weights import load_weights
# Load quantized model
model = MoondreamModel(MoondreamConfig())
load_weights("./", model, device="cuda")
model.compile() # Critical for fast inference
# Load an image
image = Image.open("photo.jpg")
# Ask a question
result = model.query(image=image, question="What's in this image?")
print(result["answer"])
```
## Skills
The API for all skills remains identical to that of the [original moondream3-preview model](https://huggingface.co/moondream/moondream3-preview).
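For example, the non-query skills are invoked the same way as upstream. The following is a sketch based on the upstream model card; check it for the authoritative signatures and return shapes:

```python
# Captioning; length can be "short" or "normal" per the upstream card.
print(moondream.caption(image, length="normal")["caption"])

# Open-vocabulary detection and pointing return lists of results.
print(moondream.detect(image, "face")["objects"])
print(moondream.point(image, "person")["points"])
```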
## License
This is a derivative work of Moondream 3 (Preview), which was originally released under the Business Source License 1.1.
Original Copyright (c) M87 Labs, Inc.
Quantization and conversion code:
Copyright (c) 2025 Alicius Schröder