---
library_name: transformers
pipeline_tag: image-text-to-text
license: other
base_model:
- moondream/moondream3-preview
---

# Moondream 3 (Preview) 4-Bit

![Moondream Logo]()

**Moondream 3 (Preview) 4-Bit** is the INT4-quantized version of [Moondream3-Preview](https://huggingface.co/moondream/moondream3-preview). It reduces the model size from ~18 GB to ~6 GB (a ~66% reduction), allowing it to run in environments with less than 12 GB of VRAM while largely maintaining quality.

This is a vision language model with a mixture-of-experts architecture (9B total parameters, 2B active), now optimized for deployment with as little as 8 GB of VRAM.

## Features

- **66% smaller**: ~6 GB vs ~18 GB original
- **Lower memory**: runs on ~7 GB of VRAM (vs ~20 GB for FP16)
- **Same capabilities**: retains the original Moondream3 skills & API
- **Minimal quality loss**: ~2-5% degradation on benchmarks
- **HuggingFace compatible**: load with `AutoModelForCausalLM.from_pretrained()`

## VRAM & Time Savings

| Configuration   | Model Size | VRAM Usage | s/query* |
|-----------------|------------|------------|----------|
| FP16 (original) | 18.5 GB    | 19,594 MiB | 4.19     |
| INT4 (this one) | 6.18 GB    | 7,332 MiB  | 2.65     |
| Reduction       | **66%**    | **62%**    | **37%**  |

_(* averaged over the vision-ai-checkup & CountBenchQA benchmarks on an L40S GPU)_
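
For reference, here is a minimal sketch of how such per-query numbers can be measured; it assumes `moondream` and `image` are set up as in the Quick Start below, and is not the actual benchmark harness:

```python
import time
import torch

# Sketch only: time one query and read peak VRAM afterwards.
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
result = moondream.query(image=image, question="How many birds are in this photo?")
torch.cuda.synchronize()  # make sure all GPU work is finished before timing
print(f"s/query:   {time.perf_counter() - start:.2f}")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**20:,.0f} MiB")
```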

## Evaluation Results

| Test              | time (4-bit) | accuracy (4-bit) | time (base) | accuracy (base) |
|-------------------|--------------|------------------|-------------|-----------------|
| vision-ai-checkup | **156 s**    | 42.8%            | 223 s       | **47.2%**       |
| CountBenchQA      | **22.9 min** | 91.2%            | 36.6 min    | **93.2%**       |

![Evaluation time and results]()

## Architecture

**Quantized Components (INT4):**
- Text attention QKV/projection layers
- Dense MLP layers (layers 0-3)
- MoE expert weights (layers 4-23, 64 experts each)
- Region model encoder/decoder

**Preserved in FP16** (a quick way to verify the split is sketched after this list):
- Vision encoder (SigLIP)
- MoE routers (critical for expert selection)
- Temperature (tau) parameters
- LayerNorms, embeddings, LM head
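
A quick, non-authoritative way to inspect this split on a loaded model is to tally tensor dtypes; the preserved components stay floating-point, while packed INT4 weights show up as integer tensors. The name filters below are illustrative, since the exact module names depend on this repo's internals:

```python
from collections import Counter

# Tally dtypes across parameters and buffers (how the packed INT4 weights
# are registered depends on this repository's internal module layout).
tensors = list(moondream.named_parameters()) + list(moondream.named_buffers())
print(Counter(str(t.dtype) for _, t in tensors))

# Spot-check a few submodules; the substring filters are illustrative only.
for name, t in tensors:
    if "router" in name or "vision" in name.lower():
        print(name, tuple(t.shape), t.dtype)
```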

![Architecture]()

**Slow First-Time Compile and Inference**

_A note on first-time compilation time: due to the MoE architecture and the nature of INT4 quants, I had to do some voodoo to get input-invariant compilation graphs for both execution paths (T=1 and T>1, respectively). This results in a longer first-time compilation (1-3 minutes for me) compared to the original Moondream3-preview model (~30 seconds). Torch's end-to-end caching (also known as Mega-Cache) makes subsequent compilations on the same machine much faster, provided it is [correctly configured](https://docs.pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html). I'll remove this note once I find a faster solution, if one is possible (contributions always welcome, of course!); until then, caches are your friend :)_
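
As one example of such configuration (an assumption on my part; see the linked tutorial for the authoritative options), pointing Inductor's on-disk cache at a persistent location lets later runs reuse compiled artifacts:

```python
import os

# Persist torch.compile / Inductor cache artifacts across runs so only the
# first compile on a machine pays the 1-3 minute cost. Set these BEFORE
# importing torch; the path is just an example.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/var/cache/torchinductor"
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"  # enable FX graph caching

import torch  # noqa: E402  (imported after setting the cache env vars)
```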

## Quick Start (HuggingFace Style)

The easiest way to use Moondream3-4bit is via the HuggingFace Transformers API:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the quantized model (same API as the original Moondream3-preview)
moondream = AutoModelForCausalLM.from_pretrained(
    "alecccdd/moondream3-preview-4bit",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])
```

## Alternative: Manual Loading

If you prefer more control, you can load the model directly:

```python
import torch
from PIL import Image

# These modules are shipped in this model repository
from config import MoondreamConfig
from moondream import MoondreamModel
from weights import load_weights

# Load the quantized model
model = MoondreamModel(MoondreamConfig())
load_weights("./", model, device="cuda")
model.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = model.query(image=image, question="What's in this image?")
print(result["answer"])
```

## Skills

The API for all skills remains identical to the [original moondream3-preview model](https://huggingface.co/moondream/moondream3-preview).
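
For illustration, here is a minimal sketch of the other skills, assuming the method names and return keys documented on the original model card (`caption`, `detect`, `point`) carry over unchanged:

```python
# Sketch only: assumes the caption/detect/point APIs match the original
# moondream3-preview model card, with `moondream` and `image` loaded as above.

# Captioning
print(moondream.caption(image, length="long")["caption"])

# Open-vocabulary object detection
print(moondream.detect(image, "face")["objects"])

# Visual pointing
print(moondream.point(image, "person")["points"])
```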

## License

This is a derivative work of Moondream 3 (Preview), which was originally released under the Business Source License 1.1.

Original Copyright (c) M87 Labs, Inc.

Quantization and conversion code:
Copyright (c) 2025 Alicius Schröder