# MD3P-Int8 - INT8 Quantized Moondream3 for MLX

An INT8 quantized version of Moondream3, offering a balance between model quality and size for MLX deployment.

## Model Details

| Component | Original (BF16) | This Model |
|-----------|-----------------|------------|
| MoE experts (layers 4-23) | BF16 | INT8 |
| Vision encoder | BF16 | BF16 (preserved) |
| Text attention | BF16 | INT8 |
| Text MLP (layers 0-3) | BF16 | INT8 |
| Embeddings | BF16 | BF16 (preserved) |
| Total size | ~12 GB | ~10 GB |
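
One way to verify this split is to inspect the checkpoint: MLX quantized layers store a packed `weight` tensor together with `scales` and `biases`, while preserved layers keep plain BF16 weights. A minimal check, assuming a single-shard file named `model.safetensors` downloaded locally (a sharded checkpoint would need a loop over files):

```python
from safetensors import safe_open

# List modules that carry quantization metadata (.scales tensors);
# everything else kept its BF16 weights.
with safe_open("model.safetensors", framework="numpy") as f:
    quantized = {k.rsplit(".", 1)[0] for k in f.keys() if k.endswith(".scales")}
    for name in sorted(quantized):
        print("INT8:", name)
```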

## Quantization Details

- Method: Affine quantization (bits=8, group_size=64); a minimal sketch follows below
- Target: Text model layers (attention, MLP, MoE experts)
- Preserved: Vision encoder and embeddings kept at BF16 for quality
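
This kind of selective quantization can be expressed with `mlx.nn.quantize` and a predicate that skips the vision tower and embeddings. The `"vision"` path check below is an assumption about module naming in the Moondream3 implementation, not the exact key:

```python
import mlx.nn as nn

def quantize_text_only(model: nn.Module) -> nn.Module:
    # Quantize text-model linear layers to INT8 (affine, group_size=64)
    # while keeping the vision encoder and token embeddings at BF16.
    def predicate(path: str, module: nn.Module) -> bool:
        if "vision" in path:                  # assumed path prefix for the vision encoder
            return False
        if isinstance(module, nn.Embedding):  # keep embeddings at BF16
            return False
        return isinstance(module, nn.Linear)

    nn.quantize(model, group_size=64, bits=8, class_predicate=predicate)
    return model
```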

## Comparison with INT4 Variants

| Model | Size | Quality | Use Case |
|-------|------|---------|----------|
| md3p-int8 (this model) | 10 GB | Higher | Desktop/server MLX |
| md3p-int4 | 6.48 GB | Medium | Memory-constrained setups |
| md3p-int4-smol | 5.43 GB | Lower | iOS (~6 GB memory limit) |
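
The size gap follows from the per-weight storage cost. With grouped affine quantization, each group of 64 weights also stores a scale and a bias (assumed fp16 here), so INT8 costs about 8.5 bits per weight versus 4.5 for INT4 and 16 for the preserved BF16 layers; the preserved layers are why the totals differ by less than the raw 8-vs-4-bit ratio:

```python
# Back-of-the-envelope bits per weight for grouped affine quantization,
# assuming one fp16 scale and one fp16 bias per group.
def bits_per_weight(bits: int, group_size: int = 64) -> float:
    return bits + 2 * 16 / group_size

print(bits_per_weight(8))  # 8.5 -> this model's quantized layers
print(bits_per_weight(4))  # 4.5 -> the int4 variants
```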

## Usage

This model is designed for use with MLX-based Moondream implementations.

```python
# Example with mlx-lm or similar
from mlx_lm import load, generate

model, tokenizer = load("lewi/md3p-int8")
response = generate(model, tokenizer, prompt="Describe yourself.", max_tokens=128)
print(response)
```
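
Since Moondream3 is a vision-language model, image-conditioned generation needs a VLM-aware runner. If the mlx-vlm package supports this architecture (an assumption; check its model list), usage would look roughly like:

```python
# Hedged sketch: assumes mlx-vlm can load this checkpoint.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("lewi/md3p-int8")
config = load_config("lewi/md3p-int8")

prompt = apply_chat_template(processor, config, "Describe this image.", num_images=1)
print(generate(model, processor, prompt, ["photo.jpg"]))
```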

## Source & License

Quantized from the original Moondream3 weights, which are released under the Apache 2.0 license.

## Acknowledgments

Thanks to the Moondream team for the original model.
