Gemma 4 12B Instruction-Tuned — MLX (Apple Silicon)

Local MLX quants of google/gemma-4-12B-it for native inference on Apple Silicon via mlx-vlm.

Parameters   ~12B dense
Modalities   Text, vision, audio (native in backbone)
License      Apache 2.0
Runtime      mlx-vlm (not mlx-lm; Gemma 4 is multimodal)
Format       MLX safetensors, one subfolder per quant

Also available: Edmon02/gemma-4-12B-it-GGUF for llama.cpp / LM Studio.

Why this repo exists

  • One download hub for curated MLX quants (4bit, mxfp4, 6bit, 8bit).
  • PLE-safe conversion from the official Google checkpoint with mlx-vlm >= 0.6.0.
  • Documented recipes in gemma-4-12b-local.

Available quants

See mlx-manifest.json for the live file list.

Subfolder   Use
4bit/       Default; best balance on 16 GB unified memory
mxfp4/      Apple-optimized 4-bit; often fastest on M-series
6bit/       Higher quality
8bit/       Max quality that still fits ~16 GB at inference

To load a specific quant, download its subfolder and point load() at the local path.
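
The trade-offs between quants follow from simple arithmetic: weight memory is roughly parameters × bits per weight ÷ 8. A minimal sketch, assuming a flat 12e9 parameter count (this card only says "~12B") and ignoring the KV cache, activations, and per-group quantization scales, all of which add to the real footprint:

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 12e9  # assumption: the card only states "~12B"

# Nominal bit widths per subfolder; real files are somewhat larger
# because quantized formats also store scales for each weight group.
for name, bits in [("4bit", 4), ("mxfp4", 4), ("6bit", 6), ("8bit", 8)]:
    print(f"{name:>5}: ~{weight_gb(PARAMS, bits):.0f} GB of weights")
```

Treat these figures as lower bounds; actual memory use grows with context length.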

Download

pip install -U mlx-vlm huggingface_hub

# Recommended quant (4bit)
huggingface-cli download Edmon02/gemma-4-12B-it-MLX --include "4bit/*" --local-dir ./models/gemma-4-12b-mlx

Accept the license on google/gemma-4-12B-it before using weights.
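
The same download can be scripted from Python. A sketch using huggingface_hub's snapshot_download with allow_patterns to fetch a single quant subfolder; the helper names quant_patterns and download_quant are illustrative and not part of this repo:

```python
def quant_patterns(quant: str) -> list[str]:
    """Glob patterns matching every file under one quant subfolder."""
    return [f"{quant.strip('/')}/*"]


def download_quant(
    quant: str = "4bit",
    local_dir: str = "./models/gemma-4-12b-mlx",
) -> str:
    """Download one quant subfolder and return the local snapshot path."""
    # Imported lazily so quant_patterns() works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Edmon02/gemma-4-12B-it-MLX",
        allow_patterns=quant_patterns(quant),
        local_dir=local_dir,
    )


# Example (needs network access):
# download_quant("4bit")
```

The files land under the quant's subfolder inside local_dir, so the 4bit model ends up at ./models/gemma-4-12b-mlx/4bit, matching the paths used in the quick start.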

Quick start

Text chat (CLI)

python -m mlx_vlm.generate \
  --model ./models/gemma-4-12b-mlx/4bit \
  --prompt "List three benefits of encoder-free multimodal models." \
  --max-tokens 256 --temperature 0.7

Text chat (Python — use chat template)

Gemma 4 requires the chat template; generate() does not apply it automatically:

from mlx_vlm import generate, load
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("./models/gemma-4-12b-mlx/4bit")
prompt = apply_chat_template(
    processor,
    model.config,
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
)
print(generate(model=model, processor=processor, prompt=prompt, max_tokens=256).text)

Vision (image + text)

python -m mlx_vlm.generate \
  --model ./models/gemma-4-12b-mlx/4bit \
  --prompt "Describe this image in one sentence." \
  --image photo.jpg \
  --max-tokens 128
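
The vision call can also be made from Python. The sketch below follows the usage pattern in the mlx-vlm README (num_images on apply_chat_template, an image list passed to generate); describe_image is a hypothetical helper, and the exact keyword names may differ between mlx-vlm versions, so check them against the version you have installed:

```python
from pathlib import Path


def describe_image(
    model_path: str,
    image_path: str,
    question: str,
    max_tokens: int = 128,
) -> str:
    """Ask one question about one local image and return the reply text."""
    if not Path(image_path).is_file():
        raise FileNotFoundError(f"image not found: {image_path}")

    # Imported lazily so the file check above runs without MLX installed.
    from mlx_vlm import generate, load
    from mlx_vlm.prompt_utils import apply_chat_template

    model, processor = load(model_path)
    # num_images tells the template how many image placeholders to insert.
    prompt = apply_chat_template(
        processor, model.config, question, num_images=1
    )
    result = generate(
        model, processor, prompt, image=[image_path], max_tokens=max_tokens
    )
    return result.text


# Example (needs Apple Silicon and downloaded weights):
# print(describe_image("./models/gemma-4-12b-mlx/4bit", "photo.jpg",
#                      "Describe this image in one sentence."))
```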

Hardware guide

Unified memory   Suggested quant
8 GB             4bit/ only, short context
16 GB            4bit/ or mxfp4/
24 GB+           6bit/ or 8bit/
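
The table above can be encoded as a small lookup. A sketch, assuming that memory sizes between the listed tiers round down to the nearest tier (the table does not say) and that anything under 8 GB is unsupported; suggest_quants is an illustrative name:

```python
def suggest_quants(unified_memory_gb: float) -> list[str]:
    """Map unified memory (GB) to the quant subfolders suggested above."""
    if unified_memory_gb >= 24:
        return ["6bit", "8bit"]
    if unified_memory_gb >= 16:
        return ["4bit", "mxfp4"]
    if unified_memory_gb >= 8:
        return ["4bit"]  # short context only, per the table
    return []  # assumption: below 8 GB nothing is recommended


print(suggest_quants(16))
```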

Provenance

Item         Source
Base model   google/gemma-4-12B-it
Conversion   Local mlx_vlm.convert via scripts/convert_gemma4_mlx_quants.py
Maintainer   Edmon02/audio_set

Limitations

  • Converted locally; validate quality on your own tasks against the official BF16 checkpoint.
  • Audio support depends on your mlx-vlm version; confirm processor_config.json is present in the quant folder.
  • Gated upstream: an HF token and license acceptance are required for google/* repos.
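
A quick sanity check on a downloaded quant folder can catch a missing processor_config.json before loading. A minimal sketch using only the standard library; check_quant_dir is a hypothetical helper, and the expectation of a config.json alongside the .safetensors files is an assumption based on the usual MLX layout, not something this card states:

```python
from pathlib import Path


def check_quant_dir(path: str) -> list[str]:
    """Return a list of problems; an empty list means the folder looks complete."""
    root = Path(path)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    # config.json is assumed; processor_config.json is named in the limitations.
    for name in ("config.json", "processor_config.json"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")
    if not any(root.glob("*.safetensors")):
        problems.append("no .safetensors weight files")
    return problems


# Example:
# print(check_quant_dir("./models/gemma-4-12b-mlx/4bit") or "looks complete")
```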

Citation

@article{gemma_2026,
  title={Gemma 4},
  author={Google DeepMind},
  year={2026},
  url={https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12B/}
}