Gemma 4 12B Instruction-Tuned — MLX (Apple Silicon)

Local MLX quants of google/gemma-4-12B-it for native inference on Apple Silicon via mlx-vlm.


Parameters	~12B dense
Modalities	Text, vision, audio (native in backbone)
License	Apache 2.0
Runtime	`mlx-vlm` (not `mlx-lm` — Gemma 4 is multimodal)
Format	MLX safetensors, one subfolder per quant

Also available: Edmon02/gemma-4-12B-it-GGUF for llama.cpp / LM Studio.

Why this repo exists

One download hub for curated MLX quants (4bit, mxfp4, 6bit, 8bit).
PLE-safe conversion from the official Google checkpoint with mlx-vlm >= 0.6.0.
Documented recipes in gemma-4-12b-local.

Available quants

See mlx-manifest.json for the live file list.

Subfolder	Use
`4bit/`	Default — best balance on 16 GB unified memory
`mxfp4/`	Apple-optimized 4-bit; often fastest on M-series
`6bit/`	Higher quality
`8bit/`	Max quality that still fits ~16 GB at inference

To load a specific quant, download its subfolder and point load() at the resulting local path.

Download

pip install -U mlx-vlm huggingface_hub

# Recommended quant (4bit)
huggingface-cli download Edmon02/gemma-4-12B-it-MLX --include "4bit/*" --local-dir ./models/gemma-4-12b-mlx

Accept the license on google/gemma-4-12B-it before using weights.
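
The same single-subfolder download can be scripted from Python. The sketch below assumes `snapshot_download` from `huggingface_hub` with its `allow_patterns` filter; the helper names `quant_patterns` and `download_quant` are illustrative and not part of this repo.

```python
from pathlib import Path

REPO_ID = "Edmon02/gemma-4-12B-it-MLX"
QUANTS = ("4bit", "mxfp4", "6bit", "8bit")  # subfolders listed under "Available quants"


def quant_patterns(quant: str) -> list[str]:
    """Return the glob pattern that selects one quant subfolder of the repo."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {QUANTS}")
    return [f"{quant}/*"]


def download_quant(quant: str = "4bit",
                   local_dir: str = "./models/gemma-4-12b-mlx") -> Path:
    """Download a single quant subfolder and return the folder to pass to load()."""
    # Imported lazily so quant_patterns() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=quant_patterns(quant),  # skip every other quant
        local_dir=local_dir,
    )
    return Path(root) / quant
```

Calling `download_quant("mxfp4")` should fetch only the `mxfp4/` files and return that folder under `local_dir`, which can then be passed to `load()`.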

Quick start

Text chat (CLI)

python -m mlx_vlm.generate \
  --model ./models/gemma-4-12b-mlx/4bit \
  --prompt "List three benefits of encoder-free multimodal models." \
  --max-tokens 256 --temperature 0.7

Text chat (Python — use chat template)

Gemma 4 requires the chat template; generate() does not apply it automatically:

from mlx_vlm import generate, load
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("./models/gemma-4-12b-mlx/4bit")
prompt = apply_chat_template(
    processor,
    model.config,
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
)
print(generate(model=model, processor=processor, prompt=prompt, max_tokens=256).text)

Vision (image + text)

python -m mlx_vlm.generate \
  --model ./models/gemma-4-12b-mlx/4bit \
  --prompt "Describe this image in one sentence." \
  --image photo.jpg \
  --max-tokens 128

Hardware guide

Unified memory	Suggested quant
8 GB	`4bit/` only, short context
16 GB	`4bit/` or `mxfp4/`
24 GB+	`6bit/` or `8bit/`
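
The table can be turned into a small selection helper. This is a sketch, not part of the repo: the function names are hypothetical, machines between two rows are assumed to fall back to the lower row, and anything under 8 GB is assumed to be unsupported because the table does not cover it. Reading memory via `sysctl -n hw.memsize` only works on macOS.

```python
import subprocess


def suggest_quants(unified_memory_gb: float) -> list[str]:
    """Map unified memory (GB) to the quant subfolders suggested in the table."""
    if unified_memory_gb < 8:
        return []                     # assumption: below the smallest table row
    if unified_memory_gb < 16:
        return ["4bit"]               # 8 GB row: 4bit only, short context
    if unified_memory_gb < 24:
        return ["4bit", "mxfp4"]      # 16 GB row
    return ["6bit", "8bit"]           # 24 GB+ row


def unified_memory_gb() -> float:
    """Read total physical memory on macOS; hw.memsize is reported in bytes."""
    out = subprocess.run(
        ["sysctl", "-n", "hw.memsize"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip()) / 1024**3
```

On an Apple Silicon machine, `suggest_quants(unified_memory_gb())` gives the candidate subfolders for that device; the final choice still depends on context length and other running apps.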

Provenance

Item	Source
Base model	`google/gemma-4-12B-it`
Conversion	Local `mlx_vlm.convert` via `scripts/convert_gemma4_mlx_quants.py`
Maintainer	Edmon02/audio_set

Limitations

Converted locally — validate quality on your tasks vs official BF16.
Audio support depends on your mlx-vlm version; confirm processor_config.json is present.
Gated upstream — HF token + license acceptance required for google/* repos.
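
A quick sanity check on a downloaded quant folder can catch a missing `processor_config.json` before loading. The sketch below is illustrative: `missing_quant_files` is a hypothetical helper, and the expectation of a `config.json` plus at least one `*.safetensors` file per subfolder is an assumption based on the usual MLX layout, not something this repo documents.

```python
from pathlib import Path


def missing_quant_files(quant_dir: str) -> list[str]:
    """Return the expected files that are absent from a quant subfolder."""
    root = Path(quant_dir)
    missing = []
    # processor_config.json is named in the limitations above;
    # config.json is an assumption about the standard MLX folder layout.
    for name in ("config.json", "processor_config.json"):
        if not (root / name).is_file():
            missing.append(name)
    # Weights ship as MLX safetensors, possibly split into several shards.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

An empty list from `missing_quant_files("./models/gemma-4-12b-mlx/4bit")` means the basic files are in place; it does not confirm that your mlx-vlm version supports audio.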

Citation

@article{gemma_2026,
  title={Gemma 4},
  author={Google DeepMind},
  year={2026},
  url={https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12B/}
}
