gemma-4-12B-it-mlx-8bit

An 8-bit MLX conversion of google/gemma-4-12B-it, the encoder-free unified multimodal Gemma 4 12B instruction-tuned model (text + image + audio + video in a single decoder, model_type="gemma4_unified", 48 layers, vocab 262144, 256K context). Converted with mlx-vlm for Apple Silicon. Runs comfortably on an M-series Mac with unified memory.

Converted with mlx-vlm because the 12B is multimodal/encoder-free. mlx-lm would convert only the text decoder and silently drop the vision and audio towers — use this VLM build for the full model.

How it was converted

Resolved toolchain: mlx-vlm 0.6.1, mlx 0.31.2, transformers 5.9.0, huggingface_hub 1.17.0 (Python 3.12).

Exact command:

mlx_vlm.convert \
  --hf-path google/gemma-4-12B-it \
  --mlx-path ./gemma-4-12B-it-mlx-8bit \
  -q --q-bits 8 --q-group-size 64 --dtype bfloat16

Quantization / PLE note (why this is 8.512 bits-per-weight, not 8.0)

Gemma 4 uses Per-Layer Embeddings (PLE) with scaled embedding/linear tables. Naive low-bit quantization of those tables produces incoherent (looping / garbage) output. The mlx-vlm 0.6.1 converter applies a PLE-aware predicate: it keeps the per-layer PLE tables (per_layer_*, altup, laurel), every normalization layer, and all per-layer scalars in bf16, and quantizes the large matmuls (MLP gate/up/down, attention q/k/v/o, and the token/vision/audio embedding projections) to 8-bit (affine, group size 64). The bf16-kept tensors are why the measured average is 8.512 bits per weight rather than a flat 8.0.

This 8-bit build was validated by real generations (correct factual + reasoning answers, fluent non-repetitive prose) before publishing.

Footprint & performance (Apple M-series, 96 GB unified memory)

metric value
weights on disk ~12 GB (3 safetensors shards)
peak inference memory ~12.85 GB
generation speed ~40–43 tokens/sec (text, batch 1, no KV-cache quant)
cold model load 3.5 s from disk (1.85 s warm OS cache)

Verified working examples (proof of correctness)

This build was checked with assertion-style examples, not just eyeballed. All passed.

Text — correctness assertions (both PASS):

CHECK: capital-of-France   -> PASS
  A: The capital of France is **Paris**.  (+ Eiffel Tower, Louvre Museum)
CHECK: 17-sheep-all-but-9  -> PASS
  A: There are **9** sheep left.  ("all but 9" = every sheep except that number)
ALL CHECKS PASSED

Multimodal — vision path (proof the image tower survived conversion):

Given a sample photo of two cats, the 8-bit model returned:

Two tabby cats are resting on a bright pink blanket inside a vehicle. They are positioned near
two white remote controls, with one cat stretching out and the other curled slightly inward.

The description is accurate (two tabby cats, pink blanket, two white remotes), confirming the vision encoder converted correctly. This is the concrete reason mlx-vlm was required over mlx-lm, which would have dropped the vision and audio towers.

Usage (mlx-vlm)

Install: pip install mlx-vlm (this is a VLM, not a plain text model).

Text generation:

python -m mlx_vlm.generate --model agentmish/gemma-4-12B-it-mlx-8bit \
  --prompt "In one sentence, what is Apple MLX?" \
  --temperature 0.0 --max-tokens 200

Sampling flags: mlx_vlm.generate does not accept --top-p / --top-k (those are mlx_lm flags). Pass them through --gen-kwargs instead, or rely on the model's generation_config.json defaults (temperature 1.0, top_k 64, top_p 0.95):

python -m mlx_vlm.generate --model agentmish/gemma-4-12B-it-mlx-8bit \
  --prompt "Explain unified memory on Apple Silicon." \
  --temperature 1.0 --max-tokens 350 --gen-kwargs '{"top_p": 0.95, "top_k": 64}'

Multimodal (image example):

python -m mlx_vlm.generate --model agentmish/gemma-4-12B-it-mlx-8bit \
  --image path/or/url.jpg --prompt "Describe this image." --max-tokens 200

Chat format: Gemma 4 uses <|turn> … <turn|> turn delimiters and auto-prepends <bos> via its chat template (the template is bundled here as chat_template.jinja); do not add <bos> yourself. Thinking is opt-in (--enable-thinking); by default the template emits an empty thought channel so the model answers directly.

Benchmarks — making it snappy (honest findings)

Measured on a 96 GB Apple Silicon Mac, batch size 1, greedy decoding. The headline is that the dominant snappiness lever is the 8-bit weight quantization itself (this build) plus keeping the model resident and using a prompt cache — not KV-cache quantization.

lever effect when to use
8-bit weights (this build) ~12 GB, ~40–43 tok/s, ~3.5 s cold load always — the main win
prompt cache (repeated long context) ~19.7x faster prefill on a shared prefix (e.g. 1380 cached tokens: 2.52 s → 0.128 s; 2nd-turn TTFT ~2.7 s → ~0.31 s) multi-turn / shared-context apps
TurboQuant KV (--kv-quant-scheme turboquant --kv-bits 3.5) negligible here: only ~75 MB saved at 6.4k tokens, and decode ~1.5 tok/s slower (dequant per step) very long context / batched serving only
uniform KV quant broken for Gemma 4 — crashes (RotatingKVCache Quantization NYI) avoid; use the TurboQuant scheme instead

Why KV quantization barely helps at normal lengths: under Gemma's grouped-query + sliding-window attention the KV cache stays small, so peak memory is dominated by the one-shot prefill activation spike plus the 12 GB of weights, not by the cache. TurboQuant's benefit only scales with steady-state KV (tens of thousands of tokens, or many concurrent requests). The prompt cache is the single biggest latency lever for repeated/shared context, and it triggers when the cached prefix is strictly shorter than the new prompt (shared context + a new trailing question), not when you re-send an identical prompt.

Optional speedup — MTP speculative decoding (add-on, not required)

An optional Multi-Token-Prediction drafter is published separately at agentmish/gemma-4-12B-it-assistant-mlx-bf16. It is a true add-on: this base model runs on its own, and the drafter only accelerates decoding when you opt in with --draft-model … --draft-kind mtp --draft-block-size 3. On this hardware it gives a real ~1.70x speedup on short factual prompts and ~1.36x on long generation. See that repo's card for the full block-size sweep and an honest note on temperature-0 output equivalence.

License & attribution

This conversion is released under Apache-2.0, inheriting the license of the base model.

NOTICE

This is a format conversion (8-bit MLX quantization) of google/gemma-4-12B-it (© Google, licensed under Apache-2.0). All model weights and the underlying architecture are the work of Google; this repository only re-encodes them for the Apple MLX runtime via mlx-vlm. No model behavior was intentionally modified beyond 8-bit quantization. Refer to the base model card for intended use, capabilities, and limitations: https://huggingface.co/google/gemma-4-12B-it

Downloads last month
93
Safetensors
Model size
3B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for agentmish/gemma-4-12B-it-mlx-8bit

Quantized
(37)
this model