gemma-4-12B-it-assistant-mlx-bf16

The MTP (Multi-Token Prediction) speculative-decoding drafter for Gemma 4 12B IT, converted to MLX (bf16) with mlx-vlm. This is an optional add-on that accelerates decoding of the base model agentmish/gemma-4-12B-it-mlx-8bit. It is not a standalone chat model and is not required to run the base model.

Source: google/gemma-4-12B-it-assistant (Gemma4UnifiedAssistantForCausalLM, model_type=gemma4_unified_assistant).
Small drafter: 4 transformer layers, hidden size 1024, backbone_hidden_size=3840 (matches the target so it shares the target's token embeddings). ~837 MB in bf16.
Kept bf16 (not quantized) on purpose: aggressively quantizing a tiny drafter lowers its acceptance rate and erases the speculative speedup.

How it was converted

mlx-vlm 0.6.1, mlx 0.31.2:

mlx_vlm.convert --hf-path google/gemma-4-12B-it-assistant \
  --mlx-path ./gemma-4-12B-it-assistant-mlx-bf16 --dtype bfloat16

Usage — opt-in MTP speculative decoding

The drafter is enabled by adding three flags to a normal mlx_vlm.generate call. The base model runs identically without them. The best setting on this hardware is --draft-block-size 3.

python -m mlx_vlm.generate \
  --model agentmish/gemma-4-12B-it-mlx-8bit \
  --draft-model agentmish/gemma-4-12B-it-assistant-mlx-bf16 \
  --draft-kind mtp --draft-block-size 3 \
  --temperature 0.0 --max-tokens 400 \
  --prompt "Explain how unified memory helps LLM inference on Apple Silicon."

mlx_vlm prints an acceptance line, e.g. Speculative decoding: 2.91 accepted tokens/round … over N rounds.
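To make the "accepted tokens/round" figure concrete, here is a minimal sketch of greedy speculative decoding with toy stand-ins for both models. It is not mlx-vlm's implementation: target_next, draft_next, and speculative_generate are hypothetical names, the "models" are plain arithmetic functions, and verification is written as a loop where a real system would run one batched forward pass of the target. Bonus tokens and sampling are left out.

```python
def target_next(prefix):
    # Hypothetical stand-in for the target model: a deterministic next token.
    return (sum(prefix) * 31 + len(prefix)) % 7


def draft_next(prefix):
    # Hypothetical stand-in for the drafter: agrees with the target except
    # at every fourth position, where it deliberately guesses wrong.
    token = target_next(prefix)
    return token if len(prefix) % 4 != 3 else (token + 1) % 7


def greedy_baseline(prompt, max_tokens):
    # Plain greedy decoding: one target call per generated token.
    seq = list(prompt)
    for _ in range(max_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_generate(prompt, max_tokens, block_size):
    seq = list(prompt)
    rounds = 0
    while len(seq) - len(prompt) < max_tokens:
        rounds += 1
        # 1. The drafter proposes block_size tokens autoregressively.
        draft = []
        for _ in range(block_size):
            draft.append(draft_next(seq + draft))
        # 2. The target checks the proposals in order. The emitted token is
        #    always the target's own choice, so the output matches the baseline.
        for proposed in draft:
            expected = target_next(seq)
            seq.append(expected)
            if proposed != expected:
                break  # first mismatch: keep the correction, drop the rest
    return seq[len(prompt):][:max_tokens], rounds


tokens, rounds = speculative_generate([1, 2, 3], max_tokens=12, block_size=3)
print(f"{len(tokens)} tokens in {rounds} rounds "
      f"({len(tokens) / rounds:.2f} tokens/round)")
```

In this toy setting the result equals greedy_baseline exactly, because each model is an exact function. The equivalence section further down describes why real hardware can break that guarantee.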

Reminder: mlx_vlm.generate has no --top-p/--top-k; pass them via --gen-kwargs.

Measured speedup (Apple M-series, 96 GB, batch 1, greedy temp 0)

Contrary to the common expectation that batch-1 speculative decoding is neutral or slower, this official drafter has high acceptance on Gemma 4 12B and delivers a real speedup: up to 1.70x on the short prompt and 1.36x on the long one. --draft-block-size 3 is the sweet spot in both cases.

Short factual prompt (capital of France + two landmarks, 32 generated tokens):

draft-block-size	gen tok/s	speedup	accepted tok/round	output == base
baseline (no drafter)	44.02	1.00x	—	—
2	63.86	1.45x	1.94	yes
3	74.71	1.70x	2.91	yes
4	72.28	1.64x	3.67	yes
6	68.00	1.55x	3.78	yes
8	57.59	1.31x	3.78	yes

Long generative prompt (3 paragraphs, 400 generated tokens):

draft-block-size	gen tok/s	speedup	accepted tok/round	output == base
baseline (no drafter)	42.41	1.00x	—	—
2	54.50	1.29x	1.67	diverges*
3	57.68	1.36x	2.16	diverges*
4	49.47	1.17x	2.33	diverges*
6	49.70	1.17x	2.33	diverges*
8	49.50	1.17x	2.33	diverges*
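The speedup column is simply the drafted throughput divided by the baseline throughput for the same prompt. The sketch below recomputes it from the gen tok/s values in the two tables; the dictionary layout and the function names are illustrative choices, not part of mlx-vlm.

```python
# gen tok/s copied from the two tables above, keyed by --draft-block-size.
BASELINE_TOK_S = {"short": 44.02, "long": 42.41}
DRAFTED_TOK_S = {
    "short": {2: 63.86, 3: 74.71, 4: 72.28, 6: 68.00, 8: 57.59},
    "long": {2: 54.50, 3: 57.68, 4: 49.47, 6: 49.70, 8: 49.50},
}


def speedups(prompt):
    # Speedup = drafted tok/s divided by the baseline tok/s of that prompt.
    base = BASELINE_TOK_S[prompt]
    return {size: tok_s / base for size, tok_s in DRAFTED_TOK_S[prompt].items()}


def best_block_size(prompt):
    # The block size with the highest speedup for the given prompt.
    ratios = speedups(prompt)
    return max(ratios, key=ratios.get)


for prompt in ("short", "long"):
    best = best_block_size(prompt)
    print(f"{prompt}: best block size {best}, {speedups(prompt)[best]:.2f}x")
```

Because the inputs are rounded to two decimals, a recomputed ratio can differ from the published column in the last digit.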

Temperature-0 output equivalence — the honest version

Speculative decoding is exact in theory (the target verifies every drafted token), but on this hardware floating-point differences mean the output is not always bit-identical to the baseline in practice:

Short outputs are byte-identical to running the base model alone, at every block size.
Long outputs diverge. Baseline and drafter share a long common prefix, then split into different but equally coherent and on-topic continuations (* diverges above).

Cause (not a bug): the target verifies several drafted tokens in one batched forward pass, and Metal matmuls at that batch/sequence shape differ at the floating-point level from the one-token-at-a-time baseline. At a near-tie greedy logit the argmax flips and the sequences part ways. Short outputs hit no near-tie and stay identical; long ones eventually do. The acceptance stats (roughly 1.7 to 3.8 accepted tokens/round across the two tables) confirm the drafter is doing real work. For bit-identical-to-baseline output, run without the drafter; for speed, use it.
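The mechanism can be shown in miniature. The sketch below is an analogy only: it uses Python float64 sums rather than Metal bf16 matmuls, and the terms and the rival logit are made-up values chosen to produce a near-tie. It shows that adding the same numbers in a different order changes the last bit of the result, which is enough to flip a greedy argmax.

```python
def sum_left_to_right(terms):
    # Accumulate in the given order, as one kernel shape might.
    total = 0.0
    for term in terms:
        total += term
    return total


def sum_right_to_left(terms):
    # Accumulate the same terms in the opposite order.
    total = 0.0
    for term in reversed(terms):
        total += term
    return total


def greedy_argmax(logits):
    # Index of the largest logit; on an exact tie the lowest index wins.
    return max(range(len(logits)), key=logits.__getitem__)


TERMS = [0.1, 0.2, 0.3]  # hypothetical partial products for token 1's logit
RIVAL_LOGIT = 0.6        # hypothetical logit of token 0, identical in both runs

logit_a = sum_left_to_right(TERMS)
logit_b = sum_right_to_left(TERMS)

token_a = greedy_argmax([RIVAL_LOGIT, logit_a])
token_b = greedy_argmax([RIVAL_LOGIT, logit_b])

print(f"left-to-right: logit {logit_a!r} -> token {token_a}")
print(f"right-to-left: logit {logit_b!r} -> token {token_b}")
```

Both runs are mathematically the same computation, yet they pick different tokens. Once one token differs, every later step conditions on a different prefix, so the two continuations keep drifting apart.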

License & attribution

Apache-2.0, inherited from the base model. This is an MLX format conversion of google/gemma-4-12B-it-assistant (© Google, Apache-2.0); weights and architecture are Google's.

Model size: 0.4B params (Safetensors, tensor type BF16).
