gemma-4-12B-it-assistant-mlx-bf16

The MTP (Multi-Token Prediction) speculative-decoding drafter for Gemma 4 12B IT, converted to MLX (bf16) with mlx-vlm. This is an optional add-on that accelerates decoding of the base model agentmish/gemma-4-12B-it-mlx-8bit. It is not a standalone chat model and is not required to run the base model.

  • Source: google/gemma-4-12B-it-assistant (Gemma4UnifiedAssistantForCausalLM, model_type=gemma4_unified_assistant).
  • Small drafter: 4 transformer layers, hidden size 1024, backbone_hidden_size=3840 (matches the target so it shares the target's token embeddings). ~837 MB in bf16.
  • Kept bf16 (not quantized) on purpose: aggressively quantizing a tiny drafter lowers its acceptance rate and erases the speculative speedup.
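
As a rough sanity check on the size figure above, the parameter count can be estimated from the bf16 footprint. This is a back-of-the-envelope sketch, assuming "MB" means 10^6 bytes and that nearly all of the file is bf16 weights at 2 bytes each; neither assumption is confirmed by the card.

```python
# Estimate parameter count from the on-disk size quoted above.
# Assumption: 837 MB = 837 * 10^6 bytes, essentially all bf16 weights.
BYTES_PER_BF16 = 2
size_bytes = 837e6

approx_params = size_bytes / BYTES_PER_BF16
print(f"~{approx_params / 1e9:.2f}B parameters")
```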

How it was converted

Converted with mlx-vlm 0.6.1 and mlx 0.31.2:

mlx_vlm.convert --hf-path google/gemma-4-12B-it-assistant \
  --mlx-path ./gemma-4-12B-it-assistant-mlx-bf16 --dtype bfloat16

Usage: opt-in MTP speculative decoding

The drafter is enabled by adding three flags to a normal mlx_vlm.generate call. The base model runs identically without them. The best setting on this hardware is --draft-block-size 3.

python -m mlx_vlm.generate \
  --model agentmish/gemma-4-12B-it-mlx-8bit \
  --draft-model agentmish/gemma-4-12B-it-assistant-mlx-bf16 \
  --draft-kind mtp --draft-block-size 3 \
  --temperature 0.0 --max-tokens 400 \
  --prompt "Explain how unified memory helps LLM inference on Apple Silicon."

mlx_vlm prints an acceptance line, e.g. Speculative decoding: 2.91 accepted tokens/round … over N rounds.
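
If you collect the generation log in a script, the acceptance figure can be pulled out with a small regular expression. This is a minimal sketch that assumes the line starts exactly as in the example above; the rest of the line is elided in this card, so the pattern deliberately ignores it. The names `ACCEPT_RE` and `accepted_per_round` are hypothetical helpers, not part of mlx-vlm.

```python
import re

# Matches the start of the acceptance line as quoted in this card.
# Assumption: the wording "accepted tokens/round" is stable across versions.
ACCEPT_RE = re.compile(
    r"Speculative decoding:\s*([0-9]+(?:\.[0-9]+)?)\s+accepted tokens/round"
)


def accepted_per_round(log_text):
    """Return accepted tokens per round as a float, or None if not found."""
    match = ACCEPT_RE.search(log_text)
    return float(match.group(1)) if match else None


sample = "Speculative decoding: 2.91 accepted tokens/round"
print(accepted_per_round(sample))
```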

Reminder: mlx_vlm.generate has no --top-p/--top-k; pass them via --gen-kwargs.

Measured speedup (Apple M-series, 96 GB, batch 1, greedy temp 0)

Contrary to the common expectation that batch-1 speculative decoding is neutral or slower, this official drafter has a high acceptance rate on Gemma 4 12B and delivers a real speedup: up to 1.70x on the short prompt and 1.36x on the long one. --draft-block-size 3 is the sweet spot in both tests.

Short factual prompt (capital of France + two landmarks, 32 generated tokens):

| draft-block-size | gen tok/s | speedup | accepted tok/round | output == base |
|---|---|---|---|---|
| baseline (no drafter) | 44.02 | 1.00x | n/a | n/a |
| 2 | 63.86 | 1.45x | 1.94 | yes |
| 3 | 74.71 | 1.70x | 2.91 | yes |
| 4 | 72.28 | 1.64x | 3.67 | yes |
| 6 | 68.00 | 1.55x | 3.78 | yes |
| 8 | 57.59 | 1.31x | 3.78 | yes |

Long generative prompt (3 paragraphs, 400 generated tokens):

| draft-block-size | gen tok/s | speedup | accepted tok/round | output == base |
|---|---|---|---|---|
| baseline (no drafter) | 42.41 | 1.00x | n/a | n/a |
| 2 | 54.50 | 1.29x | 1.67 | diverges* |
| 3 | 57.68 | 1.36x | 2.16 | diverges* |
| 4 | 49.47 | 1.17x | 2.33 | diverges* |
| 6 | 49.70 | 1.17x | 2.33 | diverges* |
| 8 | 49.50 | 1.17x | 2.33 | diverges* |
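
The speedup column is the drafter's generation throughput divided by the baseline throughput. The sketch below recomputes it for the long-prompt rows and picks the fastest block size; the variable names are illustrative and the numbers are copied from the table above.

```python
# Long-prompt throughput (gen tok/s) copied from the table above.
baseline_tps = 42.41
drafter_tps = {2: 54.50, 3: 57.68, 4: 49.47, 6: 49.70, 8: 49.50}

# speedup = drafter tok/s divided by baseline tok/s, rounded as in the table.
speedup = {block: round(tps / baseline_tps, 2) for block, tps in drafter_tps.items()}

# The best block size is simply the one with the highest throughput.
best_block = max(drafter_tps, key=drafter_tps.get)

for block, ratio in speedup.items():
    print(f"block size {block}: {ratio:.2f}x")
print("best block size:", best_block)
```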

Temperature-0 output equivalence: the honest version

Speculative decoding is exact in theory (the target verifies every drafted token), but on this hardware floating-point effects make it inexact in practice:

  • Short outputs are byte-identical to running the base model alone, at every block size.
  • Long outputs diverge. The baseline run and the drafter-assisted run share a long common prefix, then split into different but equally coherent, on-topic continuations (marked diverges* in the table above).
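
To make "exact in theory" concrete, here is a minimal sketch of greedy verification in generic speculative decoding. It is not mlx-vlm's implementation: `verify_greedy` and the toy `target_next` function are hypothetical, and a real target scores all drafted positions in one batched forward pass rather than in a Python loop.

```python
def verify_greedy(prefix, drafted, target_next):
    """Return the tokens emitted in one round of greedy verification.

    Drafted tokens are kept while they match the target's own greedy choice.
    The first mismatch is replaced by the target's token; if everything
    matches, the target contributes one extra token.
    """
    context = list(prefix)
    emitted = []
    for token in drafted:
        expected = target_next(context)
        if token != expected:
            emitted.append(expected)  # target's correction ends the round
            return emitted
        emitted.append(token)
        context.append(token)
    emitted.append(target_next(context))  # all drafts accepted: bonus token
    return emitted


def toy_target(context):
    # Toy "model": the next token is the last token plus one, modulo 10.
    return (context[-1] + 1) % 10


print(verify_greedy([1, 2, 3], [4, 5, 9], toy_target))  # third draft is wrong
print(verify_greedy([1, 2, 3], [4, 5, 6], toy_target))  # all drafts accepted
```

In exact arithmetic every emitted token is the one the target would have picked on its own, which is why the output should match plain greedy decoding.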

Cause (not a bug): with the drafter enabled, the target verifies several drafted tokens in one batched forward pass, and Metal matmuls at that batch/sequence shape differ at the floating-point level from the one-token-at-a-time baseline. At a near-tie between the top greedy logits, the argmax flips and the sequences part ways. Short outputs hit no near-tie and stay identical; long ones eventually do. The acceptance stats (2.16 to 2.91 accepted tokens/round at block size 3) confirm the drafter is doing real work. For output bit-identical to the baseline, run without the drafter; for speed, use it.
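
The mechanism can be reproduced in plain Python without any GPU. The sketch below is only an analogy: it assumes the Metal kernels accumulate partial sums in a different order for different input shapes, and shows how the same three terms added in two orders give logits one rounding step apart, which is enough to flip a greedy argmax at a near-tie.

```python
def argmax(values):
    # Index of the largest value; ties resolve to the first index.
    return max(range(len(values)), key=lambda i: values[i])


# The same three contributions to one logit, accumulated in two orders.
contrib = [0.1, 0.2, 0.3]
logit_order_a = (contrib[0] + contrib[1]) + contrib[2]
logit_order_b = contrib[0] + (contrib[1] + contrib[2])

# A rival token whose logit is nearly tied with the one above.
rival_logit = 0.6

pick_a = argmax([rival_logit, logit_order_a])
pick_b = argmax([rival_logit, logit_order_b])

print(logit_order_a == logit_order_b)  # the two sums are not bit-identical
print(pick_a, pick_b)                  # the greedy choice differs
```

Once a single token differs, every later token is conditioned on a different context, so the two runs keep drifting apart.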

License & attribution

Apache-2.0, inheriting the base. This is an MLX format conversion of google/gemma-4-12B-it-assistant (© Google, Apache-2.0); weights and architecture are Google's.
