gemma-4-12B-it-assistant-mlx-bf16
The MTP (Multi-Token Prediction) speculative-decoding drafter for Gemma 4 12B IT, converted to
MLX (bf16) with mlx-vlm. This is an optional add-on that
accelerates decoding of the base model
agentmish/gemma-4-12B-it-mlx-8bit. It
is not a standalone chat model and is not required to run the base model.
- Source: google/gemma-4-12B-it-assistant (Gemma4UnifiedAssistantForCausalLM, model_type=gemma4_unified_assistant).
- Small drafter: 4 transformer layers, hidden size 1024, backbone_hidden_size=3840 (matches the target so it shares the target's token embeddings). ~837 MB in bf16.
- Kept bf16 (not quantized) on purpose: aggressively quantizing a tiny drafter lowers its acceptance rate and erases the speculative speedup.
How it was converted
Converted with mlx-vlm 0.6.1 and mlx 0.31.2:
mlx_vlm.convert --hf-path google/gemma-4-12B-it-assistant \
--mlx-path ./gemma-4-12B-it-assistant-mlx-bf16 --dtype bfloat16
Usage: opt-in MTP speculative decoding
The drafter is enabled by adding three flags to a normal mlx_vlm.generate call. The base model
runs identically without them. The best setting on this hardware is --draft-block-size 3.
python -m mlx_vlm.generate \
--model agentmish/gemma-4-12B-it-mlx-8bit \
--draft-model agentmish/gemma-4-12B-it-assistant-mlx-bf16 \
--draft-kind mtp --draft-block-size 3 \
--temperature 0.0 --max-tokens 400 \
--prompt "Explain how unified memory helps LLM inference on Apple Silicon."
mlx_vlm prints an acceptance line, e.g. Speculative decoding: 2.91 accepted tokens/round … over N rounds.
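To make the "accepted tokens/round" figure concrete, here is a minimal sketch of greedy speculative decoding with toy stand-ins for both models. Everything in it (`target_next`, `draft_next`, the integer tokens, the drafter's disagreement pattern) is hypothetical and is not mlx_vlm's implementation; in particular, how mlx_vlm counts tokens per round may differ from the simple count used here.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (hypothetical)


def target_next(ctx: List[int]) -> int:
    """Toy target model: greedy next token as a deterministic function of the context."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_next(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except when len(ctx) is a multiple of 4."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % VOCAB


def greedy_baseline(prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding with the target only, one token per step."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def greedy_speculative(prompt: List[int], n: int, block: int) -> Tuple[List[int], int]:
    """Draft `block` tokens, keep the prefix the target agrees with, then add one target token."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n:
        rounds += 1
        # 1) The drafter proposes `block` tokens autoregressively.
        draft: List[int] = []
        for _ in range(block):
            draft.append(draft_next(out + draft))
        # 2) The target checks each drafted position (a real system does this in one batched pass).
        accepted = 0
        for i, tok in enumerate(draft):
            if tok != target_next(out + draft[:i]):
                break
            accepted += 1
        out.extend(draft[:accepted])
        # 3) The target supplies the next token: a correction after a mismatch, a bonus otherwise.
        out.append(target_next(out))
    return out[len(prompt):][:n], rounds


prompt = [7, 3, 9]
base = greedy_baseline(prompt, 40)
spec, rounds = greedy_speculative(prompt, 40, block=3)
print("identical to baseline:", spec == base)
print(f"{len(spec) / rounds:.2f} tokens/round over {rounds} rounds")
```

Because every emitted token is one the target itself would have chosen, this toy version reproduces the baseline output exactly; the only thing the drafter changes is how many tokens each round yields.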
Reminder: mlx_vlm.generate has no --top-p / --top-k; pass them via --gen-kwargs.
Measured speedup (Apple M-series, 96 GB, batch 1, greedy temp 0)
Contrary to the common expectation that batch-1 speculative decoding is neutral or slower, this official
drafter has high acceptance on Gemma 4 12B and delivers a real speedup. --draft-block-size 3 is the sweet spot: 1.70x on the short prompt and 1.36x on the long one.
Short factual prompt (capital of France + two landmarks, 32 generated tokens):
| draft-block-size | gen tok/s | speedup | accepted tok/round | output == base |
|---|---|---|---|---|
| baseline (no drafter) | 44.02 | 1.00x | n/a | n/a |
| 2 | 63.86 | 1.45x | 1.94 | yes |
| 3 | 74.71 | 1.70x | 2.91 | yes |
| 4 | 72.28 | 1.64x | 3.67 | yes |
| 6 | 68.00 | 1.55x | 3.78 | yes |
| 8 | 57.59 | 1.31x | 3.78 | yes |
Long generative prompt (3 paragraphs, 400 generated tokens):
| draft-block-size | gen tok/s | speedup | accepted tok/round | output == base |
|---|---|---|---|---|
| baseline (no drafter) | 42.41 | 1.00x | n/a | n/a |
| 2 | 54.50 | 1.29x | 1.67 | diverges* |
| 3 | 57.68 | 1.36x | 2.16 | diverges* |
| 4 | 49.47 | 1.17x | 2.33 | diverges* |
| 6 | 49.70 | 1.17x | 2.33 | diverges* |
| 8 | 49.50 | 1.17x | 2.33 | diverges* |
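The speedup column is simply the drafter's throughput divided by the baseline's. The short sketch below recomputes it from the tok/s figures in the two tables and picks the fastest block size for each prompt; the dictionary layout and function names are illustrative only, and the recomputed ratios can differ from the table in the last digit because the published throughputs are already rounded.

```python
# Generated tokens/s, copied from the two tables above.
BASELINE = {"short": 44.02, "long": 42.41}
TOK_PER_S = {
    "short": {2: 63.86, 3: 74.71, 4: 72.28, 6: 68.00, 8: 57.59},
    "long": {2: 54.50, 3: 57.68, 4: 49.47, 6: 49.70, 8: 49.50},
}


def speedups(prompt_kind: str) -> dict:
    """Speedup per draft-block-size = drafter tok/s divided by baseline tok/s."""
    base = BASELINE[prompt_kind]
    return {block: tps / base for block, tps in TOK_PER_S[prompt_kind].items()}


def best_block_size(prompt_kind: str) -> int:
    """Block size with the highest measured speedup for the given prompt."""
    ratios = speedups(prompt_kind)
    return max(ratios, key=ratios.get)


for kind in ("short", "long"):
    ratios = speedups(kind)
    summary = ", ".join(f"{block}: {ratio:.2f}x" for block, ratio in ratios.items())
    print(f"{kind:5s} best={best_block_size(kind)}  ({summary})")
```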
Temperature-0 output equivalence: the honest version
Speculative decoding is exact in theory (the target verifies every drafted token), but on this hardware it is floating-point nondeterministic in practice:
- Short outputs are byte-identical to running the base model alone, at every block size.
- Long outputs diverge. Baseline and drafter share a long common prefix, then split into different but equally coherent and on-topic continuations (marked diverges* above).
Cause (not a bug): with the drafter enabled, the target verifies several drafted tokens in one batched forward pass, and Metal matmuls at that batch/sequence shape differ at the floating-point level from the one-token-at-a-time baseline. At a near-tie greedy logit the argmax flips and the sequences part ways. Short outputs hit no near-tie and stay identical; long ones eventually do. The acceptance stats (2–3 accepted tokens/round) confirm the drafter is doing real work. For bit-identical-to-baseline output, run without the drafter; for speed, use it.
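The mechanism can be reproduced in a few lines of plain Python. This is only an analogy, assuming that the batched and single-token kernels accumulate the same products in different orders; it uses 64-bit floats and made-up numbers rather than real Metal matmuls or real logits.

```python
def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=values.__getitem__)


# Three contributions to one token's logit (made-up numbers).
terms = [0.1, 0.2, 0.3]

# Same terms, two accumulation orders. Floating-point addition is not associative,
# so the two sums differ in the last bit.
single_token_path = (terms[0] + terms[1]) + terms[2]
batched_path = terms[0] + (terms[1] + terms[2])

# A competing token whose logit is nearly tied with the sum above.
rival_logit = 0.6

pick_single = argmax([rival_logit, single_token_path])
pick_batched = argmax([rival_logit, batched_path])

print("sums equal:  ", single_token_path == batched_path)
print("greedy picks:", pick_single, "vs", pick_batched)
```

Once the two paths pick different tokens at one step, every later step conditions on a different context, which is why the outputs never re-converge.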
License & attribution
Apache-2.0, inheriting the base. This is an MLX format conversion of
google/gemma-4-12B-it-assistant (© Google, Apache-2.0); weights and architecture are Google's.