Gemma4 MTPLX Optimized Speed
This is an MTPLX pair bundle for Gemma 4 31B speculative decoding on Apple Silicon.
It is not a single vanilla Transformers model directory. The repository contains two MLX-format artifacts:
- target/ - Gemma 4 31B IT target, MLX Q4 affine group-size 64
- assistant/ - official Gemma 4 31B assistant drafter, MLX Q6 affine group-size 64
Use this pair when absolute throughput is the priority.
Source
- Target source: google/gemma-4-31B-it
- Target revision: 145dc2508c480a64b47242f160d286cff94a2343
- Assistant source: google/gemma-4-31B-it-assistant
- Assistant revision: cffbbd2cea41ea56a0fa5b0487e0d445121fd204
Both artifacts were converted locally to MLX format.
Quantization
Target:
bits: 4
group_size: 64
mode: affine
Assistant:
bits: 6
group_size: 64
mode: affine
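For readers unfamiliar with the settings above, here is a minimal pure-Python sketch of group-wise affine quantization. It is an illustration of the general technique, not the conversion code used for this bundle. The helper names (`quantize_group`, `dequantize_group`, `bits_per_weight`) are hypothetical, and the storage estimate assumes one 16-bit scale and one 16-bit bias per group, which may not match the files exactly.

```python
def quantize_group(values, bits):
    """Affine-quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # Guard against a constant group, where hi == lo would give a zero scale.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo  # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + bias for c in codes]


def bits_per_weight(bits, group_size, meta_bits=32):
    """Effective storage per weight.

    Assumption: each group carries meta_bits of metadata
    (here a 16-bit scale plus a 16-bit bias).
    """
    return bits + meta_bits / group_size


if __name__ == "__main__":
    group = [i / 63 for i in range(64)]  # one group of 64 weights
    codes, scale, bias = quantize_group(group, bits=4)
    restored = dequantize_group(codes, scale, bias)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(f"max round-trip error: {worst:.4f} (half a step is {scale / 2:.4f})")
    print(f"target    ~{bits_per_weight(4, 64)} bits/weight")
    print(f"assistant ~{bits_per_weight(6, 64)} bits/weight")
```

Under that metadata assumption, group size 64 adds about half a bit per weight on top of the nominal 4 or 6 bits.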
MTPLX Usage
After downloading this repository, point MTPLX at the two subdirectories:
mtplx bench gemma-mtp \
--target-model ./target \
--assistant-model ./assistant \
--prompt-suite mtplx/benchmarks/prompts/flappy.jsonl \
--max-tokens 1000 \
--draft-block-sizes 6 \
--allow-unverified-gemma
The Gemma 4 assistant is a separate drafter model. MTPLX uses exact speculative sampling with target verification and residual correction.
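The accept-or-resample rule behind exact speculative sampling can be sketched for a single drafted token as follows. This is a textbook-style illustration, not MTPLX's implementation: `speculative_step` is a hypothetical helper, and the distributions are plain Python lists over a toy vocabulary.

```python
import random


def speculative_step(p_target, q_draft, draft_token, rng):
    """Verify one drafted token against the target distribution.

    p_target, q_draft: probability lists over the same vocabulary.
    draft_token: index sampled from q_draft by the drafter.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Target verification: accept with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True

    # Residual correction: resample from max(0, p - q), renormalized.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)
    if total <= 0.0:
        # Only reachable through rounding when p and q coincide.
        residual, total = list(p_target), sum(p_target)

    threshold = rng.random() * total
    running = 0.0
    for token, weight in enumerate(residual):
        running += weight
        if threshold < running:
            return token, False
    # Floating-point fallback: last token with nonzero residual mass.
    return max(i for i, w in enumerate(residual) if w > 0), False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]  # toy target distribution
    q = [0.3, 0.5, 0.2]  # toy drafter distribution
    counts = [0, 0, 0]
    for _ in range(20000):
        drafted = rng.choices(range(3), weights=q)[0]
        token, _ = speculative_step(p, q, drafted, rng)
        counts[token] += 1
    print([round(c / 20000, 3) for c in counts])  # should track p closely
```

The point of the residual step is that the emitted tokens follow the target distribution regardless of how good the drafter is; a better drafter only raises the acceptance rate, and with it the speed.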
Local Benchmark
Prompt: single-file HTML5 Canvas Flappy Bird game, capped at 1000 generated tokens.
Sampler:
temperature: 1.0
top_p: 0.95
top_k: 64
seed: 0
Best observed block size:
block_size: 6
acceptance: 830 / 846 = 98.11%
Observed MTPLX throughput samples:
43.56 tok/s
44.46 tok/s
44.07 tok/s
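The figures above can be summarized with a few lines of Python. This sketch only restates the reported numbers; it assumes that 830 / 846 means accepted draft tokens over drafted tokens, and the variable names are illustrative.

```python
from statistics import mean

accepted, drafted = 830, 846           # assumed: accepted / drafted tokens
samples_tok_s = [43.56, 44.46, 44.07]  # throughput samples listed above

acceptance_rate = accepted / drafted
mean_tok_s = mean(samples_tok_s)
spread_tok_s = max(samples_tok_s) - min(samples_tok_s)

print(f"acceptance: {acceptance_rate:.2%}")           # 98.11%
print(f"mean throughput: {mean_tok_s:.2f} tok/s")     # 44.03 tok/s
print(f"range across runs: {spread_tok_s:.2f} tok/s") # 0.90 tok/s
```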
The bundled benchmark JSON files are in benchmarks/.
Notes
This release is optimized for MTPLX speed experiments. For a higher-precision target, use Youssofal/Gemma4-MTPLX-Optimized-Quality.
Gemma 4 is released by Google under the Gemma 4 license terms linked above.