
Gemma 4 12B IT MTP Assistants for ik_llama

These are converted GGUF assistant/draft models for using Gemma 4 12B IT with ik_llama MTP speculative decoding.

They are not standalone chat models. Pass one of these files as --model-draft alongside the matching Gemma 4 12B IT target GGUF.

Files

  • gemma-4-12B-it-MTP-ik_llama-Q8_0.gguf
  • gemma-4-12B-it-MTP-ik_llama-Q4_K_M.gguf

Conversion Notes

Source assistant GGUF:

unsloth/gemma-4-12b-it-GGUF, MTP/gemma-4-12B-it-MTP-F16.gguf

The public assistant's architecture string and tensor names were converted to ik_llama's gemma4_mtp schema. The unused rope_freqs.weight tensor from the public assistant was omitted because ik_llama's Gemma 4 MTP assistant loader expects 48 tensors for this assistant.
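One way to sanity-check the tensor count before loading is to read it straight from the GGUF header. This is a minimal sketch, not part of the conversion itself: it assumes a little-endian host and the GGUF v2+ header layout (4-byte magic, 4-byte version, then an 8-byte tensor count), and gguf_tensor_count is a hypothetical helper name.

```shell
# Hypothetical helper: print the tensor count stored in a GGUF header.
# Assumes a little-endian host and the GGUF v2+ layout:
#   bytes 0-3  magic "GGUF"
#   bytes 4-7  version (uint32)
#   bytes 8-15 tensor count (uint64)
gguf_tensor_count() {
  # Refuse files that do not start with the GGUF magic.
  [ "$(head -c 4 "$1")" = "GGUF" ] || { echo "not a GGUF file: $1" >&2; return 1; }
  # Skip magic + version (8 bytes), read one unsigned 64-bit integer.
  od -An -t u8 -j 8 -N 8 "$1" | tr -d ' \n'
}

# Usage (path is a placeholder):
#   gguf_tensor_count /path/to/gemma-4-12B-it-MTP-ik_llama-Q4_K_M.gguf
```

If the converted file matches what the loader expects, the printed value should be 48.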

Example

llama-server \
  -m /path/to/gemma-4-12b-it-IQ4_XS.gguf \
  --model-draft /path/to/gemma-4-12B-it-MTP-ik_llama-Q4_K_M.gguf \
  --spec-type mtp:n_max=4,p_min=0.0

Older ik_llama builds may use legacy speculative flags. Use a build that includes Gemma 4 12B MTP/CUDA support.
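Once the server is running, a quick request confirms that the target and draft load together. The sketch below rests on assumptions not stated above: that the server listens on 127.0.0.1:8080 and exposes a /completion endpoint accepting prompt and n_predict, as upstream llama-server does. build_payload and send_completion are hypothetical helper names.

```shell
# Hypothetical helper: build a minimal JSON body for a completion request.
# $1 = prompt (no quotes or backslashes in this sketch), $2 = tokens to generate.
build_payload() {
  printf '{"prompt": "%s", "n_predict": %d}' "$1" "$2"
}

# Hypothetical helper: POST the payload to the server.
# HOST, PORT and the /completion path are assumptions; adjust to your build.
send_completion() {
  curl -s "http://${HOST:-127.0.0.1}:${PORT:-8080}/completion" \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$1" "$2")"
}

# Usage:
#   send_completion "Write one sentence about speculative decoding." 64
```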

Smoke Test

Local smoke test on an RTX 4070 with ik_llama build 4561 (6b9de3dba):

  • Target: Gemma 4 12B IT IQ4_XS
  • Draft: Q4_K_M
  • Raw completion token generation (TG): about 129 tok/s with MTP vs about 60 tok/s without it
  • Draft acceptance rate in this small smoke test: about 0.55
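
The two throughput figures above imply a speedup of roughly 2.15x. The short sketch below just does that division, so you can plug in your own measurements; speedup is a hypothetical helper name, and both inputs here are the approximate values from this smoke test.

```shell
# Hypothetical helper: ratio of MTP throughput to plain throughput.
# $1 = tok/s with MTP, $2 = tok/s without MTP.
speedup() {
  awk -v mtp="$1" -v plain="$2" 'BEGIN { printf "%.2f\n", mtp / plain }'
}

speedup 129 60
```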

Last updated: Jun 14