How to use with the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the drafter GGUF from the Hub and load it
llm = Llama.from_pretrained(
	repo_id="HackAfterDark/gemma-4-e4b-it-mtp-assistant-ultralight",
	filename="gemma-4-e4b-it-mtp-assistant-ultralight.f16.gguf",
)

# Quick standalone completion to confirm the file loads and runs
output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Gemma-4-Assistant-Drafter-MTP-UltraLight-GGUF

This is a custom-converted Multi-Token Prediction (MTP) drafter for Google Gemma 4 7.5B. It has been surgically optimized to provide massive speedups (40-60+ t/s) on consumer-grade hardware like the NVIDIA RTX 3060.

🚀 The "Ultra-Light" Optimization

Unlike standard conversions that pad the assistant to match the main model's 7.5B footprint, this version uses a Bridge-Slicing Strategy:

  • Native Embedding: Shrunk to 256 (down from 2560).
  • Compute Tax: Math operations reduced by ~90% per speculative cycle.
  • Footprint: Only ~815MB VRAM, allowing it to fit alongside the main model on 12GB cards.

Despite the heavy pruning, this drafter maintains a high acceptance rate (0.70 - 1.00) on common tasks, effectively doubling or tripling generation speed.
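
As a quick sanity check on those numbers (a back-of-envelope sketch only; the ~407M parameter count is taken from the Technical Details section below):

# Back-of-envelope check of the "Ultra-Light" figures (illustrative only)
params = 407e6            # effective drafter parameters (see Technical Details)
bytes_per_param = 2       # F16 = 2 bytes per weight
print(f"~{params * bytes_per_param / 1e6:.0f} MB of weights")  # ~814 MB, matching the ~815MB figure

main_width, draft_width = 2560, 256
print(f"Drafter width is {draft_width / main_width:.0%} of the main model's")  # 10%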


πŸ› οΈ How to Replicate the GGUF Conversion

If you want to modify the architecture further or audit the process, use the following uv command. This ensures the environment has the latest gguf-py logic required for MTP.

Prerequisites: Install uv.

uv run \
  --with torch \
  --with sentencepiece \
  --with transformers \
  --with safetensors \
  --with numpy \
  --with "gguf @ git+https://github.com/ggml-org/llama.cpp.git#subdirectory=gguf-py" \
  python ./convert_hf_to_gguf_latest.py \
  ./path/to/gemma-4-assistant-hf/ \
  --outfile ./gemma-4-e4b-it-mtp-assistant-ultralight.f16.gguf \
  --outtype f16
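
After the conversion finishes, you can sanity-check the output with gguf-py's GGUFReader (a minimal sketch; run it in the same uv environment so the gguf package from the llama.cpp repo is available):

# Inspect the converted file: metadata keys and tensor shapes
from gguf import GGUFReader

reader = GGUFReader("./gemma-4-e4b-it-mtp-assistant-ultralight.f16.gguf")

# Metadata written by the converter (architecture, embedding length, etc.)
for key in reader.fields:
    print(key)

# Drafter tensors and their (sliced) shapes
for tensor in reader.tensors:
    print(tensor.name, tensor.shape)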

πŸ› οΈ Usage with llama-server

This was tested on a Dell Precision T7910 (yup, it's old, I know) with dual Xeons, 64GB of system RAM, and an RTX 3060 with 12GB of VRAM. If you are using an RTX 3060, you can start with the example below and tune it from there.

Note: If you are on a multi-socket system, taskset is highly recommended to eliminate NUMA latency (e.g. taskset -c 0-11 ./llama-server ...).

./llama-server \
  --model ./gemma-4-E4B-it-Q4_K_M.gguf \
  --model-draft ./gemma-4-e4b-it-mtp-assistant-ultralight.f16.gguf \
  -c 32768 \
  -np 1 \
  -ngl 43 \
  --swa-full \
  --kv-unified \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --spec-draft-n-max 8 \
  -b 128 -ub 128 \
  -t 12 -tb 12 \
  --flash-attn on \
  --mlock \
  --no-numa \
  --jinja

Key Parameters Explained:

  • --swa-full: Mandatory for Gemma 4 to prevent KV cache re-processing loops.
  • --spec-draft-n-max 8: 8 to 12 seems to be the sweet spot for this drafter. It allows a long enough stride without tanking the acceptance rate.
  • --kv-unified: Allows the main model and drafter to share prompt context memory.
  • -b 128 / -ub 128: Keeps the batch size tight for lower latency during the "verify" phase.
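
To verify the setup on your own hardware, you can query the running server's native /completion endpoint and read back its timings object (a minimal sketch; the exact timing fields vary between llama.cpp builds):

# Quick throughput check against a running llama-server instance
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",           # default llama-server port
    json={"prompt": "Once upon a time,", "n_predict": 256},
)
data = resp.json()

# llama-server reports prompt/generation speeds here; draft statistics,
# if present, depend on the build
print(data.get("timings", {}))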

📊 Benchmarks (RTX 3060 / Dual Xeon E5-2600)

  • Prefill Speed: ~1600+ t/s
  • Generation Speed: 48-55+ t/s (Context dependent)
  • Draft Acceptance: 66% - 100%

💡 Note on Performance: On the RTX 3060, standard llama.cpp (without MTP) is currently faster due to memory bandwidth saturation. This MTP Drafter is primarily intended for 3090/4090 users or researchers looking to exploit the Multi-Token Prediction architecture.

Using llama.cpp (Without MTP)

  • Prefill Speed: ~2033+ t/s
  • Generation Speed: 55-57+ t/s

🧠 The "Speculation Tax" vs. Memory Bandwidth

Standard inference (without MTP) actually hits a higher peak tokens-per-second (57 t/s) than the MTP-enabled setup (49 t/s) on the RTX 3060. Here is why (a back-of-envelope calculation follows the list):

  • The Memory Bandwidth Wall: The RTX 3060 has a 192-bit memory bus (360 GB/s). At ~57 t/s, the main 7.5B model is already saturating that bus. Adding an MTP drafter, even an "Ultra-Light" one, adds a "context switching" tax. The GPU must now juggle the weights and KV caches of two models instead of one.
  • The Latency Floor: On mid-range hardware, the time saved by "skipping" tokens via speculation is sometimes canceled out by the overhead of the verification handshake between the models.
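
A rough version of that bandwidth math (illustrative numbers; the ~4.5 GB weight footprint for a Q4_K_M 7.5B model is an estimate):

# Back-of-envelope: every generated token streams the full weight set from VRAM
weights_gb = 4.5        # approx. Q4_K_M footprint of the 7.5B main model (estimate)
tokens_per_s = 57       # peak non-MTP generation speed on the RTX 3060
bus_gb_per_s = 360      # RTX 3060 memory bandwidth

used = weights_gb * tokens_per_s
print(f"~{used:.0f} GB/s of {bus_gb_per_s} GB/s ({used / bus_gb_per_s:.0%} of the bus)")
# ~256 GB/s before counting KV-cache traffic or the drafter -> little headroom left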

🎯 Why use this MTP Drafter then?

  1. Scaling to Higher-End GPUs: On cards like the RTX 3090/4090, the memory bandwidth is 2.5x–3x wider. On those cards, the "Draft Tax" is negligible, and MTP will likely push speeds past 140+ t/s, far exceeding raw inference.
  2. Predictive Drafting for Long Context: As conversations grow toward 128k, MTP can help maintain a consistent "rhythm" of generation by guessing long strings of common syntax or prose in a single pass.
  3. Research & Edge Cases: This drafter is a "scout" for complex reasoning. Even if the raw t/s is lower, a high acceptance rate (0.70+) indicates the drafter is successfully modeling the main model's "thought process," which is critical for agentic workflows and advanced MTP research.

πŸ—οΈ Technical Details

The conversion was performed using a modified hf-to-gguf script that implements a resilient mapper for the MTP bridge. It slices the 2560-width hidden state of the main model down to 256 for the assistant while maintaining the 4-layer MTP head structure.

  • Architecture: Gemma4 (MTP-enabled)
  • Precision: F16
  • Layers: 4 (Assistant)
  • Params: ~407M (Effective)
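
For intuition, the shape change the mapper performs looks roughly like this (a conceptual PyTorch sketch, not the actual conversion script; a simple prefix slice is shown purely for illustration):

# Conceptual sketch of the bridge slice (illustrative, not the conversion code)
import torch

MAIN_WIDTH = 2560    # hidden size of the main Gemma 4 model
DRAFT_WIDTH = 256    # "Ultra-Light" drafter width

def slice_bridge(hidden_state: torch.Tensor) -> torch.Tensor:
    """Reduce the main model's hidden state to the drafter's width (here: a prefix slice)."""
    return hidden_state[..., :DRAFT_WIDTH]

main_hidden = torch.randn(1, 8, MAIN_WIDTH)   # [batch, tokens, width]
draft_input = slice_bridge(main_hidden)
print(draft_input.shape)                      # torch.Size([1, 8, 256])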


🌌 Connect & Support

If you find this "Ultra-Light" drafter useful, feel free to reach out or tag me in your benchmarks!
