How to use with the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the drafter GGUF from the Hub and load it
llm = Llama.from_pretrained(
	repo_id="HackAfterDark/gemma-4-e4b-it-mtp-assistant-ultralight",
	filename="gemma-4-e4b-it-mtp-assistant-ultralight.f16.gguf",
)

# Quick standalone completion to confirm the file loads and runs
output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Gemma-4-Assistant-Drafter-MTP-UltraLight-GGUF

This is a custom-converted Multi-Token Prediction (MTP) drafter for Google Gemma 4 7.5B. It has been surgically optimized to provide massive speedups (40-60+ t/s) on consumer-grade hardware like the NVIDIA RTX 3060.

🚀 The "Ultra-Light" Optimization

Unlike standard conversions that pad the assistant to match the main model's 7.5B footprint, this version uses a Bridge-Slicing Strategy:

  • Native Embedding: Shrunk to 256 (down from 2560).
  • Compute Tax: Math operations reduced by ~90% per speculative cycle.
  • Footprint: Only ~815MB VRAM, allowing it to fit alongside the main model on 12GB cards.

Despite the heavy pruning, this drafter maintains a high acceptance rate (0.70 - 1.00) on common tasks, effectively doubling or tripling generation speed.
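
As a quick sanity check on those numbers (a back-of-envelope sketch only; the ~407M parameter count is taken from the Technical Details section below):

# Back-of-envelope check of the "Ultra-Light" figures (illustrative only)
params = 407e6            # effective drafter parameters (see Technical Details)
bytes_per_param = 2       # F16 = 2 bytes per weight
print(f"~{params * bytes_per_param / 1e6:.0f} MB of weights")  # ~814 MB, matching the ~815MB figure

main_width, draft_width = 2560, 256
print(f"Drafter width is {draft_width / main_width:.0%} of the main model's")  # 10%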


πŸ› οΈ How to Replicate the GGUF Conversion

If you want to modify the architecture further or audit the process, use the following uv command. This ensures the environment has the latest gguf-py logic required for MTP.

Prerequisites: Install uv.

uv run \
  --with torch \
  --with sentencepiece \
  --with transformers \
  --with safetensors \
  --with numpy \
  --with "gguf @ git+https://github.com/ggml-org/llama.cpp.git#subdirectory=gguf-py" \
  python ./convert_hf_to_gguf_latest.py \
  ./path/to/gemma-4-assistant-hf/ \
  --outfile ./gemma-4-e4b-it-mtp-assistant-ultralight.f16.gguf \
  --outtype f16
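
After the conversion finishes, you can sanity-check the output with gguf-py's GGUFReader (a minimal sketch; run it in the same uv environment so the gguf package from the llama.cpp repo is available):

# Inspect the converted file: metadata keys and tensor shapes
from gguf import GGUFReader

reader = GGUFReader("./gemma-4-e4b-it-mtp-assistant-ultralight.f16.gguf")

# Metadata written by the converter (architecture, embedding length, etc.)
for key in reader.fields:
    print(key)

# Drafter tensors and their (sliced) shapes
for tensor in reader.tensors:
    print(tensor.name, tensor.shape)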

πŸ› οΈ Usage with llama-server

This was tested on a Dell Precision T7910 (yup, it's old, I know) with dual Xeons, 64GB of system RAM, and an RTX 3060 with 12GB of VRAM. If you are using an RTX 3060, you can start with the example below and tune it from there.

Note: If you are on a multi-socket system, taskset is highly recommended to eliminate NUMA latency (e.g. taskset -c 0-11 ./llama-server ...).

./llama-server \
  --model ./gemma-4-E4B-it-Q4_K_M.gguf \
  --model-draft ./gemma-4-e4b-it-mtp-assistant-ultralight.f16.gguf \
  -c 32768 \
  -np 1 \
  -ngl 43 \
  --swa-full \
  --kv-unified \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --spec-draft-n-max 8 \
  -b 128 -ub 128 \
  -t 12 -tb 12 \
  --flash-attn on \
  --mlock \
  --no-numa \
  --jinja

Key Parameters Explained:

  • --swa-full: Mandatory for Gemma 4 to prevent KV cache re-processing loops.
  • --spec-draft-n-max 8: 8 to 12 seems to be the sweet spot for this drafter. It allows a long enough stride without tanking the acceptance rate.
  • --kv-unified: Allows the main model and drafter to share prompt context memory.
  • -b 128 / -ub 128: Keeps the batch size tight for lower latency during the "verify" phase.
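
To verify the setup on your own hardware, you can query the running server's native /completion endpoint and read back its timings object (a minimal sketch; the exact timing fields vary between llama.cpp builds):

# Quick throughput check against a running llama-server instance
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",           # default llama-server port
    json={"prompt": "Once upon a time,", "n_predict": 256},
)
data = resp.json()

# llama-server reports prompt/generation speeds here; draft statistics,
# if present, depend on the build
print(data.get("timings", {}))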

📊 Benchmarks (RTX 3060 / Dual Xeon E5-2600)

  • Prefill Speed: ~1600+ t/s
  • Generation Speed: 48-55+ t/s (Context dependent)
  • Draft Acceptance: 66% - 100%

💡 Note on Performance: On the RTX 3060, standard llama.cpp (without MTP) is currently faster due to memory bandwidth saturation. This MTP Drafter is primarily intended for 3090/4090 users or researchers looking to exploit the Multi-Token Prediction architecture.

Using llama.cpp (Without MTP)

  • Prefill Speed: ~2033+ t/s
  • Generation Speed: 55-57+ t/s

🧠 The "Speculation Tax" vs. Memory Bandwidth

Standard inference (without MTP) actually hits a higher peak tokens-per-second (57 t/s) than the MTP-enabled setup (49 t/s) on the RTX 3060. Here is why (a back-of-envelope calculation follows the list):

  • The Memory Bandwidth Wall: The RTX 3060 has a 192-bit memory bus (360 GB/s). At ~57 t/s, the main 7.5B model is already saturating that bus. Adding an MTP drafter, even an "Ultra-Light" one, adds a "context switching" tax. The GPU must now juggle the weights and KV caches of two models instead of one.
  • The Latency Floor: On mid-range hardware, the time saved by "skipping" tokens via speculation is sometimes canceled out by the overhead of the verification handshake between the models.
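
A rough version of that bandwidth math (illustrative numbers; the ~4.5 GB weight footprint for a Q4_K_M 7.5B model is an estimate):

# Back-of-envelope: every generated token streams the full weight set from VRAM
weights_gb = 4.5        # approx. Q4_K_M footprint of the 7.5B main model (estimate)
tokens_per_s = 57       # peak non-MTP generation speed on the RTX 3060
bus_gb_per_s = 360      # RTX 3060 memory bandwidth

used = weights_gb * tokens_per_s
print(f"~{used:.0f} GB/s of {bus_gb_per_s} GB/s ({used / bus_gb_per_s:.0%} of the bus)")
# ~256 GB/s before counting KV-cache traffic or the drafter -> little headroom left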

🎯 Why use this MTP Drafter then?

  1. Scaling to Higher-End GPUs: On cards like the RTX 3090/4090, the memory bandwidth is 2.5x–3x wider. On those cards, the "Draft Tax" is negligible, and MTP will likely push speeds past 140+ t/s, far exceeding raw inference.
  2. Predictive Drafting for Long Context: As conversations grow toward 128k, MTP can help maintain a consistent "rhythm" of generation by guessing long strings of common syntax or prose in a single pass.
  3. Research & Edge Cases: This drafter is a "scout" for complex reasoning. Even if the raw t/s is lower, a high acceptance rate (0.70+) indicates the drafter is successfully modeling the main model's "thought process," which is critical for agentic workflows and advanced MTP research.

πŸ—οΈ Technical Details

The conversion was performed using a modified hf-to-gguf script that implements a resilient mapper for the MTP bridge. It slices the 2560-width hidden state of the main model down to 256 for the assistant while maintaining the 4-layer MTP head structure.

  • Architecture: Gemma4 (MTP-enabled)
  • Precision: F16
  • Layers: 4 (Assistant)
  • Params: ~407M (Effective)
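
For intuition, the shape change the mapper performs looks roughly like this (a conceptual PyTorch sketch, not the actual conversion script; a simple prefix slice is shown purely for illustration):

# Conceptual sketch of the bridge slice (illustrative, not the conversion code)
import torch

MAIN_WIDTH = 2560    # hidden size of the main Gemma 4 model
DRAFT_WIDTH = 256    # "Ultra-Light" drafter width

def slice_bridge(hidden_state: torch.Tensor) -> torch.Tensor:
    """Reduce the main model's hidden state to the drafter's width (here: a prefix slice)."""
    return hidden_state[..., :DRAFT_WIDTH]

main_hidden = torch.randn(1, 8, MAIN_WIDTH)   # [batch, tokens, width]
draft_input = slice_bridge(main_hidden)
print(draft_input.shape)                      # torch.Size([1, 8, 256])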


🌌 Connect & Support

If you find this "Ultra-Light" drafter useful, feel free to reach out or tag me in your benchmarks!
