igorls committed on
Commit fffbaeb · verified · 1 Parent(s): cf0525d

Add MTP acceleration recipe with measured speedup

Files changed (1)
  1. README.md +51 -1
README.md CHANGED
@@ -122,7 +122,57 @@ This confirms the architectural insight from prior research: safety alignment do
  - **Text-only.** No vision input. No audio input. The encoders are gone. Passing image or audio tokens will produce undefined behavior.
  - **Same context window as base** (128k tokens).
  - **Same tokenizer.** The vocab includes vision/audio special tokens (`<image>`, `<audio>`, etc.) for compatibility with the official tokenizer; these tokens won't activate any modality processing in this variant.
- - **No MTP drafter support on Ollama yet.** The official `google/gemma-4-E4B-it-assistant` MTP drafter works with Transformers and vLLM but not with Ollama on Linux/CUDA as of May 2026 (upstream llama.cpp doesn't recognize the `Gemma4AssistantForCausalLM` architecture). For MTP-accelerated inference, use Transformers or vLLM directly with this model as the target.
+ - **No MTP drafter support on Ollama yet.** Upstream llama.cpp doesn't recognize the `Gemma4AssistantForCausalLM` architecture as of May 2026, so Ollama on Linux/CUDA can't pair this target with the official MTP drafter. For MTP-accelerated inference, use Transformers or vLLM directly; see the [MTP acceleration](#mtp-acceleration) section below.
+
+ ## MTP acceleration
+
+ The official MTP drafter [`google/gemma-4-E4B-it-assistant`](https://huggingface.co/google/gemma-4-E4B-it-assistant) (78M params, activation-aware) pairs cleanly with this stripped target. Output is lossless: with deterministic (greedy) decoding it is byte-identical to the target's own output. Latencies measured on an RTX 3090 via HF Transformers:
+
+ | Prompt shape | Tokens generated | Baseline latency | With MTP drafter | Speedup |
+ |---|---:|---:|---:|---:|
+ | MCQ single letter | 5 | 394 ms | 363 ms | 1.09x |
+ | Open Q one-word | 5 | 395 ms | 249 ms | 1.59x |
+ | Slug classification | 5 | 462 ms | 224 ms | 2.07x |
+ | JSON entity list (128 tok) | 128 | 12291 ms | 6712 ms | 1.83x |
+ | JSON memories (114 tok) | 114 | 8425 ms | **2771 ms** | **3.04x** |
+
+ Speedup tracks output predictability: structured JSON outputs (the most common MemPalace surface) range from 1.83x to 3.04x, while a single-letter MCQ answer gains only 1.09x.
+
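As a quick sanity check on the table, the speedup column is the baseline latency divided by the MTP-assisted latency. The sketch below recomputes it from the rounded millisecond values shown; the dictionary and function names are illustrative and not part of the repository. Because the inputs are rounded, the last digit can differ slightly from the table (the slug row comes out nearer 2.06x), presumably because the published figures were computed from unrounded timings.

```python
# Latencies in milliseconds copied from the table: (baseline, with MTP drafter).
latencies_ms = {
    "MCQ single letter": (394, 363),
    "Open Q one-word": (395, 249),
    "Slug classification": (462, 224),
    "JSON entity list": (12291, 6712),
    "JSON memories": (8425, 2771),
}


def speedup(baseline_ms: float, mtp_ms: float) -> float:
    """Speedup is the ratio of baseline latency to MTP-assisted latency."""
    return baseline_ms / mtp_ms


speedups = {name: speedup(b, m) for name, (b, m) in latencies_ms.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```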
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ target = AutoModelForCausalLM.from_pretrained(  # stripped text-only target
+     "igorls/gemma4-e4b-classifier",
+     dtype=torch.bfloat16,
+     device_map="cuda",
+ )
+ drafter = AutoModelForCausalLM.from_pretrained(  # official MTP drafter
+     "google/gemma-4-E4B-it-assistant",
+     dtype=torch.bfloat16,
+     device_map="cuda",
+ )
+ tok = AutoTokenizer.from_pretrained("igorls/gemma4-e4b-classifier")
+
+ messages = [{"role": "user", "content": "Classify into one word (indoor, outdoor): The kids are playing in the backyard."}]
+ chat = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ ids = tok(chat, return_tensors="pt", add_special_tokens=False).input_ids.to("cuda")  # template already added special tokens
+
+ out = target.generate(
+     ids,
+     assistant_model=drafter,  # enables assisted (speculative) decoding
+     max_new_tokens=20,
+     do_sample=False,  # greedy decoding, so output matches the baseline
+ )
+ print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))  # decode only the new tokens
+ ```
+
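The lossless claim can be spot-checked by generating each prompt twice with greedy decoding, once with and once without the drafter, and comparing the decoded strings. The helper below is a minimal sketch of that comparison; `find_mismatches` and the two callables are hypothetical names, and the stand-in generators exist only so the snippet runs without a GPU. In real use, each callable would wrap `target.generate(..., do_sample=False)` (one of them passing `assistant_model=drafter`) followed by `tok.decode`.

```python
from typing import Callable, Iterable, List


def find_mismatches(
    prompts: Iterable[str],
    generate_baseline: Callable[[str], str],
    generate_with_drafter: Callable[[str], str],
) -> List[str]:
    """Return the prompts whose baseline and drafter-assisted outputs differ."""
    return [p for p in prompts if generate_baseline(p) != generate_with_drafter(p)]


# Stand-in generators (assumption: replace with real model-backed callables).
canned = {
    "Classify: The kids are playing in the backyard.": "outdoor",
    "Classify: She is reading on the sofa.": "indoor",
}


def baseline(prompt: str) -> str:
    return canned[prompt]


def assisted(prompt: str) -> str:
    return canned[prompt]


mismatches = find_mismatches(canned, baseline, assisted)
print(f"{len(mismatches)} mismatching prompt(s)")
```

An empty result on a representative prompt set is consistent with lossless decoding; any non-empty result points at the exact prompts worth inspecting.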
+ For production throughput, use vLLM, which implements the drafter's centroid-masking optimization (sparse `lm_head` over ~4K candidates instead of the ~262K vocab, ~45x reduction):
+
+ ```bash
+ vllm serve igorls/gemma4-e4b-classifier \
+     --speculative-config '{"model": "google/gemma-4-E4B-it-assistant", "num_speculative_tokens": 4}'
+ ```
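Once the server is up, requests go through vLLM's OpenAI-compatible HTTP API, and speculative decoding is transparent to the client. The sketch below uses only the standard library and assumes the server is listening on vLLM's default `localhost:8000`; `build_payload` and `complete` are illustrative helper names, not part of vLLM or this repository.

```python
import json
import urllib.request

# Assumption: `vllm serve` is running locally on its default port.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "igorls/gemma4-e4b-classifier"


def build_payload(user_text: str, max_tokens: int = 20) -> dict:
    """Build a greedy (temperature 0) chat-completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def complete(user_text: str, url: str = VLLM_URL) -> str:
    """POST the request and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires the server to be running):
# print(complete("Classify into one word (indoor, outdoor): The kids are playing in the backyard."))
```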
 
  ## License