igorls committed · Commit f966939 · verified · Parent(s): 34f7c9e

Update MTP recipe: HTTP server example + vLLM caveat

Files changed (1): README.md (+16, −1)
README.md CHANGED
@@ -167,13 +167,28 @@ out = target.generate(
 print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
 ```

-For production throughput, use vLLM it implements the drafter's centroid-masking optimization (sparse lm_head over ~4K candidates instead of ~262K vocab, ~45x reduction):
+For a self-hosted OpenAI-compatible HTTP endpoint, wrap the pair in a small FastAPI server that holds both models resident and exposes `/v1/chat/completions`. Example: [`scripts/08_mtp_server.py`](scripts/08_mtp_server.py) in the source repo, callable as:
+
+```bash
+curl http://localhost:8765/v1/chat/completions -d '{
+  "model": "igorls/gemma4-e4b-classifier",
+  "messages": [{"role":"user","content":"What is the capital of France?"}],
+  "max_tokens": 16,
+  "use_mtp": true
+}'
+```
+
+### vLLM (future)
+
+vLLM is the right inference stack for production throughput — it implements the drafter's centroid-masking optimization (sparse lm_head over ~4K candidates instead of ~262K vocab, ~45x reduction in lm_head compute):

 ```bash
 vllm serve igorls/gemma4-e4b-classifier \
   --speculative-config '{"model": "google/gemma-4-E4B-it-assistant", "num_speculative_tokens": 4}'
 ```

+**However**, as of May 2026 (vLLM 0.20.2, latest on PyPI), this fails: the drafter's `Gemma4AssistantConfig` is not yet registered in vLLM's `AutoModel` mapping. The vLLM Gemma 4 recipes page documents the feature but it's ahead of the released version. Track [vllm-project/vllm](https://github.com/vllm-project/vllm/) for the release that lands `Gemma4Assistant` support; once available, the command above should work as-is against this model.
+
 ## License

 Inherited from the base model: [Gemma Terms of Use](https://ai.google.dev/gemma/terms). By using this model you agree to those terms.
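
The new README text points to [`scripts/08_mtp_server.py`](scripts/08_mtp_server.py) without inlining it. Below is a minimal sketch of the kind of wrapper it describes, not the actual script: the handler shape, greedy decoding, and the assumption that the drafter plugs into transformers' assisted generation via `generate(assistant_model=...)` are all illustrative. Only the two model IDs, port 8765, and the `use_mtp` flag come from the README.

```python
# Minimal sketch of an OpenAI-compatible MTP server (NOT the actual
# scripts/08_mtp_server.py). Assumes the drafter works as a transformers
# `assistant_model`; the real script may run its own speculative loop.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "igorls/gemma4-e4b-classifier"
DRAFTER_ID = "google/gemma-4-E4B-it-assistant"

# Both models stay resident for the lifetime of the process.
tok = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID, torch_dtype=torch.bfloat16, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(
    DRAFTER_ID, torch_dtype=torch.bfloat16, device_map="auto")

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: list[Message]
    max_tokens: int = 64
    use_mtp: bool = False  # mirrors the curl example's extra field

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    ids = tok.apply_chat_template(
        [m.model_dump() for m in req.messages],
        add_generation_prompt=True, return_tensors="pt").to(target.device)
    out = target.generate(
        ids, max_new_tokens=req.max_tokens, do_sample=False,
        assistant_model=drafter if req.use_mtp else None)
    text = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    return {"object": "chat.completion", "model": req.model,
            "choices": [{"index": 0, "finish_reason": "stop",
                         "message": {"role": "assistant", "content": text}}]}

# Run with: uvicorn mtp_server:app --port 8765
```

Note that `use_mtp` is a nonstandard extension of the OpenAI schema; standard clients simply omit it and get the non-speculative path.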
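The centroid-masking claim in the new "vLLM (future)" text (a sparse lm_head over ~4K candidates instead of the ~262K-entry vocab) can also be illustrated. The sketch below is not vLLM's implementation: how the candidate set is chosen (the "centroids") is assumed precomputed, and `candidate_ids` is a placeholder for it. The point is only that the drafter scores K rows of the lm_head instead of the full vocabulary.

```python
# Sketch of the sparse-lm_head idea behind centroid masking (not vLLM code).
# The candidate set is assumed to come from a precomputed vocab clustering;
# here it is a random placeholder.
import torch

VOCAB, HIDDEN, K = 262_144, 2048, 4_096  # HIDDEN is an arbitrary example size

lm_head = torch.nn.Linear(HIDDEN, VOCAB, bias=False)
candidate_ids = torch.randperm(VOCAB)[:K]  # placeholder "centroid" candidates

@torch.no_grad()
def sparse_logits(hidden: torch.Tensor) -> torch.Tensor:
    """Score only the K candidate rows; mask everything else to -inf."""
    sub_w = lm_head.weight[candidate_ids]       # (K, HIDDEN) gather
    sub_logits = hidden @ sub_w.T               # (B, K) matmul, not (B, VOCAB)
    logits = hidden.new_full((hidden.shape[0], VOCAB), float("-inf"))
    logits[:, candidate_ids] = sub_logits       # scatter back for sampling
    return logits

h = torch.randn(2, HIDDEN)
assert sparse_logits(h).argmax(-1)[0].item() in candidate_ids.tolist()
```

The exact reduction factor depends on the per-step candidate count; the README quotes ~45x for the lm_head matmul.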