# Gemma 4 31B MTP-on-vs-off bench handler
A custom handler for Hugging Face Inference Endpoints that loads `google/gemma-4-31B-it` as the target model and `google/gemma-4-31B-it-assistant` as the drafter, then toggles speculative decoding per request via the `use_mtp` parameter.
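A minimal sketch of what such a handler might look like, assuming the standard custom-handler contract for Inference Endpoints (an `EndpointHandler` class with `__init__(self, path)` and `__call__(self, data)`) and `transformers`' assisted generation via the `assistant_model` argument to `generate`; dtype, device placement, and the exact drafter repo wiring are assumptions, not confirmed details of this repo:

```python
import time


class EndpointHandler:
    """Sketch: load target + drafter once, flip speculative decoding
    per request via the `use_mtp` request parameter."""

    def __init__(self, path=""):
        # Heavy imports kept inside __init__ so the module can be
        # inspected without torch/transformers installed.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(
            path, torch_dtype=torch.bfloat16, device_map="auto"
        )
        # Drafter repo name taken from the description above.
        self.assistant = AutoModelForCausalLM.from_pretrained(
            "google/gemma-4-31B-it-assistant",
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

    def __call__(self, data):
        prompt = data["inputs"]
        params = dict(data.get("parameters", {}))
        use_mtp = params.pop("use_mtp", False)

        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        gen_kwargs = dict(params)
        if use_mtp:
            # Assisted (speculative) generation: the drafter proposes
            # tokens, the target model verifies them.
            gen_kwargs["assistant_model"] = self.assistant

        start = time.perf_counter()
        output = self.model.generate(**inputs, **gen_kwargs)
        elapsed = time.perf_counter() - start

        prompt_len = inputs["input_ids"].shape[-1]
        new_tokens = int(output.shape[-1] - prompt_len)
        return {
            "generated_text": self.tokenizer.decode(
                output[0, prompt_len:], skip_special_tokens=True
            ),
            "elapsed_seconds": round(elapsed, 3),
            "generated_tokens": new_tokens,
            "tokens_per_second": round(new_tokens / elapsed, 2) if elapsed else 0.0,
        }
```

Keeping both models resident and switching only the `assistant_model` kwarg means the two benchmark arms share weights, tokenizer, and hardware, so the timing difference isolates the speculative-decoding effect.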
## Request schema
```json
{
  "inputs": "your prompt",
  "parameters": {
    "max_new_tokens": 300,
    "use_mtp": true,
    "do_sample": true,
    "temperature": 0.7
  }
}
```
## Response
Each call returns timing fields (`elapsed_seconds`, `tokens_per_second`, `generated_tokens`) so you can benchmark MTP-on vs. MTP-off on identical hardware and weights.
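Given one response from each arm, the MTP speedup is just the ratio of the reported throughputs. A minimal sketch using the response field names above (the numbers are illustrative, not measured results):

```python
def mtp_speedup(resp_on, resp_off):
    """Decode-throughput ratio: speculative decoding on vs. off."""
    return resp_on["tokens_per_second"] / resp_off["tokens_per_second"]


# Illustrative example responses, not real benchmark data.
on = {"elapsed_seconds": 4.1, "tokens_per_second": 73.2, "generated_tokens": 300}
off = {"elapsed_seconds": 7.0, "tokens_per_second": 42.9, "generated_tokens": 300}
print(f"speedup: {mtp_speedup(on, off):.2f}x")
```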