Ornith-1.0-9B MTP head
An MTP (Multi-Token-Prediction) speculative-decode head for
deepreinforce-ai/Ornith-1.0-9B.
Ornith-1.0-9B shipped without the mtp.* tensors its Qwen3.5-9B base carries, so it serves
with no native speculative speedup. This is that head, re-aligned to Ornith's hidden states.
Merge it into Ornith-1.0-9B (one command, below) and serve with vLLM's mtp method for a
lossless +49–57% single-stream decode speedup. Lossless by construction: the base model
verifies every drafted token, so the output distribution is unchanged; the head only buys
throughput.
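Why verification leaves the output distribution intact can be checked numerically. The sketch below assumes the standard speculative-sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(p - q, 0). The source does not show vLLM's implementation, and the three-token distributions are made up for illustration.

```python
def output_distribution(target, draft):
    """Exact distribution of the emitted token for one speculative step.

    target: the base model's next-token probabilities p
    draft:  the MTP head's next-token probabilities q
    """
    # Token x is drafted with prob q(x) and accepted with prob min(1, p(x)/q(x)),
    # so it is emitted via the accept path with probability min(p(x), q(x)).
    accepted = [min(p, q) for p, q in zip(target, draft)]
    reject_prob = 1.0 - sum(accepted)

    # On rejection the token is resampled from the normalized residual max(p - q, 0).
    residual = [max(p - q, 0.0) for p, q in zip(target, draft)]
    residual_mass = sum(residual)
    if residual_mass == 0.0:  # draft equals target: nothing is ever rejected
        return accepted

    return [a + reject_prob * r / residual_mass for a, r in zip(accepted, residual)]


# Illustrative (made-up) distributions over a three-token vocabulary.
target = [0.5, 0.3, 0.2]
poor_draft = [0.9, 0.05, 0.05]

emitted = output_distribution(target, poor_draft)
assert all(abs(e - p) < 1e-12 for e, p in zip(emitted, target))
```

Even a badly mismatched draft reproduces the target distribution exactly under this rule; a worse draft only lowers how often the drafted token is kept.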
Results
Measured on a single RTX PRO 6000 Blackwell, vLLM 0.22.1, num_speculative_tokens=1.
Acceptance is reported on two prompt distributions (coding, and a WildBench/ToolACE-style
corpus), sampled at T=0.7.
| Head | Accept (coding) | Accept (corpus) | tok/s | Notes |
|---|---|---|---|---|
| none (plain Ornith-9B) | – | – | ~75 | no MTP |
| graft (Qwen head, zero training) | 0.763 | 0.742 | ~117 | free, reproducible below |
| this head (KL-distilled) | 0.765 | 0.762 | ~121 | best |
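The link between acceptance and tok/s in the table can be made explicit. With num_speculative_tokens=1, each verification step emits one token if the draft is rejected and two if it is accepted, so the expected yield is 1 + α tokens per step. The sketch below applies that to the coding-column acceptance figures. Treating the result as a throughput ceiling assumes the head itself costs nothing to run, which is an idealization, and the function names are illustrative.

```python
def expected_tokens_per_step(acceptance):
    # One drafted token per step: 2 tokens if accepted, 1 if rejected.
    return 1.0 + acceptance


def ceiling_tok_s(baseline_tok_s, acceptance):
    # Idealized upper bound: ignores the cost of running the MTP head.
    return baseline_tok_s * expected_tokens_per_step(acceptance)


BASELINE = 75.0  # plain Ornith-9B, tok/s (approximate, from the table)
for name, accept, measured in [("graft", 0.763, 117.0), ("this head", 0.765, 121.0)]:
    ceiling = ceiling_tok_s(BASELINE, accept)
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s, measured ~{measured:.0f} tok/s")
```

Both measured figures sit below the roughly 132 tok/s ceiling, which is what one would expect once the head's own compute and scheduling overhead are counted.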
Two findings came out of building it:
- The graft is nearly free. Copying Qwen3.5-9B's MTP head onto Ornith verbatim already gives ~0.74–0.76 acceptance: the fine-tune is light enough that the base head transfers. You can reproduce that head in one command (no download needed); see below.
- The training objective is what matters. Re-distilling the head with hard cross-entropy on sampled tokens regressed acceptance (it sharpens the argmax but miscalibrates the distribution). MTP acceptance is rejection sampling against the target, which rewards a draft distribution that matches the target, so this head is trained with KL divergence to the target's own next-token distribution. Same data, same schedule; only the loss changed.
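The second finding can be illustrated with a toy example. Under the standard speculative-sampling rule, the probability that a drafted token is accepted is the sum over the vocabulary of min(p(x), q(x)), which depends on the whole draft distribution and not just its argmax. The sketch below uses made-up three-token distributions and assumes the forward direction KL(target || draft); the source does not state which direction distill.py uses.

```python
import math


def acceptance_rate(target, draft):
    # Probability the drafted token survives verification: sum_x min(p(x), q(x)).
    return sum(min(p, q) for p, q in zip(target, draft))


def kl_divergence(target, draft):
    # Forward KL(target || draft); terms with p(x) = 0 contribute nothing.
    return sum(p * math.log(p / q) for p, q in zip(target, draft) if p > 0.0)


target     = [0.50, 0.30, 0.20]   # base model's next-token distribution
calibrated = [0.45, 0.35, 0.20]   # draft that tracks the target's shape
sharpened  = [0.90, 0.05, 0.05]   # same argmax, but overconfident

for name, draft in [("calibrated", calibrated), ("sharpened", sharpened)]:
    print(f"{name}: accept={acceptance_rate(target, draft):.2f} "
          f"KL={kl_divergence(target, draft):.3f}")
```

Both drafts pick the same top token, yet in this toy case the overconfident one is accepted 60% of the time against 95% for the calibrated one, and its KL to the target is much larger. That is the sense in which a loss that only sharpens the argmax can hurt acceptance.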
Use
# 1. merge this head into Ornith-1.0-9B (verbatim tensor copy; ~0.5 GB head, base untouched)
hf download protoLabsAI/Ornith-1.0-9B-MTP --local-dir ./ornith-mtp-head
python recipe/graft.py \
--donor ./ornith-mtp-head \
--target deepreinforce-ai/Ornith-1.0-9B \
--out ./Ornith-1.0-9B-MTP
# 2. serve with vLLM's native MTP method
vllm serve ./Ornith-1.0-9B-MTP \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
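Once the server is up, it can be queried like any other vLLM deployment; speculative decoding is invisible to the client. A minimal sketch, assuming vLLM's OpenAI-compatible server on its default address http://localhost:8000 and that the model is registered under the same path passed to vllm serve (no --served-model-name override). The prompt and max_tokens value are placeholders.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # assumed: vLLM's default host and port
MODEL = "./Ornith-1.0-9B-MTP"        # assumed: same string given to `vllm serve`


def build_request(prompt, max_tokens=128, temperature=0.7):
    # Build a POST to the OpenAI-compatible completions endpoint.
    payload = {"model": MODEL, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    # Sends the request; requires the server from step 2 to be running.
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["text"]


# Example call (needs a live server):
# print(complete("def fibonacci(n):"))
```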
Reproduce the zero-training graft (no download)
python recipe/graft.py --donor Qwen/Qwen3.5-9B \
--target deepreinforce-ai/Ornith-1.0-9B --out ./Ornith-1.0-9B-MTP-graft
Recipe
The full, donor-agnostic toolkit is in recipe/. It retargets any Qwen3.5
fine-tune: graft.py (transplant the head), gen_corpus.py (self-distillation: the target's
own generations, no external data), distill.py (loss: kl, freeze base / train only the
head), eval_head.py (offline acceptance proxy), validate.sh. The recipe is the product.
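To make the transplant step concrete, here is a rough sketch of what a verbatim head graft amounts to: copy every tensor under the mtp. prefix from the donor checkpoint into the target's state dict and leave everything else alone. This is not the code in graft.py, which the text does not show; graft_mtp_head is a hypothetical helper, and plain dicts with string values stand in for real checkpoints and tensors.

```python
def graft_mtp_head(donor_state, target_state, prefix="mtp."):
    """Return a copy of target_state with the donor's MTP tensors added."""
    head = {k: v for k, v in donor_state.items() if k.startswith(prefix)}
    if not head:
        raise ValueError(f"donor has no tensors under prefix {prefix!r}")

    merged = dict(target_state)   # base weights are carried over untouched
    merged.update(head)           # head tensors are copied verbatim
    return merged


# Toy stand-ins: strings play the role of tensors; key names are illustrative.
donor = {"model.embed.weight": "qwen-embed", "mtp.fc.weight": "qwen-mtp-fc"}
target = {"model.embed.weight": "ornith-embed"}

merged = graft_mtp_head(donor, target)
assert merged == {"model.embed.weight": "ornith-embed", "mtp.fc.weight": "qwen-mtp-fc"}
```

A real graft would presumably also need to confirm that tensor shapes and the model config agree between donor and target before serving.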
Provenance & license
- Base model: deepreinforce-ai/Ornith-1.0-9B (MIT). This head is a derivative; merging it produces a derivative of Ornith-1.0-9B, so its MIT terms carry.
- The head was initialized from Qwen/Qwen3.5-9B's mtp.* tensors, then re-trained on Ornith-9B's own generations.
- Released under MIT by protoLabs.studio. Open core: free to fork, no paywall on the weights or the recipe.