Ornith-1.0-9B MTP head

An MTP (Multi-Token-Prediction) speculative-decode head for deepreinforce-ai/Ornith-1.0-9B. Ornith-1.0-9B shipped without the mtp.* tensors its Qwen3.5-9B base carries, so it serves with no native speculative speedup. This is that head, re-aligned to Ornith's hidden states.

Merge it into Ornith-1.0-9B (one command, below) and serve with vLLM's mtp method for a lossless +49–57% single-stream decode speedup. Lossless by construction: the base model verifies every drafted token, so the output distribution is unchanged; the head only buys throughput.

Results

Measured on a single RTX PRO 6000 Blackwell, vLLM 0.22.1, num_speculative_tokens=1. Acceptance is reported on two prompt distributions (coding, and WildBench/ToolACE-style), sampled at T=0.7.

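| Head                             | Accept (coding) | Accept (corpus) | tok/s | Notes                    |
| -------------------------------- | --------------- | --------------- | ----- | ------------------------ |
| none (plain Ornith-9B)           | -               | -               | ~75   | no MTP                   |
| graft (Qwen head, zero training) | 0.763           | 0.742           | ~117  | free, reproducible below |
| this head (KL-distilled)         | 0.765           | 0.762           | ~121  | best                     |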
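To relate the acceptance and tok/s columns, here is a minimal sketch of the arithmetic, assuming the standard one-draft-token accounting: each verification step emits one token, plus the drafted token when it is accepted. The function names are illustrative and not part of the recipe. The tok/s figures are the rounded values from the table, so the ratios are approximate.

```python
# Figures copied from the results table (tok/s values are rounded).
BASELINE_TOK_S = 75.0
HEADS = {
    "graft": {"accept_coding": 0.763, "accept_corpus": 0.742, "tok_s": 117.0},
    "kl_distilled": {"accept_coding": 0.765, "accept_corpus": 0.762, "tok_s": 121.0},
}


def ideal_tokens_per_step(accept: float) -> float:
    """Expected tokens emitted per target forward pass with ONE drafted token.

    Assumption: a step always yields one token, and yields a second one
    when the drafted token is accepted, so the expectation is 1 + accept.
    """
    return 1.0 + accept


def measured_speedup(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Throughput ratio against plain Ornith-9B without MTP."""
    return tok_s / baseline


for name, row in HEADS.items():
    ceiling = ideal_tokens_per_step(row["accept_coding"])
    speedup = measured_speedup(row["tok_s"])
    print(f"{name}: ceiling {ceiling:.3f}x, measured {speedup:.2f}x")
```

The ceiling ignores the cost of running the head and of verification, which is presumably why the measured ratio sits below it.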

Two findings came out of building it:

  1. The graft is nearly free. Copying Qwen3.5-9B's MTP head onto Ornith verbatim already gives ~0.74–0.76 acceptance: the fine-tune is light enough that the base head transfers. You can reproduce that head in one command (no download needed); see below.
  2. The training objective is what matters. Re-distilling the head with hard cross-entropy on sampled tokens regressed acceptance (it sharpens the argmax but miscalibrates the distribution). MTP acceptance is rejection sampling against the target, which rewards a draft distribution that matches the target, so this head is trained with KL divergence to the target's own next-token distribution. Same data, same schedule; only the loss changed.
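The second finding can be illustrated with a toy sketch of the textbook speculative-sampling rule: accept a drafted token x with probability min(1, p[x]/q[x]), and on rejection resample from the normalized residual max(p - q, 0). This is an assumption about the general technique, not code taken from vLLM or from recipe/; the three-token distributions are made up for illustration.

```python
def acceptance_rate(p, q):
    """Probability that a token drafted from q is accepted against target p.

    Summing q[x] * min(1, p[x] / q[x]) over x gives sum_x min(p[x], q[x]).
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def output_distribution(p, q):
    """Distribution of the emitted token under accept-or-resample.

    Accepted mass is min(p, q); rejected mass is redistributed over the
    normalized residual max(p - q, 0).
    """
    accept = acceptance_rate(p, q)
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    out = []
    for pi, qi, ri in zip(p, q, residual):
        accepted_mass = min(pi, qi)
        rejected_mass = (1.0 - accept) * (ri / total) if total > 0 else 0.0
        out.append(accepted_mass + rejected_mass)
    return out


target = [0.5, 0.3, 0.2]       # toy target next-token distribution
matched = [0.5, 0.3, 0.2]      # draft that matches the target
sharpened = [0.9, 0.05, 0.05]  # same argmax, but overconfident

print(acceptance_rate(target, matched))
print(acceptance_rate(target, sharpened))
print(output_distribution(target, sharpened))
```

In this toy case the sharpened draft keeps the right argmax yet is accepted less often than the matched one, while the emitted distribution equals the target either way. That is the sense in which the head affects throughput only.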

Use

# 1. merge this head into Ornith-1.0-9B (verbatim tensor copy; ~0.5 GB head, base untouched)
hf download protoLabsAI/Ornith-1.0-9B-MTP --local-dir ./ornith-mtp-head
python recipe/graft.py \
  --donor ./ornith-mtp-head \
  --target deepreinforce-ai/Ornith-1.0-9B \
  --out ./Ornith-1.0-9B-MTP

# 2. serve with vLLM's native MTP method
vllm serve ./Ornith-1.0-9B-MTP \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
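If the server is launched from a script rather than a shell, building the --speculative-config value with json.dumps avoids quoting mistakes. A minimal sketch, where vllm_serve_argv is a hypothetical helper; it only reproduces the command shown above and does not start a server.

```python
import json
import shlex


def vllm_serve_argv(model_dir: str, num_speculative_tokens: int = 1) -> list[str]:
    """Build the argv for the serve command shown above (hypothetical helper)."""
    spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
    return ["vllm", "serve", model_dir, "--speculative-config", json.dumps(spec)]


argv = vllm_serve_argv("./Ornith-1.0-9B-MTP")
# shlex.join quotes the JSON argument so it can be pasted into a shell.
print(shlex.join(argv))
```

The list can be passed as-is to subprocess.run, which sidesteps shell quoting entirely.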

Reproduce the zero-training graft (no download)

python recipe/graft.py --donor Qwen/Qwen3.5-9B \
  --target deepreinforce-ai/Ornith-1.0-9B --out ./Ornith-1.0-9B-MTP-graft
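Conceptually, a graft of this kind copies every tensor under the mtp. prefix from the donor checkpoint into the target's weights and leaves everything else alone. The sketch below shows that idea on plain dictionaries. It is not the contents of recipe/graft.py, and the tensor names and string values are hypothetical stand-ins for real weights.

```python
def graft_mtp_tensors(donor: dict, target: dict, prefix: str = "mtp.") -> dict:
    """Return a copy of `target` with the donor's `prefix` tensors added.

    Neither input is modified; tensors outside the prefix come only
    from the target.
    """
    head = {name: t for name, t in donor.items() if name.startswith(prefix)}
    if not head:
        raise ValueError(f"donor has no tensors under {prefix!r}")
    merged = dict(target)
    merged.update(head)
    return merged


# Hypothetical tensor names; strings stand in for weight arrays.
donor = {
    "model.layers.0.weight": "donor-base",
    "mtp.fc.weight": "donor-head",
}
target = {"model.layers.0.weight": "ornith-base"}

merged = graft_mtp_tensors(donor, target)
print(sorted(merged))
```

Here the base weight stays the target's own and only the head entry comes from the donor, matching the "base untouched" note in the merge step.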

Recipe

The full, donor-agnostic toolkit is in recipe/. It retargets any Qwen3.5 fine-tune: graft.py (transplant the head), gen_corpus.py (self-distillation: the target's own generations, no external data), distill.py (loss: kl, freeze base / train only the head), eval_head.py (offline acceptance proxy), validate.sh. The recipe is the product.
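To make the kl loss option concrete, here is a toy comparison of a KL objective against hard cross-entropy on a single sampled token. It is a pure-Python sketch and not distill.py itself. The KL direction (target to draft) is an assumption, since the text does not state it, and the logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_loss(target_logits, draft_logits):
    """KL(target || draft) over the full next-token distribution (assumed direction)."""
    p = softmax(target_logits)
    q = softmax(draft_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hard_ce_loss(draft_logits, sampled_token):
    """Cross-entropy against one sampled token id, i.e. a one-hot label."""
    return -math.log(softmax(draft_logits)[sampled_token])


target_logits = [2.0, 1.0, 0.0]
matched = [2.0, 1.0, 0.0]    # draft equal to the target
sharpened = [8.0, 1.0, 0.0]  # same argmax, overconfident

print(kl_loss(target_logits, matched), kl_loss(target_logits, sharpened))
print(hard_ce_loss(matched, 0), hard_ce_loss(sharpened, 0))
```

On this example the KL loss is zero for the matched draft and positive for the sharpened one, whereas the one-hot loss on token 0 scores the sharpened draft as better. That matches the direction of the regression described under Results.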

Provenance & license

  • Base model: deepreinforce-ai/Ornith-1.0-9B (MIT). This head is a derivative; merging it produces a derivative of Ornith-1.0-9B, so its MIT terms carry.
  • The head was initialized from Qwen/Qwen3.5-9B's mtp.* tensors, then re-trained on Ornith-9B's own generations.
  • Released under MIT by protoLabs.studio. Open core: free to fork, no paywall on the weights or the recipe.