Ornith-1.0-9B MTP head

An MTP (Multi-Token-Prediction) speculative-decode head for deepreinforce-ai/Ornith-1.0-9B. Ornith-1.0-9B shipped without the mtp.* tensors its Qwen3.5-9B base carries, so it serves with no native speculative speedup. This is that head, re-aligned to Ornith's hidden states.

Merge it into Ornith-1.0-9B (one command, below) and serve with vLLM's mtp method for a lossless +49–57% single-stream decode speedup. Lossless by construction: the base model verifies every drafted token, so the output distribution is unchanged — the head only buys throughput.
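
The lossless claim follows from the standard speculative-sampling acceptance rule. The sketch below is a toy illustration in plain Python, not vLLM's implementation: `p` stands for the base model's next-token distribution, `q` for the head's draft distribution, and both are made-up numbers over a three-token vocabulary. It computes the exact distribution of the emitted token so it can be compared with `p`.

```python
# Toy check that draft-then-verify leaves the output distribution unchanged.
# p and q are illustrative values, not measurements from Ornith or the head.

def emitted_distribution(p, q):
    """Exact distribution of the token emitted by one speculative step.

    The draft proposes x ~ q. It is accepted with probability min(1, p[x]/q[x]);
    on rejection a token is resampled from max(p - q, 0), renormalized.
    """
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]  # q[x] * min(1, p[x]/q[x])
    reject_prob = 1.0 - sum(accept_mass)
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    norm = sum(residual)
    if norm == 0.0:  # q equals p, so nothing is ever rejected
        return accept_mass
    return [a + reject_prob * r / norm for a, r in zip(accept_mass, residual)]

p = [0.6, 0.3, 0.1]  # target (base model) distribution
q = [0.2, 0.5, 0.3]  # draft (MTP head) distribution
out = emitted_distribution(p, q)
print([round(v, 6) for v in out])
```

The emitted distribution matches `p` for any choice of `q`, which is why a weaker head costs throughput but never changes what the model says.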

Results

Measured on a single RTX PRO 6000 Blackwell, vLLM 0.22.1, num_speculative_tokens=1. Acceptance is reported on two prompt distributions (coding, and WildBench/ToolACE-style), sampled at T=0.7.

Head	Acceptance (coding)	Acceptance (WildBench/ToolACE-style corpus)	Decode tok/s	Notes
none (plain Ornith-9B)	—	—	~75	no MTP
graft (Qwen head, zero training)	0.763	0.742	~117	free, reproducible below
this head (KL-distilled)	0.765	0.762	~121	best
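
To relate acceptance to throughput: with one speculative token, each base-model forward pass emits the verified token plus the drafted one whenever it is accepted. The sketch below computes that expected count from the coding acceptance rates above. It assumes every drafted token is accepted independently with the same probability and ignores the cost of running the head, so it gives a ceiling rather than a prediction; `tokens_per_step` is a name chosen for this illustration.

```python
# Ceiling on decode speedup implied by an acceptance rate,
# ignoring the cost of the draft head and any scheduling overhead.

def tokens_per_step(acceptance, num_speculative_tokens=1):
    """Expected tokens emitted per base-model forward pass.

    Assumes each drafted token is accepted independently with the same
    probability, which is a simplification.
    """
    a, k = acceptance, num_speculative_tokens
    return sum(a ** i for i in range(k + 1))  # 1 + a + ... + a^k

for name, acc in [("graft", 0.763), ("this head", 0.765)]:
    print(f"{name}: at most {tokens_per_step(acc):.3f} tokens per step")
```

Measured gains sit below this ceiling, which is expected since the head's own forward pass and the verification bookkeeping are not free.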

Two findings came out of building it:

The graft is nearly free. Copying Qwen3.5-9B's MTP head onto Ornith verbatim already gives ~0.74–0.76 acceptance, which suggests the fine-tune is light enough that the base head transfers. You can reproduce that head in one command, without downloading this repository's head; see below.
The training objective is what matters. Re-distilling the head with hard cross-entropy on sampled tokens regressed acceptance (it sharpens the argmax but miscalibrates the distribution). MTP acceptance is rejection sampling against the target, which rewards a draft distribution that matches the target — so this head is trained with KL divergence to the target's own next-token distribution. Same data, same schedule; only the loss changed.
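
The second finding can be made concrete with a small calculation. Under rejection sampling, a token drafted from `q` is accepted against target `p` with total probability equal to the sum over tokens of `min(p, q)`. The sketch below evaluates that for two hypothetical drafts that share the target's argmax: one roughly calibrated, one overconfident. All three distributions are invented for illustration and are not taken from Ornith or this head.

```python
import math

def expected_acceptance(p, q):
    """Probability that a token drafted from q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def kl(p, q):
    """KL(p || q) in nats; terms with p[x] == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

target     = [0.50, 0.30, 0.20]  # illustrative target distribution
calibrated = [0.45, 0.33, 0.22]  # close to the target
sharpened  = [0.90, 0.06, 0.04]  # same argmax, overconfident

for name, q in [("calibrated", calibrated), ("sharpened", sharpened)]:
    print(name, round(expected_acceptance(target, q), 3), round(kl(target, q), 3))
```

In this toy case the calibrated draft is accepted 95% of the time and the sharpened one 60%, even though both pick the same top token. The draft with the smaller KL divergence is also the one with the higher acceptance here.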

Use

# 1. merge this head into Ornith-1.0-9B (verbatim tensor copy; ~0.5 GB head, base untouched)
hf download protoLabsAI/Ornith-1.0-9B-MTP --local-dir ./ornith-mtp-head
python recipe/graft.py \
  --donor ./ornith-mtp-head \
  --target deepreinforce-ai/Ornith-1.0-9B \
  --out ./Ornith-1.0-9B-MTP

# 2. serve with vLLM's native MTP method
vllm serve ./Ornith-1.0-9B-MTP \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
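
Once the server is up, clients need no changes, because speculative decoding happens entirely on the server side. As a minimal sketch, the standard-library client below posts to the OpenAI-compatible completions endpoint. It assumes vLLM's default port 8000 and that the served model name equals the path passed to `vllm serve`; adjust both if your deployment differs. The helper names `build_request` and `complete` are made up for this example.

```python
import json
import urllib.request

# Assumptions: vLLM's OpenAI-compatible server on its default port 8000, and
# a served model name equal to the path given to `vllm serve`.
BASE_URL = "http://localhost:8000/v1"
MODEL = "./Ornith-1.0-9B-MTP"

def build_request(prompt, max_tokens=128, temperature=0.7):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

With the server running, `complete("def fib(n):")` returns the generated continuation as a string.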

Reproduce the zero-training graft (no download)

python recipe/graft.py --donor Qwen/Qwen3.5-9B \
  --target deepreinforce-ai/Ornith-1.0-9B --out ./Ornith-1.0-9B-MTP-graft
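
Conceptually, a verbatim graft is a filtered copy: take every donor tensor whose name starts with `mtp.` and add it to the target's tensors, leaving the rest alone. The sketch below shows that idea with plain dicts standing in for checkpoints. It is not the code in recipe/graft.py; the function name and the tensor names other than the `mtp.` prefix are hypothetical, and any config changes the real script makes are not shown.

```python
# Plain dicts stand in for checkpoint state dicts (tensor name -> tensor).
# Real checkpoints would be read and written with a tensor library; the key
# names below are made up apart from the "mtp." prefix.

def graft_mtp(donor, target, prefix="mtp."):
    """Return a copy of `target` with the donor's `prefix` tensors added."""
    head = {name: t for name, t in donor.items() if name.startswith(prefix)}
    if not head:
        raise ValueError(f"donor has no tensors under {prefix!r}")
    merged = dict(target)  # base tensors are carried over unchanged
    merged.update(head)    # MTP tensors are copied verbatim from the donor
    return merged

donor = {"model.layers.0.weight": "qwen_w0", "mtp.fc.weight": "qwen_mtp_fc"}
target = {"model.layers.0.weight": "ornith_w0"}
merged = graft_mtp(donor, target)
print(sorted(merged))
```

The merged result keeps Ornith's own base tensor and gains only the donor's `mtp.` entry, which mirrors the "base untouched" property described above.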

Recipe

The full, donor-agnostic toolkit is in recipe/ — it retargets any Qwen3.5 fine-tune: graft.py (transplant the head), gen_corpus.py (self-distillation: the target's own generations, no external data), distill.py (loss: kl, freeze base / train only the head), eval_head.py (offline acceptance proxy), validate.sh. The recipe is the product.
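
For readers unfamiliar with soft-target distillation, the per-position loss can be sketched as follows. This is an illustration in plain Python rather than the code in distill.py: it assumes the forward direction KL(target || head), and the function names and logits are invented. In actual training the base would stay frozen and only the head's parameters would receive gradients, as described above.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distill_loss(target_logits, head_logits):
    """Forward KL(target || head) for one position, in nats."""
    p = softmax(target_logits)
    q = softmax(head_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

matched = kl_distill_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
overconfident = kl_distill_loss([2.0, 1.0, 0.0], [4.0, 1.0, 0.0])
print(matched, overconfident)
```

The loss is zero when the head reproduces the target's distribution and grows as the head becomes overconfident, so the full distribution supplies the training signal instead of a single sampled token.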

Provenance & license

Base model: deepreinforce-ai/Ornith-1.0-9B (MIT). This head is a derivative, and merging it produces a derivative of Ornith-1.0-9B, so its MIT terms carry over.
The head was initialized from Qwen/Qwen3.5-9B's mtp.* tensors, then re-trained on Ornith-9B's own generations.
Released under MIT by protoLabs.studio. Open core: free to fork, no paywall on the weights or the recipe.
