Ornith-1.0-9B-MTP-GGUF

GGUF builds of deepreinforce-ai/Ornith-1.0-9B with a Multi-Token-Prediction (MTP) speculative head grafted back in, quantized to Q4_K_M, for use with llama.cpp's native MTP speculative decoding (--spec-type draft-mtp).

Ornith-1.0-9B (a Qwen3.5-based, hybrid linear-attention model) ships without the mtp.* tensors its base carries, so on its own it gets no speculative speedup. These builds graft the head back, bundle it into the GGUF as the nextn layer, and let llama.cpp draft and verify tokens for a single-stream decode speedup. Lossless by construction: the base model verifies every drafted token, so the output distribution is unchanged and the head only buys throughput.
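The draft-and-verify idea can be illustrated with a toy sketch. It assumes a simplified exact-match acceptance rule and made-up word tokens. The function name verify_step and both token sequences are hypothetical, and this is not llama.cpp's actual code path. The point it shows is that every emitted token is one the target model chose itself, so a bad draft costs speed, not correctness.

```shell
# Toy draft-and-verify step (simplified exact-match rule, hypothetical tokens).
# usage: verify_step "<drafted tokens>" "<target tokens>"
# The target string holds the target model's own choice at each drafted
# position, plus one extra token that follows the last drafted position.
verify_step() {
  draft=$1
  target=$2
  out=""
  set -- $target                 # target tokens become $1, $2, ...
  for d in $draft; do
    t=$1
    shift
    if [ "$d" = "$t" ]; then
      out="$out $d"              # draft agrees with the target: accepted
    else
      out="$out $t"              # first disagreement: keep the target's token, stop
      echo "${out# }"
      return
    fi
  done
  out="$out $1"                  # all drafts accepted: add the target's bonus token
  echo "${out# }"
}

verify_step "return x" "return x +"   # prints: return x +
verify_step "return y" "return x +"   # prints: return x
```

In the second call the head drafted a wrong token, yet the output is still a prefix of what the target alone would have produced.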

Files

| File | Head | Notes |
|---|---|---|
| Ornith-1.0-9B-MTP-trained-Q4_K_M.gguf | KL-distilled (re-aligned to Ornith) | from protoLabsAI/Ornith-1.0-9B-MTP |
| Ornith-1.0-9B-MTP-graft-Q4_K_M.gguf | zero-training graft | Qwen3.5-9B's mtp.* copied verbatim onto Ornith |

Each file is self-contained (trunk + MTP head in one GGUF, ~5.4 GB). The MTP head is exported as block 32 (blk.32.nextn.{eh_proj,enorm,hnorm,shared_head_norm} + a full-attention decoder layer).

Usage (llama.cpp)

Requires a llama.cpp build with Qwen3.5 (qwen35) + MTP support (PR #22673 / recent master). The MTP head runs as a draft context, so pass the same file as both target and draft:

llama-server \
  -m Ornith-1.0-9B-MTP-trained-Q4_K_M.gguf \
  -md Ornith-1.0-9B-MTP-trained-Q4_K_M.gguf \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -ngl 99 -ngld 99 -c 4096

Tune --spec-draft-n-max (start at 2). draft-mtp is also available in llama-cli.
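One way to organize the tuning is to generate a command line per candidate draft length and try them one at a time. The sketch below only prints the commands and does not launch anything. The candidate values 1 through 4 and the helper name server_cmd are assumptions for illustration; every flag is taken from the command above.

```shell
# Print (do not run) one llama-server command line per draft length to try.
MODEL="Ornith-1.0-9B-MTP-trained-Q4_K_M.gguf"

server_cmd() {
  # $1 = value for --spec-draft-n-max
  echo "llama-server -m $MODEL -md $MODEL --spec-type draft-mtp --spec-draft-n-max $1 -ngl 99 -ngld 99 -c 4096"
}

# Candidate draft lengths are an arbitrary starting range, not a recommendation.
for n in 1 2 3 4; do
  server_cmd "$n"
done
```

Run each printed command in turn with the same prompt and compare decode speed and draft acceptance before settling on a value.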

Validation

Smoke-tested on the fresh build (b9827-0ed235ea2), RTX 3080 Laptop, Q4_K_M, --spec-draft-n-max 2, T=0.7, single short coding prompt:

| Variant | Draft acceptance | Mean accept length |
|---|---|---|
| trained | 0.81 (48/59) | 2.60 |
| graft | 0.83 (49/59) | 2.63 |

Both load as qwen35, engage draft-mtp, and produce coherent output. (This is a single sample, too small to rank the two heads; the acceptance rates are consistent with the source card's ~0.76. Throughput numbers will differ on other GPUs.)
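The acceptance figures can be recomputed from the raw counts. This sketch assumes the parenthesized pairs are accepted/drafted token counts, which is how they read here. The helper name accept_rate is hypothetical.

```shell
# Acceptance rate = accepted draft tokens / drafted tokens, to two decimals.
accept_rate() {
  awk -v accepted="$1" -v drafted="$2" 'BEGIN { printf "%.2f\n", accepted / drafted }'
}

accept_rate 48 59   # trained head, prints 0.81
accept_rate 49 59   # graft head,   prints 0.83
```

The two heads differ by a single accepted token out of 59, which is why one run cannot separate them.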

Provenance & license

  • Base model: deepreinforce-ai/Ornith-1.0-9B (MIT). These are derivatives; MIT terms carry.
  • Trained head: protoLabsAI/Ornith-1.0-9B-MTP (MIT). Graft head: initialized from Qwen/Qwen3.5-9B's mtp.* tensors.
  • Conversion: merged head into base (verbatim tensor copy), then llama.cpp convert_hf_to_gguf.py --outtype bf16, followed by llama-quantize to Q4_K_M.
  • These are text-only (the converter exports the text trunk; the vision tower is dropped).
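The conversion step in the list above might look roughly like the following two commands. This is a sketch under stated assumptions: the directory and file names are hypothetical, and the source does not say whether the converter needs extra options to export the MTP head. The function only prints the commands so they can be reviewed before running.

```shell
# Hypothetical paths; adjust to your own checkout and naming.
HF_DIR="./Ornith-1.0-9B-with-mtp"            # merged checkpoint (base + head), assumed name
BF16="Ornith-1.0-9B-MTP-bf16.gguf"           # intermediate BF16 GGUF, assumed name
OUT="Ornith-1.0-9B-MTP-trained-Q4_K_M.gguf"  # final quantized file

conversion_steps() {
  # Step 1: Hugging Face checkpoint to a BF16 GGUF.
  echo "python convert_hf_to_gguf.py $HF_DIR --outtype bf16 --outfile $BF16"
  # Step 2: quantize the BF16 GGUF to Q4_K_M.
  echo "llama-quantize $BF16 $OUT Q4_K_M"
}

conversion_steps
```

Keeping the BF16 intermediate around makes it cheap to produce other quantization types later without repeating the first step.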

Released under MIT.
