FoolDev Claude Opus 4.7 committed on
Commit 732c3be · 1 Parent(s): 693cf65

docs: correct MTP claim — not actually usable via llama.cpp / Ollama today

Investigation triggered by an attempt to document the consumer recipe
for llama.cpp's new MTP speculative decoding (PR #22673, merged
2026-05-16). Three findings, all corroborating:

1. `convert_hf_to_gguf.py` explicitly skips MTP tensors for the
qwen35 / qwen35moe arch family — comment in source: "MTP
tensors are not used at inference yet; align with Qwen3Next
behaviour".
2. `src/models/qwen35.cpp` and `qwen35moe.cpp` contain zero MTP /
nextn references — even if tensors were preserved, the loader
wouldn't read them.
3. `gguf.GGUFReader` on both this repo's bundled GGUF and the
source unsloth/Qwen3.6-27B-GGUF Q4_K_M: 851 tensors each, no
mtp/draft/eagle/spec entries, final tensor is
blk.63.post_attention_norm.weight. The MTP head from the
upstream safetensors was dropped during conversion.
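The check in finding 3 can be sketched as follows. This is a minimal illustration, not the exact script used for the investigation: the helper names `find_mtp_tensors` and `summarize_gguf` are hypothetical, and the pattern list (including `nextn`, borrowed from finding 2) is an assumption about what an MTP head would be called if it were present. It assumes the `gguf` Python package is installed for the file-reading part.

```python
import re
from typing import Iterable, List

# Name fragments that would suggest an MTP / draft head. This list mirrors the
# patterns named in the findings and is an assumption, not a llama.cpp convention.
MTP_PATTERN = re.compile(r"mtp|draft|eagle|spec|nextn", re.IGNORECASE)


def find_mtp_tensors(names: Iterable[str]) -> List[str]:
    """Return the tensor names that contain any MTP-like fragment."""
    return [n for n in names if MTP_PATTERN.search(n)]


def summarize_gguf(path: str) -> dict:
    """Read tensor names from a GGUF file and report MTP-like entries.

    The `gguf` package is imported lazily so that `find_mtp_tensors`
    stays usable on plain name lists without it.
    """
    from gguf import GGUFReader

    names = [t.name for t in GGUFReader(path).tensors]
    return {
        "tensor_count": len(names),
        "last_tensor": names[-1] if names else None,
        "mtp_like": find_mtp_tensors(names),
    }
```

Calling `summarize_gguf("Thanatos-27B.Q4_K_M.qwen35.gguf")` on a local copy of the bundle should, if the findings hold, report an empty `mtp_like` list. Substring matching is deliberately loose, so any hit still needs a manual look before it is treated as an MTP tensor.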

PR #22673's MTP support landed for other architectures, not qwen35.
The README's "Multi-token prediction (MTP) head trained for
speculative decoding" bullet, taken at face value, was misleading
users into thinking the speedup was available via llama.cpp.

Corrected bullet now distinguishes upstream-safetensors-via-vLLM/SGLang
(working) from llama.cpp / Ollama (not yet), points at the relevant
PR for tracking, and footnotes the empirical check. CHANGELOG picks
up a "Fixed" entry with the full evidence trail.
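For anyone re-checking finding 2 against a newer llama.cpp checkout, a plain text scan is enough. The sketch below is an assumption about how one might do it, not the procedure used here: `scan_text` and `scan_files` are hypothetical helper names, and the search is a simple case-insensitive match on `mtp` and `nextn`.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# Case-insensitive match on the two terms named in finding 2.
REFERENCE = re.compile(r"mtp|nextn", re.IGNORECASE)


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that mention MTP or nextn."""
    return [
        (i, line.strip())
        for i, line in enumerate(text.splitlines(), start=1)
        if REFERENCE.search(line)
    ]


def scan_files(paths: Iterable[str]) -> Dict[str, List[Tuple[int, str]]]:
    """Scan each existing file; paths that are not files are skipped."""
    results: Dict[str, List[Tuple[int, str]]] = {}
    for p in map(Path, paths):
        if p.is_file():
            text = p.read_text(encoding="utf-8", errors="replace")
            results[str(p)] = scan_text(text)
    return results
```

Run from the root of a llama.cpp checkout, `scan_files(["src/models/qwen35.cpp", "src/models/qwen35moe.cpp"])` would return an empty hit list per file while the finding still holds (the second path assumes both files live in `src/models/`). A substring match can also hit unrelated identifiers, so a non-empty result is a prompt to read the source, not proof of MTP support.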

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (2)
  1. CHANGELOG.md +28 -0
  2. README.md +14 -1
CHANGELOG.md CHANGED
@@ -7,6 +7,34 @@ and documentation**, not the underlying base model.
 
  ## [Unreleased]
 
+ ### Fixed
+ - README "Multi-token prediction (MTP)" bullet corrected. The
+ earlier wording — "MTP head trained for speculative decoding" —
+ was technically true about the upstream `Qwen/Qwen3.6-27B`
+ safetensors but misleading for the GGUF bundle this repo ships
+ and for llama.cpp / Ollama users in general:
+ - **GGUFs are stripped.** `convert_hf_to_gguf.py` explicitly
+ skips MTP tensors for the `qwen35` / `qwen35moe` arch family
+ ("MTP tensors are not used at inference yet; align with
+ Qwen3Next behaviour"). Confirmed empirically via
+ `gguf.GGUFReader` on both `Thanatos-27B.Q4_K_M.qwen35.gguf`
+ and the source `unsloth/Qwen3.6-27B-GGUF` Q4_K_M: both have
+ 851 tensors and zero entries matching `mtp.*` / `draft.*` /
+ `eagle.*` / `spec.*`. Last tensor in either is
+ `blk.63.post_attention_norm.weight` — the final layer norm,
+ no MTP head after it.
+ - **Loader doesn't support it.** `src/models/qwen35.cpp` and
+ `qwen35moe.cpp` contain no MTP / nextn references; even if
+ the tensors were in the GGUF, the loader wouldn't use them.
+ - **PR #22673's scope.** llama.cpp's MTP support (merged
+ 2026-05-16) was added for other architectures, not the
+ `qwen35` family. The README bullet now says so explicitly,
+ points to vLLM (`qwen3_next_mtp`) and SGLang
+ (`--speculative-algo NEXTN`) as the working consumer
+ recipes against the safetensors, and notes that we're
+ tracking the follow-up that would extend MTP to qwen35 /
+ qwen35moe.
+
  ### Changed (3rd round trip — qwen35 → qwen36, user-directed despite audit)
  - **Bundle re-stamped `general.architecture: 'qwen35'` → `'qwen36'`**
  in `hf upload` commit `973d7ef` (HF). Third stamp flip on the
README.md CHANGED
@@ -182,7 +182,20 @@ If you want the safetensors for `transformers`, fetch them from [`Qwen/Qwen3.6-2
  `mmproj` projector (not redistributed here; pull `mmproj-F16.gguf`
  from `unsloth/Qwen3.6-27B-GGUF`). See [Vision](#vision) below for
  current loader compatibility.
- - Multi-token prediction (MTP) head trained for speculative decoding
+ - Multi-token prediction (MTP) head trained for speculative decoding
+ present in the upstream `Qwen/Qwen3.6-27B` safetensors and usable via
+ vLLM (`qwen3_next_mtp`) or SGLang (`--speculative-algo NEXTN`).
+ **Not usable via llama.cpp / Ollama today**: the GGUF converter
+ (`convert_hf_to_gguf.py`) explicitly skips MTP tensors for the
+ `qwen35` / `qwen35moe` arch family ("MTP tensors are not used at
+ inference yet"), so the bundled GGUF and the unsloth GGUFs ship with
+ 851 tensors and no MTP head. llama.cpp's MTP support (PR #22673,
+ merged 2026-05-16) currently covers other architectures only;
+ tracking that PR's follow-up work for when qwen35 / qwen35moe
+ consumer support lands. (Earlier README versions claimed MTP was
+ available without this caveat — confirmed empirically via
+ `gguf.GGUFReader` on both this bundle and `unsloth/Qwen3.6-27B-GGUF`,
+ 2026-05-19.)
 
  ### Stamp choice