Qwen3.6-27B-mtp (affine-8)

MLX conversion of Qwen/Qwen3.6-27B, affine 8-bit (group_size 64), with the native Multi-Token-Prediction (MTP) head embedded in the main weight shards for native speculative decoding.

What changed from the previous packaging

This release replaces the previous mxfp8 packaging, which produced a ~29% slowdown when --mtp was enabled. There were two compounding causes:

1. mxfp8 backbone: the speculative verify pass runs 2 tokens through the full backbone every step, and mxfp8's per-call dequantization overhead does not amortize across that path on Apple Silicon. Re-quantizing the same artifact to affine-8 reduced the slowdown from 29% to 10%.
2. Lossy round-trip in the original packaging: even after re-quantization, the MTP head's calibration was damaged enough that MTP remained a net regression. A fresh affine-8 conversion straight from Qwen/Qwen3.6-27B (this artifact) restores a +48% speedup with --mtp.

This release is therefore a fresh affine-8 conversion from the upstream Qwen base, not a re-packaging of the previous mxfp8 artifact.
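
A toy cost model helps show why per-step overhead decides whether MTP is a win or a regression. The sketch below is not taken from mlx-lm. It assumes one drafted token per step (matching the 2-token verify pass described above), and both `accept_rate` and `step_cost` are hypothetical inputs, not measurements from this model.

```python
def mtp_speedup(accept_rate: float, step_cost: float) -> float:
    """Throughput of MTP decoding relative to plain decoding (toy model).

    accept_rate: probability that the single drafted token is accepted
                 (hypothetical input, not measured here).
    step_cost:   cost of one MTP step (2-token verify pass plus MTP head),
                 in units of one plain single-token decode step
                 (hypothetical input, not measured here).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if step_cost <= 0.0:
        raise ValueError("step_cost must be positive")
    # Every step emits the verified token, plus the draft when it is accepted.
    tokens_per_step = 1.0 + accept_rate
    return tokens_per_step / step_cost


def break_even_cost(accept_rate: float) -> float:
    """Largest step_cost at which MTP is still no slower than plain decoding."""
    return 1.0 + accept_rate


if __name__ == "__main__":
    # Illustrative values only, chosen to show both regimes.
    for accept_rate, step_cost in [(0.8, 1.2), (0.8, 2.0), (0.4, 1.6)]:
        ratio = mtp_speedup(accept_rate, step_cost)
        print(f"accept={accept_rate:.1f} cost={step_cost:.1f} -> {ratio:.2f}x")
```

Under this model, anything that raises the cost of the 2-token verify pass (such as dequantization overhead that does not amortize) or lowers the acceptance rate (such as a damaged MTP head) pushes the ratio below 1.0, which is the regression described above.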

Conversion command

mlx_lm.convert --hf-path Qwen/Qwen3.6-27B \
  --mlx-path Qwen3.6-27B-mtp \
  -q --q-mode affine --q-bits 8 --q-group-size 64
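
For readers unfamiliar with the quantization flags, the following is a minimal pure-Python sketch of group-wise affine quantization with 8 bits and a group size of 64. It illustrates the general technique only and is not MLX's kernel or storage layout. The helper names are hypothetical, and `effective_bits` assumes one 16-bit scale and one 16-bit bias stored per group.

```python
import random


def quantize_group(weights, bits=8):
    """Affine-quantize one group of floats; returns (codes, scale, bias)."""
    levels = (1 << bits) - 1              # 255 distinct steps for 8-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    if scale == 0.0:                      # constant group: any scale works
        scale = 1.0
    # Map each weight to the nearest integer code in [0, levels].
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights as code * scale + bias."""
    return [c * scale + bias for c in codes]


def quantize(weights, group_size=64, bits=8):
    """Split a flat weight list into groups and quantize each one."""
    if len(weights) % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def effective_bits(bits=8, group_size=64, meta_bits=16):
    """Bits per weight including a per-group scale and bias (assumed 16-bit)."""
    return bits + 2 * meta_bits / group_size


if __name__ == "__main__":
    rng = random.Random(0)
    w = [rng.gauss(0.0, 0.02) for _ in range(128)]
    groups = quantize(w)
    restored = [x for g in groups for x in dequantize_group(*g)]
    worst = max(abs(a - b) for a, b in zip(w, restored))
    print(f"groups={len(groups)} max_abs_error={worst:.2e}")
    print(f"effective bits per weight: {effective_bits()}")
```

Because each group of 64 weights gets its own scale and bias, the rounding error per weight is bounded by half of that group's scale.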

Run

Without MTP (stock mlx-lm from PyPI):

mlx_lm.generate --model trevon/Qwen3.6-27B-mtp \
  --prompt "..." --max-tokens 100

With MTP (requires the feat/mtp-native branch of AirRunner's mlx-lm fork, PR 990):

git clone https://github.com/AirRunner/mlx-lm.git
cd mlx-lm && git checkout feat/mtp-native
uv venv && uv pip install -e .
mlx_lm.generate --model trevon/Qwen3.6-27B-mtp \
  --prompt "..." --max-tokens 100 --mtp

Benchmarks (Apple M4 Max)

Mode	Tokens/sec	Change vs. no `--mtp`
no `--mtp`	15.1	baseline
`--mtp`	22.4	+48%
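
The percentage in the table follows directly from the two throughput figures. A short sketch of that arithmetic is shown here. The constants are the reported numbers, the helper names are hypothetical, and the timing estimate covers decode only, ignoring model load and prompt processing.

```python
BASELINE_TPS = 15.1  # tokens/sec without --mtp, as reported above
MTP_TPS = 22.4       # tokens/sec with --mtp, as reported above


def relative_change(new: float, old: float) -> float:
    """Fractional change of `new` relative to `old`."""
    return (new - old) / old


def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Decode-only time estimate for a given number of generated tokens."""
    return tokens / tokens_per_sec


speedup_pct = relative_change(MTP_TPS, BASELINE_TPS) * 100

if __name__ == "__main__":
    print(f"speedup: {speedup_pct:+.0f}%")
    for label, tps in [("no --mtp", BASELINE_TPS), ("--mtp", MTP_TPS)]:
        print(f"{label}: {seconds_for(100, tps):.1f} s per 100 tokens")
```

For the `--max-tokens 100` setting used in the run commands, this works out to roughly 6.6 s of decoding without `--mtp` versus roughly 4.5 s with it.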

License: per upstream Qwen/Qwen3.6-27B.
