Qwen3.5-27B-MLX-MTP (affine-8)

MLX conversion of Qwen/Qwen3.5-27B, affine 8-bit (group_size 64), with the native Multi-Token-Prediction (MTP) head embedded in the main weight shards for native speculative decoding.

What changed from the previous packaging

This release replaces the previous full-precision sidecar packaging:

Quantization: full precision → affine 8-bit (group_size 64), shrinking the weights from ~50 GB to ~27 GB.
MTP head: previously shipped as an mtp.safetensors sidecar, which was only loaded when the mlx_lm_extra_tensors metadata was honored. The MTP tensors now live in the main weight shards and are listed in model.safetensors.index.json, so both stock mlx-lm (which silently drops the MTP weights) and the AirRunner feat/mtp-native branch (PR 990) load the checkpoint without glue code or symlink hacks.
Fully verified for --mtp: produced in a single mlx_lm.convert -q --q-mode affine pass against the upstream Qwen base, with no intermediate dequantize→requantize step.
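
One way to confirm that the MTP tensors really sit in the main shards is to look them up in the checkpoint index. The sketch below is illustrative only: it assumes the usual sharded-safetensors index layout (a top-level "weight_map" from tensor name to shard file) and assumes the MTP tensor names contain the substring "mtp". The tensor and shard names in the sample are hypothetical and are not taken from this repository.

```python
import json
from collections import defaultdict


def tensors_by_shard(index: dict, needle: str = "mtp") -> dict:
    """Group tensor names containing `needle` by the shard file that stores them."""
    shards = defaultdict(list)
    for name, shard in index["weight_map"].items():
        if needle in name:
            shards[shard].append(name)
    return dict(shards)


# Hand-written stand-in for model.safetensors.index.json.
# All tensor and shard names below are hypothetical.
sample_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "mtp.layers.0.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
        "mtp.norm.weight": "model-00002-of-00002.safetensors",
    },
}

mtp = tensors_by_shard(sample_index)
print(json.dumps(mtp, indent=2))
```

For a downloaded checkpoint, replace sample_index with the result of json.load on the local model.safetensors.index.json. An empty result would suggest the MTP tensors are missing or named differently than assumed here.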

Conversion command

mlx_lm.convert --hf-path Qwen/Qwen3.5-27B \
  --mlx-path Qwen3.5-27B-MLX-MTP \
  -q --q-mode affine --q-bits 8 --q-group-size 64
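
As a rough sanity check on the sizes quoted above, the effect of --q-bits 8 and --q-group-size 64 can be estimated by hand. This is a back-of-the-envelope sketch, not a description of what mlx_lm.convert does internally. It assumes a nominal 27e9 parameters, that "full precision" means 16 bits per weight, that every weight is quantized, and that each group of 64 weights stores one scale and one bias at 16 bits each.

```python
GIB = 1024 ** 3


def bits_per_weight(q_bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Effective bits per weight under affine quantization.

    Each group of `group_size` weights carries one scale and one bias,
    assumed here to cost `meta_bits` each.
    """
    return q_bits + 2 * meta_bits / group_size


def size_gib(n_params: float, bits: float) -> float:
    """Storage in GiB for `n_params` weights at `bits` bits per weight."""
    return n_params * bits / 8 / GIB


N_PARAMS = 27e9  # assumption: nominal count taken from the model name

full = size_gib(N_PARAMS, 16)                       # assumed 16-bit baseline
quant = size_gib(N_PARAMS, bits_per_weight(8, 64))  # 8.5 effective bits

print(f"16-bit: {full:.1f} GiB, affine 8-bit (group 64): {quant:.1f} GiB")
```

Under these assumptions the estimate lands close to the ~50 GB and ~27 GB figures. The real files will differ somewhat, since some tensors are typically left unquantized.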

Run

Without MTP (stock mlx-lm from PyPI):

mlx_lm.generate --model trevon/Qwen3.5-27B-MLX-MTP \
  --prompt "..." --max-tokens 100

With MTP (AirRunner fork):

git clone https://github.com/AirRunner/mlx-lm.git
cd mlx-lm && git checkout feat/mtp-native
uv venv && uv pip install -e .
mlx_lm.generate --model trevon/Qwen3.5-27B-MLX-MTP \
  --prompt "..." --max-tokens 100 --mtp

Benchmarks (Apple M4 Max)

Mode          Tokens/sec
no `--mtp`    14.5
`--mtp`       23.9 (+64%)
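
To put the table in wall-clock terms, the short sketch below derives the speedup ratio and the time to generate 100 tokens from the two throughput figures. It uses the rounded values shown in the table, so the ratio may differ slightly from the percentage quoted there.

```python
def speedup(baseline_tps: float, new_tps: float) -> float:
    """Throughput ratio of the new mode over the baseline."""
    return new_tps / baseline_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Seconds needed to generate `tokens` at `tps` tokens per second."""
    return tokens / tps


BASE_TPS, MTP_TPS = 14.5, 23.9  # rounded values from the table

ratio = speedup(BASE_TPS, MTP_TPS)
print(f"speedup: {ratio:.2f}x")
print(
    f"100 tokens: {seconds_for(100, BASE_TPS):.1f} s "
    f"-> {seconds_for(100, MTP_TPS):.1f} s"
)
```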

License: Apache 2.0, following upstream Qwen3.5.
