Qwen3.5-27B-MLX-MTP (affine-8)

MLX conversion of Qwen/Qwen3.5-27B, affine 8-bit (group_size 64), with the native Multi-Token-Prediction (MTP) head embedded in the main weight shards for native speculative decoding.

What changed from the previous packaging

This release replaces the previous full-precision sidecar packaging:

  • Quantization: full-precision → affine 8-bit (group_size 64). ~50 GB → ~27 GB.
  • MTP head: previously a mtp.safetensors sidecar that required mlx_lm_extra_tensors metadata to be honored. Now embedded directly in model.safetensors.index.json, so both stock mlx-lm (which silently strips MTP) and the AirRunner feat/mtp-native branch (PR 990) load it without any glue or symlink hack.
  • Fully verified for --mtp: produced as a one-step mlx_lm.convert -q --q-mode affine against the upstream Qwen base, no intermediate dequant→requant.

Conversion command

mlx_lm.convert --hf-path Qwen/Qwen3.5-27B \
  --mlx-path Qwen3.5-27B-MLX-MTP \
  -q --q-mode affine --q-bits 8 --q-group-size 64

Run

Without MTP (stock mlx-lm from PyPI):

mlx_lm.generate --model trevon/Qwen3.5-27B-MLX-MTP \
  --prompt "..." --max-tokens 100

With MTP (AirRunner fork):

git clone https://github.com/AirRunner/mlx-lm.git
cd mlx-lm && git checkout feat/mtp-native
uv venv && uv pip install -e .
mlx_lm.generate --model trevon/Qwen3.5-27B-MLX-MTP \
  --prompt "..." --max-tokens 100 --mtp

Benchmarks (Apple M4 Max)

Mode tokens/sec
no --mtp 14.5
--mtp 23.9 (+64%)

Apache 2.0 per upstream Qwen3.5.

Downloads last month
791
Safetensors
Model size
27B params
Tensor type
BF16
·
U32
·
F32
·
MLX
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for trevon/Qwen3.5-27B-MLX-MTP

Base model

Qwen/Qwen3.5-27B
Quantized
(206)
this model