OsaurusAI

Qwen3.6-27B-MXFP4-MTP

Qwen3.6-27B (dense) quantized to native MXFP4 for Apple Silicon, with the vision tower and the native Multi-Token-Prediction (MTP) head preserved and enabled. At 14.38 GB, this is the smallest bundle in the line.

Source: Qwen/Qwen3.6-27B
License: Apache-2.0, inherited from upstream
Format: MXFP4 (mx.quantize, affine, group_size=32)
Architecture: qwen3_5 dense, 64 layers, hybrid GatedDeltaNet + full attention, hidden 5120
Modality: image + video + text
Context: 262,144
Bundle size: 14.38 GB
MTP: native head preserved, enabled (num_nextn_predict_layers=1)

Quantization

4-bit affine linears via MLX-native mx.quantize (mode="mxfp4", group_size=32). Norms, hybrid-attention control tensors and the full vision tower are kept in fp16 passthrough. MTP linears are quantized to MXFP4; MTP norm/control tensors stay fp16.
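As a rough illustration of what MXFP4 with group_size=32 stores, the sketch below quantizes one group of 32 values in pure Python. It assumes the standard MXFP4 layout (4-bit E2M1 elements sharing one power-of-two scale per group). The function names are hypothetical, and this is not the MLX kernel, which also packs the 4-bit codes into integer words.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 32  # one shared scale per 32 weights, as in group_size=32


def quantize_group(values):
    """Return (shared_exponent, signed E2M1 values) for one group."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # The largest E2M1 magnitude is 6 = 1.5 * 2**2, so the shared exponent
    # is the exponent of the group's absolute maximum minus 2.
    exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exp
    codes = []
    for v in values:
        target = abs(v) / scale
        # Nearest representable magnitude; targets above 6 saturate at 6.
        # Ties resolve toward the smaller magnitude in this sketch; a real
        # kernel may round differently.
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - target))
        codes.append(math.copysign(nearest, v))
    return exp, codes


def dequantize_group(exp, codes):
    """Rebuild approximate weights from the shared exponent and the codes."""
    return [c * 2.0 ** exp for c in codes]
```

Under that assumed layout, each group costs 32 × 4 bits for the elements plus one 8-bit scale, or 4.25 bits per quantized weight.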

Multi-Token Prediction

This bundle keeps Qwen3.6's native MTP module and runs it as a self-speculative draft head: the MTP head proposes tokens, and the main model verifies them in a single pass. A proposed token is kept only when it matches what the main model would have produced itself, so decoded output stays bit-identical to plain autoregressive decoding; only the speed changes.
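The propose-and-verify loop can be sketched with toy models. This is a minimal illustration of greedy speculative decoding in general, not the vMLX implementation: `toy_main` and `toy_draft` are hypothetical stand-ins for the main model and the MTP head, and the verification loop stands in for what a real runtime does in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def toy_main(seq: List[int]) -> int:
    """Hypothetical main model: a deterministic next-token function."""
    return (31 * sum(seq) + len(seq)) % 50


def toy_draft(seq: List[int]) -> int:
    """Hypothetical draft head: agrees with the main model most of the time."""
    tok = toy_main(seq)
    return (tok + 1) % 50 if len(seq) % 4 == 0 else tok


def greedy_decode(main: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain autoregressive decoding: one main-model call per token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(main(seq))
    return seq[len(prompt):]


def speculative_decode(main: NextToken, draft: NextToken,
                       prompt: List[int], n: int, depth: int = 2) -> List[int]:
    """Draft proposes `depth` tokens; main keeps the matching prefix."""
    seq, out = list(prompt), []
    while len(out) < n:
        # 1. The draft head proposes `depth` tokens autoregressively.
        proposal: List[int] = []
        for _ in range(depth):
            proposal.append(draft(seq + proposal))
        # 2. The main model checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = main(seq + accepted)
            if tok != expected:
                # First mismatch: keep the main model's token and stop.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft matched: verification also yields one extra token.
            accepted.append(main(seq + accepted))
        seq.extend(accepted)
        out.extend(accepted)
    return out[:n]
```

Every emitted token is the main model's own choice for its prefix, which is why the output matches `greedy_decode` at any depth. The draft only changes how many tokens each verification step yields.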

Recorded on an M5 Max (vMLX runtime, 96-token deterministic prompt, output verified equal to baseline at every depth):

| Draft depth | tok/s | Speedup |
|---|---|---|
| Baseline (MTP off) | 24.7 | 1.00× |
| D1 | 40.5 | 1.64× |
| D2 (default) | 45.7 | 1.85× |
| D3 | 45.0 | 1.83× |

On this bundle D2 is the fastest depth. D3 comes within 2% of it (45.0 vs. 45.7 tok/s) but does not pull ahead, so the runtime selects D2 by default.

Absolute tok/s depends on free memory and system load. The speedup ratio — baseline vs. MTP measured back-to-back under identical conditions — is the stable figure.
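The speedup column is the MTP throughput divided by the baseline throughput from the same session. A small sketch that recomputes it from the rounded tok/s figures reported above (the `speedup` helper is a hypothetical name):

```python
def speedup(tok_s: float, baseline_tok_s: float) -> float:
    """Ratio of MTP throughput to the MTP-off baseline."""
    return tok_s / baseline_tok_s


BASELINE_TOK_S = 24.7  # MTP off, from the benchmark above
RUNS = {"D1": 40.5, "D2": 45.7, "D3": 45.0}

for depth, tok_s in RUNS.items():
    print(f"{depth}: {speedup(tok_s, BASELINE_TOK_S):.2f}x")
```

Recomputing from the rounded tok/s values gives 1.82× for D3, within rounding of the reported 1.83×, which was presumably derived from unrounded measurements.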

Vision, MTP and caching together

This bundle runs image/video input, native MTP speculative decode and prefix/KV caching in the same session — a combination not every MTP-enabled Qwen build exposes. A recorded smoke test confirms both a text prompt and an image color prompt return correct answers through the combined MTP + VL runtime.

Loading

Loads via stock MLX tooling on Apple Silicon — the mxfp4 weights are native mx.quantize affine, no JANG runtime required for the core model.

from mlx_vlm import load, generate
model, processor = load("OsaurusAI/Qwen3.6-27B-MXFP4-MTP")

The MTP draft path is exercised by an MTP-aware runtime (vMLX); other runtimes load and decode the main model normally and ignore the MTP head.

Variants

| Variant | Arch | Format | Size | Best MTP speedup |
|---|---|---|---|---|
| Qwen3.6-27B-MXFP4-MTP (this) | dense | mxfp4 | 14.4 GB | 1.85× (D2) |
| Qwen3.6-27B-MXFP8-MTP | dense | mxfp8 | 27.1 GB | 1.83× (D3) |
| Qwen3.6-35B-A3B-MXFP4-MTP | MoE | mxfp4 | 21.5 GB | 1.56× (D3) |
| Qwen3.6-35B-A3B-MXFP8-MTP | MoE | mxfp8 | 35.0 GB | 1.71× (D3) |

Credits
