Instructions to use OsaurusAI/Qwen3.6-27B-MXFP4-MTP with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use OsaurusAI/Qwen3.6-27B-MXFP4-MTP with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("OsaurusAI/Qwen3.6-27B-MXFP4-MTP") config = load_config("OsaurusAI/Qwen3.6-27B-MXFP4-MTP") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi new
How to use OsaurusAI/Qwen3.6-27B-MXFP4-MTP with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "OsaurusAI/Qwen3.6-27B-MXFP4-MTP"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "OsaurusAI/Qwen3.6-27B-MXFP4-MTP" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use OsaurusAI/Qwen3.6-27B-MXFP4-MTP with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "OsaurusAI/Qwen3.6-27B-MXFP4-MTP"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default OsaurusAI/Qwen3.6-27B-MXFP4-MTP
Run Hermes
hermes

Qwen3.6-27B-MXFP4-MTP
Qwen3.6-27B (dense) quantized to native MXFP4 for Apple Silicon, with the vision tower and the native Multi-Token-Prediction head preserved and enabled. This is the smallest bundle in the line — full Qwen3.6-27B capability in 14 GB.
| Source | Qwen/Qwen3.6-27B |
| License | Apache-2.0, inherited from upstream |
| Format | MXFP4 (mx.quantize, affine, group_size=32) |
| Architecture | qwen3_5 dense — 64 layers, hybrid GatedDeltaNet + full attention, hidden 5120 |
| Modality | image + video + text |
| Context | 262,144 |
| Bundle size | 14.38 GB |
| MTP | native head preserved, enabled (num_nextn_predict_layers=1) |
Quantization
4-bit affine linears via MLX-native mx.quantize (mode="mxfp4",
group_size=32). Norms, hybrid-attention control tensors and the full
vision tower are kept in fp16 passthrough. MTP linears are quantized to
MXFP4; MTP norm/control tensors stay fp16.
Multi-Token Prediction
This bundle keeps Qwen3.6's native MTP module and runs it as a self-speculative draft head: the MTP head proposes tokens that the main model verifies in a single pass, so decoded output stays bit-identical to plain autoregressive decoding — only faster.
Recorded on an M5 Max (vMLX runtime, 96-token deterministic prompt, output verified equal to baseline at every depth):
| Draft depth | tok/s | Speedup |
|---|---|---|
| Baseline (MTP off) | 24.7 | 1.00× |
| D1 | 40.5 | 1.64× |
| D2 (default) | 45.7 | 1.85× |
| D3 | 45.0 | 1.83× |
On this bundle D2 is the fastest depth — D3 draws even but does not pull ahead, so the runtime selects D2 by default.
Absolute tok/s depends on free memory and system load. The speedup ratio — baseline vs. MTP measured back-to-back under identical conditions — is the stable figure.
Vision, MTP and caching together
This bundle runs image/video input, native MTP speculative decode and prefix/KV caching in the same session — a combination not every MTP-enabled Qwen build exposes. A recorded smoke test confirms both a text prompt and an image color prompt return correct answers through the combined MTP + VL runtime.
Loading
Loads via stock MLX tooling on Apple Silicon — the mxfp4 weights are
native mx.quantize affine, no JANG runtime required for the core model.
from mlx_vlm import load, generate
model, processor = load("OsaurusAI/Qwen3.6-27B-MXFP4-MTP")
The MTP draft path is exercised by an MTP-aware runtime (vMLX); other runtimes load and decode the main model normally and ignore the MTP head.
Variants
| Variant | Arch | Format | Size | Best MTP speedup |
|---|---|---|---|---|
| Qwen3.6-27B-MXFP4-MTP (this) | dense | mxfp4 | 14.4 GB | 1.85× (D2) |
| Qwen3.6-27B-MXFP8-MTP | dense | mxfp8 | 27.1 GB | 1.83× (D3) |
| Qwen3.6-35B-A3B-MXFP4-MTP | MoE | mxfp4 | 21.5 GB | 1.56× (D3) |
| Qwen3.6-35B-A3B-MXFP8-MTP | MoE | mxfp8 | 35.0 GB | 1.71× (D3) |
Credits
- Quantization toolchain: JANG by Jinho Jang <eric@osaurus.ai>
- Base model: Qwen3.6-27B by Qwen
- Downloads last month
- 422
Quantized
Model tree for OsaurusAI/Qwen3.6-27B-MXFP4-MTP
Base model
Qwen/Qwen3.6-27B