Qwen3.5-4B-MTP-bf16

This repository contains Multi-Token Prediction (MTP) drafter weights split from Qwen/Qwen3.5-4B for use with mlx-vlm speculative decoding.

This is not a standalone chat or text-generation model. Load it as the draft model alongside a compatible Qwen3.5 4B target checkpoint.

Use with mlx-vlm

uv run mlx_vlm.generate \
  --model Qwen/Qwen3.5-4B \
  --draft-model mlx-community/Qwen3.5-4B-MTP-bf16 \
  --prompt "Hi, how are you?" \
  --max-tokens 256 \
  --enable-thinking

For local weights:

uv run mlx_vlm.generate \
  --model /path/to/target-model \
  --draft-model /path/to/Qwen3.5-4B-mtp \
  --prompt "Hi, how are you?" \
  --max-tokens 256 \
  --enable-thinking

Model Details

  • Model type: qwen3_5_mtp
  • MTP block size: 2
  • Target architecture: Qwen3.5 4B
  • Precision: bf16
  • Runtime: MLX / mlx-vlm
  • Format: Safetensors with MLX-compatible config and tokenizer files

The stored tensors are unquantized MLX-compatible drafter weights.
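Before passing a local directory as the draft model, it can help to confirm that it really holds the drafter and not a full target checkpoint. The sketch below is a minimal check under one assumption: that the repository's `config.json` stores the model type listed above under a top-level `model_type` key, as is conventional for Hugging Face configs. This has not been verified against this repository, and the helper names `load_config` and `is_mtp_drafter` are hypothetical.

```python
import json
from pathlib import Path

# Value taken from the "Model type" entry in Model Details.
EXPECTED_MODEL_TYPE = "qwen3_5_mtp"


def load_config(model_dir: str) -> dict:
    """Read config.json from a local model directory."""
    with open(Path(model_dir) / "config.json", encoding="utf-8") as f:
        return json.load(f)


def is_mtp_drafter(config: dict) -> bool:
    """Return True if the config declares the MTP drafter model type.

    Assumption: the type is stored under the top-level "model_type" key.
    """
    return config.get("model_type") == EXPECTED_MODEL_TYPE
```

For example, `is_mtp_drafter(load_config("/path/to/Qwen3.5-4B-mtp"))` would be expected to return True for a local copy of these weights if the assumption above holds.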

Intended Use

Use this repository only as a speculative-decoding drafter for compatible Qwen3.5 4B checkpoints. At each decoding step this MTP model proposes candidate tokens, and the target model verifies them before they are added to the output.
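The draft-and-verify division of labor can be illustrated with a toy greedy loop. This is a conceptual sketch only, not mlx-vlm's implementation: the "models" are plain Python functions over integer tokens, `num_draft` is an illustrative parameter that is not claimed to correspond to the MTP block size, and a real runtime would verify all drafted tokens in a single target forward pass, not one call per token.

```python
from typing import Callable, List

# A toy "model": maps the token sequence so far to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], num_draft: int) -> List[int]:
    """Run one draft-and-verify step and return the tokens to append."""
    # 1. The drafter proposes num_draft tokens autoregressively.
    proposed: List[int] = []
    ctx = list(tokens)
    for _ in range(num_draft):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal in order.
    accepted: List[int] = []
    ctx = list(tokens)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target adds one more token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, num_draft: int = 2) -> List[int]:
    """Greedy speculative decoding; output matches target-only decoding."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens.extend(speculative_step(target, draft, tokens, num_draft))
    return tokens[: len(prompt) + max_new_tokens]


# Toy models: the target counts upward modulo 10; the drafter agrees
# except after token 4, where it guesses wrong.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

The point of the design is that a wrong draft costs speed, not correctness: with greedy verification the output is the same as decoding with the target alone, and the drafter only determines how many tokens are accepted per step.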

Limitations

This checkpoint requires a version of mlx-vlm with runtime support for Qwen/DeepSeek MTP draft models. On its own, this repository is not expected to work for standalone generation through generic Transformers APIs.

Please refer to the upstream Qwen/Qwen3.5-4B model card and license terms for model usage constraints.
