DeepSeek-V4-Flash-MTP-bf16

This repository contains the Multi-Token Prediction (MTP) drafter weights extracted from deepseek-ai/DeepSeek-V4-Flash, packaged for speculative decoding with mlx-vlm.

This is not a standalone chat or text-generation model. Load it as the draft model alongside a compatible DeepSeek V4 Flash target checkpoint.

Use with mlx-vlm

uv run mlx_vlm.generate \
  --model mlx-community/DeepSeek-V4-Flash-4bit \
  --draft-model mlx-community/DeepSeek-V4-Flash-MTP-bf16 \
  --prompt "Hi, how are you?" \
  --max-tokens 256 \
  --enable-thinking

For local weights:

uv run mlx_vlm.generate \
  --model /path/to/target-model \
  --draft-model /path/to/DeepSeek-V4-Flash-MTP \
  --prompt "Hi, how are you?" \
  --max-tokens 256 \
  --enable-thinking

Model Details

Model type: deepseek_v4_mtp
MTP block size: 2
Target architecture: DeepSeek V4 Flash
Precision: bf16, with some tensors quantized to 8-bit or 4-bit as specified in config.json
Runtime: MLX / mlx-vlm
Format: Safetensors with MLX-compatible config and tokenizer files

The stored tensors include bfloat16 parameters and MLX quantized tensors as described in config.json.
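To check which tensor types a downloaded shard actually holds, the safetensors header can be read directly. The sketch below relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function name safetensors_dtype_counts and the file path in the usage note are illustrative assumptions, not part of mlx-vlm.

```python
import json
import struct
from collections import Counter


def safetensors_dtype_counts(path):
    """Count tensors per dtype string (e.g. "BF16", "U32") in one safetensors file.

    Hypothetical helper for illustration; it reads only the header and never
    loads tensor data.
    """
    with open(path, "rb") as f:
        # The first 8 bytes hold the header size as an unsigned little-endian 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # The header is JSON mapping tensor names to dtype, shape, and data offsets.
        header = json.loads(f.read(header_len))
    return Counter(
        info["dtype"]
        for name, info in header.items()
        if name != "__metadata__"  # optional free-form metadata entry, not a tensor
    )
```

For example, print(safetensors_dtype_counts("/path/to/DeepSeek-V4-Flash-MTP/model.safetensors")) would list the dtypes in that shard; the shard file name here is an assumption. MLX typically packs quantized weights into unsigned 32-bit integers, so quantized layers would be expected to appear under an integer dtype rather than BF16.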

Intended Use

Use this repo only as a speculative decoding drafter for compatible DeepSeek V4 Flash checkpoints. At each decoding step, this MTP model proposes candidate tokens and the target model verifies them, keeping only the tokens it accepts.
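The draft-and-verify division of labor can be illustrated with a toy greedy loop. This is a minimal sketch of speculative decoding in general, not the mlx-vlm implementation. The names speculative_step and generate, the num_draft parameter, and the callable "models" are hypothetical stand-ins, and a real runtime verifies all drafted tokens in one batched target forward pass rather than one call per token.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a model: maps a token context to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: Sequence[int], num_draft: int) -> List[int]:
    """Run one draft-and-verify step and return the tokens to append."""
    # 1. The drafter proposes num_draft tokens autoregressively.
    drafted: List[int] = []
    for _ in range(num_draft):
        drafted.append(draft_next(list(context) + drafted))

    # 2. The target checks each proposal against its own greedy choice.
    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(list(context) + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted, so the target adds one more token.
    accepted.append(target_next(list(context) + accepted))
    return accepted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: Sequence[int], max_tokens: int, num_draft: int = 2) -> List[int]:
    """Generate max_tokens tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        out.extend(speculative_step(target_next, draft_next, out, num_draft))
    # A step can overshoot the budget, so trim to exactly max_tokens new tokens.
    return out[: len(prompt) + max_tokens]
```

Under greedy verification like this, the output matches what the target would produce on its own. The drafter only changes how many tokens are accepted per step, which is why a poor or incompatible drafter costs speed rather than quality.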

Limitations

This checkpoint requires runtime support for Qwen/DeepSeek MTP draft models in mlx-vlm. Standalone generation through the generic Transformers APIs is not expected to work with this repository alone.

Please refer to the upstream deepseek-ai/DeepSeek-V4-Flash model card and license terms for model usage constraints.
