DeepSeek-V4-Flash-MTP-bf16

This repository contains the Multi-Token Prediction (MTP) drafter weights split from DeepSeek-V4-Flash for use with mlx-vlm speculative decoding.

This is not a standalone chat or text-generation model. Load it as the draft model alongside the target DeepSeek V4 Flash model.

Use with mlx-vlm

uv run mlx_vlm.generate \
  --model mlx-community/DeepSeek-V4-Flash-4bit \
  --draft-model mlx-community/DeepSeek-V4-Flash-MTP-bf16 \
  --prompt "Hi, how are you?" \
  --max-tokens 256

For local weights:

uv run mlx_vlm.generate \
  --model /path/to/DeepSeek-V4-Flash-4bit \
  --draft-model /path/to/DeepSeek-V4-Flash-MTP \
  --prompt "Hi, how are you?" \
  --max-tokens 256

Model Details

  • Model type: deepseek_v4_mtp
  • MTP block size: 2
  • Target architecture: DeepSeek V4 Flash
  • Runtime: MLX / mlx-vlm
  • Format: Safetensors with MLX-compatible config and tokenizer files

The stored tensors include bfloat16 parameters and MLX quantized tensors as described in config.json.
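To see which tensor types a downloaded shard actually holds, the Safetensors header can be read without loading any weights. The following is a minimal sketch using only the Python standard library. The shard path and file name are hypothetical placeholders, and reading packed integer dtypes such as U32 as MLX quantized weights is an assumption to confirm against config.json.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length,
        # followed by that many bytes of UTF-8 JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def dtype_counts(path):
    """Count tensors per dtype string (for example "BF16" or "U32")."""
    header = read_safetensors_header(path)
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


if __name__ == "__main__":
    # Hypothetical local path; point it at a shard from the downloaded repository.
    shard = Path("/path/to/DeepSeek-V4-Flash-MTP/model.safetensors")
    if shard.exists():
        print(dict(dtype_counts(shard)))
```

Because only the header is parsed, this runs quickly even on large shards.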

Intended Use

Use this repository only as a speculative decoding drafter for compatible DeepSeek V4 Flash checkpoints. This MTP model proposes multiple candidate tokens per decoding step, and the target model verifies them.
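The propose-then-verify loop described above can be illustrated with a toy greedy sketch. This is not mlx-vlm's implementation: the `target` and `draft` callables, the `NextToken` alias, and both function names are hypothetical stand-ins, and using `block_size=2` as the number of drafted tokens per step is an assumption about what "MTP block size" means. A real runtime would typically verify all proposals in a single batched forward pass of the target model and may use sampling-based acceptance instead of exact matching.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a model maps a token context to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(
    target: NextToken, draft: NextToken, context: List[int], block_size: int = 2
) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1. The drafter proposes `block_size` tokens autoregressively.
    proposed: List[int] = []
    for _ in range(block_size):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal against its own greedy choice.
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_tokens: int,
    block_size: int = 2,
) -> List[int]:
    """Repeat speculative steps until `max_tokens` new tokens exist."""
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        out.extend(speculative_step(target, draft, out, block_size))
    return out[: len(prompt) + max_tokens]
```

With exact-match verification, the output is identical to what the target would produce on its own under greedy decoding. The drafter only changes how many target tokens are confirmed per step.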

Limitations

This checkpoint requires runtime support for DeepSeek V4 MTP draft models. Standalone generation from this repository through generic Transformers APIs is not expected to work.
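A loader script can guard against treating this checkpoint as a standalone model by checking the declared model type first. The sketch below assumes config.json stores the type under the conventional `model_type` key. The helper name `is_mtp_drafter` is hypothetical and not part of mlx-vlm.

```python
import json
from pathlib import Path

# Value taken from the Model Details section of this card.
EXPECTED_MODEL_TYPE = "deepseek_v4_mtp"


def is_mtp_drafter(repo_dir):
    """Return True if repo_dir/config.json declares the MTP drafter model type."""
    config_path = Path(repo_dir) / "config.json"
    if not config_path.is_file():
        return False
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: the type is stored under the conventional "model_type" key.
    return config.get("model_type") == EXPECTED_MODEL_TYPE
```

A caller could pass a directory as the draft model only when this returns True, and otherwise fall back to ordinary decoding with the target model.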

Please refer to the upstream DeepSeek-V4-Flash model card and license terms for model usage constraints.

Safetensors: model size 1B params; tensor types F32, BF16, U8, U32.