DeepSeek-V4-Flash-MTP-bf16
This repository contains the Multi-Token Prediction (MTP) drafter weights split from DeepSeek-V4-Flash for use with mlx-vlm speculative decoding.
This is not a standalone chat or text-generation model. Load it as the draft model alongside the target DeepSeek V4 Flash model.
Use with mlx-vlm
uv run mlx_vlm.generate \
  --model mlx-community/DeepSeek-V4-Flash-4bit \
  --draft-model mlx-community/DeepSeek-V4-Flash-MTP-bf16 \
  --prompt "Hi, how are you?" \
  --max-tokens 256
For local weights:
uv run mlx_vlm.generate \
  --model /path/to/DeepSeek-V4-Flash-4bit \
  --draft-model /path/to/DeepSeek-V4-Flash-MTP \
  --prompt "Hi, how are you?" \
  --max-tokens 256
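If you drive the CLI from a script, the same invocation can be assembled in Python. This is a minimal sketch that only reuses the flags shown above (`--model`, `--draft-model`, `--prompt`, `--max-tokens`); the helper names `build_generate_command` and `run_generate` are hypothetical and not part of mlx-vlm.

```python
import shlex
import subprocess


def build_generate_command(target, draft, prompt, max_tokens=256):
    """Return the argv list for the mlx_vlm.generate call shown above.

    `target` and `draft` may be Hugging Face repo ids or local paths.
    """
    return [
        "uv", "run", "mlx_vlm.generate",
        "--model", str(target),
        "--draft-model", str(draft),
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
    ]


def run_generate(target, draft, prompt, max_tokens=256):
    """Print the command, then run it; raises CalledProcessError on failure."""
    cmd = build_generate_command(target, draft, prompt, max_tokens)
    print(shlex.join(cmd))  # shell-quoted form, handy for logs
    return subprocess.run(cmd, check=True)
```

Passing an argv list instead of a shell string avoids quoting problems when the prompt contains spaces or quotes.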
Model Details
- Model type: deepseek_v4_mtp
- MTP block size: 2
- Target architecture: DeepSeek V4 Flash
- Runtime: MLX / mlx-vlm
- Format: Safetensors with MLX-compatible config and tokenizer files
The stored tensors include bfloat16 parameters and MLX quantized tensors as described in config.json.
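To see what config.json declares for a local copy of this repo, a small script along these lines may help. It is a sketch under assumptions: the `quantization` key follows the convention commonly used by MLX-converted checkpoints and may be named differently or absent here, and `load_config` / `summarize_config` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_config(repo_dir):
    """Read config.json from a local copy of the repository."""
    path = Path(repo_dir) / "config.json"
    return json.loads(path.read_text(encoding="utf-8"))


def summarize_config(config):
    """Pick out the fields relevant to loading the drafter.

    ASSUMPTION: quantization settings live under a top-level
    "quantization" key; a missing key is reported as None.
    """
    return {
        "model_type": config.get("model_type"),
        "quantization": config.get("quantization"),
    }
```

For example, `summarize_config(load_config("/path/to/DeepSeek-V4-Flash-MTP"))` should report `deepseek_v4_mtp` as the model type if the local copy matches the details listed above.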
Intended Use
Use this repo only as a speculative decoding drafter for compatible DeepSeek V4 Flash checkpoints. The target model verifies drafted tokens, while this MTP model proposes multiple candidate tokens per decoding step.
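The propose-and-verify loop can be illustrated with a toy example. This is a generic greedy speculative decoding sketch, not the mlx-vlm implementation: the two "models" are plain Python functions over integer tokens, `block_size` is only the number of drafted tokens in this toy (it is not claimed to match the MTP block size listed above), and a real MTP drafter would also consume hidden states from the target model.

```python
def toy_target(ctx):
    """Stand-in target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    """Stand-in drafter: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


def speculative_step(target_next, draft_next, tokens, block_size=2):
    """Run one draft-and-verify step and return the newly emitted tokens."""
    # Draft phase: the cheap model proposes block_size tokens in sequence.
    drafted, ctx = [], list(tokens)
    for _ in range(block_size):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Verify phase: keep drafted tokens while they match the target's own
    # greedy choice. A real runtime scores all positions in one forward
    # pass; this toy calls the target once per position for clarity.
    emitted, ctx = [], list(tokens)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            emitted.append(expected)  # the target's correction ends the step
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # Every draft was accepted, so the target's next token comes for free.
    emitted.append(target_next(ctx))
    return emitted
```

With these toy models, `speculative_step(toy_target, toy_draft, [0])` accepts both drafts and emits three tokens, while starting from `[3]` rejects the first draft and emits only the target's correction. In both cases the output matches what greedy decoding with the target alone would produce, which is why the drafter affects speed but not the generated text under greedy verification.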
Limitations
This checkpoint requires runtime support for DeepSeek V4 MTP draft models. Standard standalone generation through generic Transformers APIs is not expected to work with this repository by itself.
Please refer to the upstream DeepSeek-V4-Flash model card and license terms for model usage constraints.
Model tree for mlx-community/DeepSeek-V4-Flash-MTP-bf16
Base model: deepseek-ai/DeepSeek-V4-Flash