Instructions to use osmapi/DeepSeek-V4-Flash-5bit-mlx with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use osmapi/DeepSeek-V4-Flash-5bit-mlx with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("osmapi/DeepSeek-V4-Flash-5bit-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- MLX LM
How to use osmapi/DeepSeek-V4-Flash-5bit-mlx with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "osmapi/DeepSeek-V4-Flash-5bit-mlx"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server (port set explicitly to match the curl call below)
mlx_lm.server --model "osmapi/DeepSeek-V4-Flash-5bit-mlx" --port 8000

# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "osmapi/DeepSeek-V4-Flash-5bit-mlx",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
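The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server is reachable at `http://localhost:8000` as in the curl call, and that the response follows the usual OpenAI chat-completions shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

# Assumed values: host and port mirror the curl call; adjust to your server.
BASE_URL = "http://localhost:8000"
MODEL_ID = "osmapi/DeepSeek-V4-Flash-5bit-mlx"


def build_chat_request(user_text, base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, timeout=600):
    """Send the request and return the text of the first choice."""
    request = build_chat_request(user_text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(chat("Hello"))
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.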
DeepSeek-V4-Flash-5bit-mlx
osmapi/DeepSeek-V4-Flash-5bit-mlx is an Apple-Silicon MLX quantization of deepseek-ai/DeepSeek-V4-Flash.
No fine-tuning, distillation, or retraining was applied. The official mixed FP4/FP8 source weights were converted locally, the MTP head was dropped because it is not used for normal decode, and router/mHC/control tensors were preserved rather than aggressively quantized.
Model Details
| Property | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-V4-Flash |
| Architecture | DeepSeek-V4 Flash MoE, 284B total / 13B active, 1M context |
| Local profile | MLX-Affine-Q5 |
| Bundle size | 195.67 GB |
| Layout | Pre-stacked MLX switch_mlp layout |
| MTP head | Dropped |
| Validation | Safetensors header/index validation, metadata validation |
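The header/index validation mentioned in the table can be approximated as follows. This is a sketch under stated assumptions, not the script used for this upload: it relies on the public safetensors layout (an 8-byte little-endian length followed by a JSON header) and on the usual `weight_map` key in `model.safetensors.index.json`. The function names are hypothetical.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def validate_bundle(model_dir):
    """Cross-check the shard index against the shard headers.

    Returns a list of human-readable problems; an empty list means
    every indexed tensor was found in the shard the index names.
    """
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    problems = []
    headers = {}  # shard filename -> parsed header, or None if the file is missing
    for name, shard in index["weight_map"].items():
        if shard not in headers:
            shard_path = model_dir / shard
            if not shard_path.is_file():
                problems.append(f"missing shard: {shard}")
                headers[shard] = None
                continue
            headers[shard] = read_safetensors_header(shard_path)
        header = headers[shard]
        if header is None:
            continue  # missing shard was already reported once
        if name not in header:
            problems.append(f"{name} not found in {shard}")
            continue
        start, end = header[name]["data_offsets"]
        if end < start:
            problems.append(f"{name} has invalid data_offsets")
    return problems
```

Because only headers are read, a check like this stays cheap even on a bundle of this size.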
Quantization Recipe
| Tensor class | Codec | Bits / handling |
|---|---|---|
| Linear/Embedding/SwitchLinear weights | MLX affine | 5-bit, group size 64 |
| Routed experts | MLX affine | pre-stacked switch_mlp tensors with .weight, .scales, .biases |
| Norms, router gate, mHC, sinks, APE, integer routing tables | passthrough | source precision preserved |
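As a back-of-the-envelope check on what 5-bit, group size 64 costs in storage, the sketch below amortizes one scale and one bias over each group. It assumes 16-bit scales and biases, which the table does not state, and it pretends every parameter is quantized, which the passthrough row shows is not the case.

```python
def bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Effective storage per weight: packed code plus amortized scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def estimate_gb(num_params, bits, group_size):
    """Rough size in GB (10**9 bytes) if every parameter were quantized."""
    return num_params * bits_per_weight(bits, group_size) / 8 / 1e9


# Assumption: 16-bit scales and biases; 284e9 is the total parameter count
# from the Model Details table.
bpw = bits_per_weight(bits=5, group_size=64)
size = estimate_gb(284e9, bits=5, group_size=64)
print(f"{bpw} bits/weight, roughly {size:.0f} GB")
```

Under these assumptions the effective rate is 5.5 bits per weight, and the estimate lands in the same range as the listed 195.67 GB bundle size. The agreement is only approximate: passthrough tensors are stored at source precision, and the estimate does not account for the dropped MTP head.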
This is a normal MLX affine quantization, not JANGTQ/TurboQuant. Quantized tensors use the standard MLX triplet layout:
- .weight
- .scales
- .biases
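To make the roles of the three tensors concrete, here is a simplified pure-Python sketch of group-wise affine quantization, in which each group of 64 weights shares one scale and one bias and each weight is stored as a 5-bit code. It illustrates the general scheme only: the real MLX implementation packs codes into integer words and may choose scales and biases somewhat differently, and the function names here are hypothetical.

```python
import math


def quantize_group(values, bits=5):
    """Affine-quantize one group and return (codes, scale, bias)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1            # 31 steps for 5-bit codes
    scale = (hi - lo) / levels or 1.0   # avoid a zero scale for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights as scale * code + bias."""
    return [scale * q + bias for q in codes]


def quantize(weights, bits=5, group_size=64):
    """Quantize a flat list of weights in consecutive groups."""
    assert len(weights) % group_size == 0
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


# Round-trip two groups of synthetic weights and measure the worst error.
weights = [math.sin(0.1 * i) for i in range(128)]
groups = quantize(weights)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

In this sketch the codes correspond to `.weight`, the per-group step size to `.scales`, and the per-group offset to `.biases`. The round-trip error per weight is bounded by half of its group's scale.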
Use with MLX
pip install -U mlx-lm
from mlx_lm import load, generate
model, tokenizer = load("osmapi/DeepSeek-V4-Flash-5bit-mlx")
prompt = "Write a short note about MLX quantization."
text = generate(model, tokenizer, prompt=prompt, verbose=True)
print(text)
For the DeepSeek-V4 official chat/message format, see the included encoding/ folder from the upstream repository.
Files
- model-*.safetensors: standard MLX affine shards
- model.safetensors.index.json: shard index
- config.json, jang_config.json: MLX metadata
- encoding/: upstream DeepSeek-V4 prompt encoding reference
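One quick sanity check on the shard index is that every quantized tensor appears with its full triplet. The sketch below takes a list of tensor names, such as the keys of `weight_map` in model.safetensors.index.json, and reports prefixes with a missing part. The helper name is hypothetical, and treating any name ending in `.scales` or `.biases` as quantized is an assumption about the naming.

```python
TRIPLET = (".weight", ".scales", ".biases")


def check_triplets(tensor_names):
    """Return {prefix: [missing suffixes]} for incomplete quantized triplets."""
    names = set(tensor_names)
    incomplete = {}
    for name in names:
        # Assumption: only quantized tensors carry .scales or .biases entries.
        if name.endswith(".scales") or name.endswith(".biases"):
            prefix = name.rsplit(".", 1)[0]
            missing = [s for s in TRIPLET if prefix + s not in names]
            if missing:
                incomplete[prefix] = missing
    return incomplete
```

Passthrough tensors such as norms have only a `.weight` entry and are ignored by this check.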
Notes
The repository name follows the bit-width suffix convention used by mlx-community/*-4bit and *-8bit uploads and by the local osmapi/*-6bit-mlx style. The README keeps the explicit recipe and validation structure used by larger DeepSeek-V4 quantized uploads.
License
MIT, following the upstream DeepSeek-V4-Flash release.