DeepSeek-V4-Flash-TQ-Q2.3-MLX

osmapi/DeepSeek-V4-Flash-TQ-Q2.3-MLX is an Apple-Silicon MLX TurboQuant/JANGTQ quantization of deepseek-ai/DeepSeek-V4-Flash.

No fine-tuning, distillation, or retraining was applied. The official mixed FP4/FP8 source weights were converted locally, the MTP head was dropped because it is not used for normal decode, and router/mHC/control tensors were preserved rather than aggressively quantized.

Model Details

| Property | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-V4-Flash |
| Architecture | DeepSeek-V4 Flash MoE, 284B total / 13B active, 1M context |
| Local profile | JANGTQ-Q2.3 |
| Bundle size | 88.03 GB |
| Layout | Pre-stacked MLX switch_mlp layout |
| MTP head | Dropped |
| Validation | Safetensors header/index validation, metadata validation |
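The validation row mentions safetensors header checks. The author's validation script is not included here, so the following is only a minimal sketch of what such a check could look like. It relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function names are hypothetical.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return (header_dict, data_section_size) without loading any tensors."""
    path = Path(path)
    with path.open("rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header, path.stat().st_size - 8 - header_len


def validate_safetensors_header(path):
    """Raise ValueError if a tensor's byte range falls outside the data section."""
    header, data_size = read_safetensors_header(path)
    # "__metadata__" is an optional string-to-string map, not a tensor entry.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    for name, info in tensors.items():
        begin, end = info["data_offsets"]
        if not 0 <= begin <= end <= data_size:
            raise ValueError(
                f"{name}: offsets {begin}-{end} exceed data size {data_size}"
            )
    return len(tensors)
```

Calling `validate_safetensors_header` on each `model-*.safetensors` shard returns the number of tensor entries it inspected, and only reads the header bytes, so it is cheap even for a bundle of this size.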

Required Sidecar

This is a JANGTQ/TurboQuant bundle and requires jangtq_runtime.safetensors from this repository. The sidecar stores the deterministic codebooks and Hadamard rotation signs used to decode the .tq_packed expert weights. If it is missing, re-download the full repository or fetch that file explicitly:

hf download osmapi/DeepSeek-V4-Flash-TQ-Q2.3-MLX jangtq_runtime.safetensors --local-dir <your-model-dir>
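The same check can be scripted before loading. This is a hedged sketch, assuming `huggingface_hub` is installed; `has_sidecar` and `ensure_sidecar` are hypothetical helper names, not part of jang-tools.

```python
from pathlib import Path

SIDECAR_NAME = "jangtq_runtime.safetensors"


def has_sidecar(model_dir):
    """Return True if the TurboQuant sidecar exists in the local model directory."""
    return (Path(model_dir) / SIDECAR_NAME).is_file()


def ensure_sidecar(model_dir, repo_id="osmapi/DeepSeek-V4-Flash-TQ-Q2.3-MLX"):
    """Fetch only the sidecar file if it is missing, then return its path."""
    target = Path(model_dir) / SIDECAR_NAME
    if not target.is_file():
        # Imported lazily so the presence check works without huggingface_hub.
        from huggingface_hub import hf_hub_download

        hf_hub_download(repo_id=repo_id, filename=SIDECAR_NAME, local_dir=model_dir)
    return target
```

When the file is already present, `ensure_sidecar` makes no network call.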

Quantization Recipe

| Tensor class | Codec | Bits / handling |
|---|---|---|
| Routed experts | TurboQuant MXTQ | 110 routed layer/projection groups at 2-bit MXTQ and 19 at 4-bit MXTQ |
| Routed effective bits | MXTQ | 2.2946 bits |
| Attention, shared experts, compressor, indexer, embed, lm head | MLX affine | 8-bit, group size 32 |
| Norms, router, mHC, sinks, integer routing tables | passthrough | source precision preserved |

The fractional target of roughly 2.3 bits is implemented as a mix of power-of-two lanes (2-bit and 4-bit routed groups in this bundle) because the current JANGTQ vectorized packer is stable on 2/4/8-bit lanes for DeepSeek-V4 expert dimensions.
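The routed effective bit width in the recipe can be reproduced as a count-weighted average of the lane widths. The sketch below assumes that all 129 routed layer/projection groups hold the same number of weights; that assumption is not stated in the card, but it yields the reported figure.

```python
def effective_bits(groups):
    """Average bit width, given a mapping of bit width -> number of equal-sized groups."""
    total_groups = sum(groups.values())
    return sum(bits * count for bits, count in groups.items()) / total_groups


# 110 groups at 2-bit and 19 groups at 4-bit, as listed in the recipe.
routed = {2: 110, 4: 19}
print(round(effective_bits(routed), 4))  # 2.2946, i.e. (110*2 + 19*4) / 129
```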

Use

Install the JANG loader/runtime and MLX LM:

pip install mlx-lm jang-tools

Example:

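from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

# Load the quantized model and tokenizer from the Hub repository.
model, tokenizer = load_jangtq_model("osmapi/DeepSeek-V4-Flash-TQ-Q2.3-MLX")
prompt = "Write a short note about MLX quantization."
# verbose=True streams the output and timing stats while generating.
text = generate(model, tokenizer, prompt=prompt, verbose=True)
# The returned string holds the full completion for further use.
print(text)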

Files

  • model-*.safetensors: pre-stacked JANGTQ/MLX shards
  • model.safetensors.index.json: shard index
  • jangtq_runtime.safetensors: required TurboQuant runtime sidecar
  • config.json, jang_config.json: MLX/JANGTQ metadata
  • encoding/: upstream DeepSeek-V4 prompt encoding reference
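After a partial download, it can help to confirm that every shard named in the index is actually on disk. This is a small sketch, assuming the index follows the usual Hugging Face layout with a `weight_map` from tensor name to shard filename; `missing_shards` is a hypothetical helper.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """Return shard filenames referenced by the index but absent from model_dir."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    # Many tensors map to the same shard, so deduplicate before checking.
    shards = sorted(set(index["weight_map"].values()))
    return [name for name in shards if not (model_dir / name).is_file()]
```

An empty list means all indexed shards are present. It does not check the sidecar or the config files, which are not listed in the index.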

Notes

This card follows the same broad shape as the other osmapi DeepSeek-V4-Flash MLX uploads: a sidecar warning, an explicit recipe table, and minimal reproducible loading instructions. Q2.3 is an aggressive size-first TurboQuant profile, so treat it as experimental until evaluated on your target prompts.

License

MIT, following the upstream DeepSeek-V4-Flash release.
