DeepSeek-V4-Flash-TQ-Q2.3-MLX

osmapi/DeepSeek-V4-Flash-TQ-Q2.3-MLX is an Apple-Silicon MLX TurboQuant/JANGTQ quantization of deepseek-ai/DeepSeek-V4-Flash.

No fine-tuning, distillation, or retraining was applied. The official mixed FP4/FP8 source weights were converted locally. The MTP (multi-token prediction) head was dropped because it is not used during standard decoding, and the router, mHC, and control tensors were preserved rather than aggressively quantized.

Model Details

Property	Value
Base model	`deepseek-ai/DeepSeek-V4-Flash`
Architecture	DeepSeek-V4 Flash MoE, 284B total / 13B active, 1M context
Local profile	`JANGTQ-Q2.3`
Bundle size	88.03 GB
Layout	Pre-stacked MLX `switch_mlp` layout
MTP head	Dropped
Validation	Safetensors header/index validation, metadata validation
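
The header and index checks listed under Validation can be reproduced locally. The following is a minimal sketch, not the script used to produce this bundle. It relies only on the public safetensors file layout (an 8-byte little-endian header length followed by a JSON header) and on the `weight_map` key of `model.safetensors.index.json`. The function names `read_safetensors_header` and `check_index` are hypothetical.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def check_index(model_dir):
    """Check that every indexed tensor exists in the shard the index names.

    Returns a list of problems; an empty list means the index is consistent.
    """
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    problems = []
    headers = {}          # cache: shard file name -> parsed header
    missing_shards = set()  # report each missing shard only once
    for tensor_name, shard_name in index["weight_map"].items():
        shard_path = model_dir / shard_name
        if not shard_path.is_file():
            if shard_name not in missing_shards:
                missing_shards.add(shard_name)
                problems.append(f"missing shard: {shard_name}")
            continue
        if shard_name not in headers:
            headers[shard_name] = read_safetensors_header(shard_path)
        if tensor_name not in headers[shard_name]:
            problems.append(f"{tensor_name} not found in {shard_name}")
    return problems
```

Calling `check_index("<your-model-dir>")` on a complete download should return an empty list. This checks names only; it does not verify tensor shapes, dtypes, or data integrity.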

Required Sidecar

This is a JANGTQ/TurboQuant bundle and requires jangtq_runtime.safetensors from this repository. The sidecar stores the deterministic codebooks and Hadamard rotation signs used to decode the .tq_packed expert weights. If it is missing, re-download the full repository or fetch that file explicitly:

hf download osmapi/DeepSeek-V4-Flash-TQ-Q2.3-MLX jangtq_runtime.safetensors --local-dir <your-model-dir>
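
It may help to check for the sidecar before loading, so that a missing file is caught early. The sketch below is an assumption about how one might do this and is not part of the JANG tooling: `sidecar_present` and `ensure_sidecar` are hypothetical helpers, and the download step uses `hf_hub_download` from `huggingface_hub` to fetch the single file into the model directory.

```python
from pathlib import Path

REPO_ID = "osmapi/DeepSeek-V4-Flash-TQ-Q2.3-MLX"
SIDECAR = "jangtq_runtime.safetensors"


def sidecar_present(model_dir):
    """Return True if the TurboQuant sidecar exists in model_dir."""
    return (Path(model_dir) / SIDECAR).is_file()


def ensure_sidecar(model_dir):
    """Download the sidecar into model_dir if it is missing; return its path."""
    if not sidecar_present(model_dir):
        # Imported lazily so the presence check works without huggingface_hub.
        from huggingface_hub import hf_hub_download

        hf_hub_download(repo_id=REPO_ID, filename=SIDECAR, local_dir=model_dir)
    return Path(model_dir) / SIDECAR
```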

Quantization Recipe

Tensor class	Codec	Bits / handling
Routed experts	TurboQuant MXTQ	110 routed layer/projection groups at 2-bit MXTQ and 19 at 4-bit MXTQ
Routed effective bits	MXTQ	2.2946 bits
Attention, shared experts, compressor, indexer, embed, lm head	MLX affine	8-bit, group size 32
Norms, router, mHC, sinks, integer routing tables	passthrough	source precision preserved

The fractional 2.3-bit target is reached by mixing power-of-two lane widths (2-bit and 4-bit routed groups) rather than by using a true fractional width, because the current JANGTQ vectorized packer is stable on 2/4/8-bit lanes for DeepSeek-V4 expert dimensions.
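
The 2.2946-bit figure in the recipe table is consistent with a plain average over the 129 routed groups. The short calculation below assumes every layer/projection group holds the same number of weights; the card does not state how the average is weighted, so treat that as an assumption.

```python
# Routed layer/projection groups per lane width, from the recipe table.
groups_by_bits = {2: 110, 4: 19}

total_groups = sum(groups_by_bits.values())
total_bits = sum(bits * count for bits, count in groups_by_bits.items())

# Unweighted average, assuming every group holds the same number of weights.
effective_bits = total_bits / total_groups
print(f"{effective_bits:.4f}")  # 2.2946
```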

Use

Install the JANG loader/runtime and MLX LM:

pip install mlx-lm jang-tools

Example:

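from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

# Load the quantized weights and tokenizer.
model, tokenizer = load_jangtq_model("osmapi/DeepSeek-V4-Flash-TQ-Q2.3-MLX")
prompt = "Write a short note about MLX quantization."
# verbose=True already prints the generated text as it streams,
# so the returned string is not printed a second time.
text = generate(model, tokenizer, prompt=prompt, verbose=True)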

Files

model-*.safetensors: pre-stacked JANGTQ/MLX shards
model.safetensors.index.json: shard index
jangtq_runtime.safetensors: required TurboQuant runtime sidecar
config.json, jang_config.json: MLX/JANGTQ metadata
encoding/: upstream DeepSeek-V4 prompt encoding reference

Notes

This card follows the same broad shape as the other osmapi DeepSeek-V4-Flash MLX uploads: a sidecar warning, an explicit recipe table, and minimal reproducible loading instructions. Q2.3 is an aggressive size-first TurboQuant profile, so treat it as experimental until evaluated on your target prompts.

License

MIT, following the upstream DeepSeek-V4-Flash release.
