DeepSeek-V4-Flash-JANGTQ2

DeepSeek-V4-Flash at 79.6 GB on disk (down from the 149 GB FP4+FP8 source): uniform 2-bit JANGTQ quantization on the routed experts, 8-bit affine on the other quantized weights, and a preserved MTP head.

Source: deepseek-ai/DeepSeek-V4-Flash (43 transformer layers + 1 MTP head, 256 routed experts top-6 + 1 shared expert, 3 hash layers, MLA + mHC residuals, ~284 B total)
Quantization: uniform 2-bit MXTQ on routed-expert MLP + 8-bit affine on attention (wq_a/wq_b/wkv/wo_a/wo_b) / shared expert / Compressor / Indexer / embed / lm_head / MTP. RMSNorms, router gate, mHC fn matrices, attn_sink, ape stay fp16/fp32 passthrough.
Variant: std (preserves MTP layer 43; one-token-per-forward until a JANG runtime ships the accept/reject speculative-decode loop). The companion DeepSeek-V4-Flash-JANGTQ-K variant drops MTP for a smaller bundle.
Routed-expert layout: pre-stacked along axis 0 under ffn.experts.switch_mlp.{gate_proj, up_proj, down_proj} per the JANGTQ-PRESTACK STANDARD. Sidecar jangtq_runtime.safetensors (~24 KB) ships both (in=2048, bits=2) and (in=4096, bits=2) codebooks + sign-flip vectors for Swift runtimes.
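To illustrate what "pre-stacked along axis 0" means, here is a minimal NumPy sketch under loud assumptions: the per-expert source key pattern `ffn.experts.{i}.{proj}.weight`, the toy dimensions, and the helper `prestack_experts` are all hypothetical, and plain float arrays stand in for the packed 2-bit payload. Only the target names under `ffn.experts.switch_mlp` come from the layout described above.

```python
import numpy as np

PROJECTIONS = ("gate_proj", "up_proj", "down_proj")

def prestack_experts(per_expert, num_experts):
    """Stack per-expert matrices into one array per projection (expert index on axis 0)."""
    stacked = {}
    for proj in PROJECTIONS:
        # Hypothetical source key pattern; a real checkpoint may name these differently.
        mats = [per_expert[f"ffn.experts.{i}.{proj}.weight"] for i in range(num_experts)]
        stacked[f"ffn.experts.switch_mlp.{proj}"] = np.stack(mats, axis=0)
    return stacked

# Toy sizes for illustration only (the real model has 256 routed experts).
num_experts, hidden, inter = 4, 8, 16
rng = np.random.default_rng(0)
per_expert = {}
for i in range(num_experts):
    per_expert[f"ffn.experts.{i}.gate_proj.weight"] = rng.standard_normal((inter, hidden))
    per_expert[f"ffn.experts.{i}.up_proj.weight"] = rng.standard_normal((inter, hidden))
    per_expert[f"ffn.experts.{i}.down_proj.weight"] = rng.standard_normal((hidden, inter))

stacked = prestack_experts(per_expert, num_experts)
gate = stacked["ffn.experts.switch_mlp.gate_proj"]
print(gate.shape)  # (4, 16, 8): expert index first, then the per-expert matrix
```

With the expert index on axis 0, a runtime can select the experts chosen for a token by indexing one tensor instead of looking up 256 separate ones.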
Bundle size: ~79.6 GB on-disk
Runs on: M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+

Why top-6 + 2-bit holds

DSV4-Flash routes each token through 6 of 256 experts plus 1 always-on shared expert (and includes 3 hash layers), so per-token output averages codebook noise across 7 or more pathways. That is a much looser constraint on per-expert quantization error than in top-1 architectures, where every token carries a single expert's quantization error. MiniMax (top-2) and Hy3-preview (top-8) both ship coherent uniform JANGTQ2 builds; DSV4 sits between them.
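The averaging argument can be illustrated with a toy Monte Carlo sketch. It rests on idealized assumptions that are not claims about DSV4 itself: each pathway contributes independent, zero-mean Gaussian error of equal variance, and pathways are combined with equal weights (real router weights are unequal, and quantization errors may correlate). Under those assumptions, RMS error shrinks roughly as 1/sqrt(k), where k=7 corresponds to 6 routed experts plus 1 shared expert.

```python
import math
import random

def rms_error_of_average(k, trials=20000, sigma=1.0, seed=0):
    """RMS error of an equal-weight average over k independent noisy pathways."""
    rng = random.Random(seed)
    total_sq = 0.0
    for _ in range(trials):
        # Each pathway adds independent zero-mean noise (idealized quantization error).
        avg = sum(rng.gauss(0.0, sigma) for _ in range(k)) / k
        total_sq += avg * avg
    return math.sqrt(total_sq / trials)

# k=1 models a top-1 router; k=7 models 6 routed experts + 1 shared expert.
results = {k: rms_error_of_average(k) for k in (1, 2, 7, 8)}
for k, rms in results.items():
    print(f"k={k}: simulated RMS={rms:.3f}, 1/sqrt(k)={1 / math.sqrt(k):.3f}")
```

This is only a rough intuition for why more active experts tolerate coarser codebooks; it says nothing about measured quality on any benchmark.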

Loading (Python)

pip install jang-tools mlx-lm

from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ2")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
)

load_jangtq_model auto-registers model_type=deepseek_v4 via jang_tools.dsv4 before building the MLX skeleton. The loader applies the DSV4-specific MLA absorb + fp32 SDPA + mHC + Compressor + Indexer patches automatically.

Runtime support matrix

Surface	Status
`jang-tools` Python (`load_jangtq_model`)	✅ working
`vmlx-swift-lm` Swift	✅ working — `DeepseekV4JANGTQ` family path
MTP speculative decode	preserved-disabled — weights present (variant=std); accept/reject loop not yet in any JANG runtime

Validated runtime contract

43 transformer layers + 1 MTP head materialize; 40 sparse-MoE layers hydrate routed experts via TurboQuantLinear (2-bit MXTQ).
33,792 MXTQ tensors / 522 affine / 706 passthrough.
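As a sanity check on the MXTQ figure, the count matches one simple reading, which is an inference from the numbers in this card rather than a documented breakdown: 44 blocks (43 transformer layers plus the MTP head), each with 256 routed experts and three projections per expert, counted per expert before stacking. This reading implies the 3 hash layers also carry routed experts, which the card does not state explicitly.

```python
# Assumed breakdown (inferred from this card's numbers, not confirmed by the loader).
blocks = 43 + 1            # transformer layers + MTP head
routed_experts = 256       # per block
projections = 3            # gate_proj, up_proj, down_proj

mxtq_tensors = blocks * routed_experts * projections
print(mxtq_tensors)  # 33792
```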
Capabilities: family=deepseek_v4, reasoning_parser=deepseek_r1, tool_parser=dsml, think_in_template=True, cache_type=mla.

Reasoning + tools

Reasoning parser: deepseek_r1
Tool parser: dsml (DeepSeek Markup Language — distinct from deepseek_tool_parser; see ~/jang/research/DSV4-EVAL-NUANCES.md)
Reasoning template: <｜thinking_begin｜>...<｜thinking_end｜> blocks via enable_thinking=True (default off, which gives pass-through chat mode). Greedy decoding (T=0) with enable_thinking=True collapses into repetition on DSV4; use T=0.6 for pass@1, as in the original DeepSeek release.
Cache: mla (Multi-head Latent Attention with kv_lora_rank=512)
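A minimal sketch of a thinking-mode request that follows the temperature guidance above. The helper names `build_thinking_prompt` and `generate_with_thinking` are hypothetical, and the sketch assumes that the objects returned by `load_jangtq_model` work with `mlx_lm.generate` and that the installed mlx-lm version provides `make_sampler` in `mlx_lm.sample_utils`; neither is confirmed by this card.

```python
THINKING_TEMPERATURE = 0.6  # from the guidance above; greedy T=0 is reported to loop

def build_thinking_prompt(tokenizer, user_text):
    """Render a single-turn chat prompt with thinking blocks enabled."""
    return tokenizer.apply_chat_template(
        [{"role": "user", "content": user_text}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,
    )

def generate_with_thinking(user_text, max_tokens=512):
    # Lazy imports: these need an MLX environment and the ~80 GB bundle on disk.
    from jang_tools.load_jangtq import load_jangtq_model
    from mlx_lm import generate
    from mlx_lm.sample_utils import make_sampler

    model, tokenizer = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ2")
    prompt = build_thinking_prompt(tokenizer, user_text)
    # Assumption: the JANGTQ model/tokenizer pair is compatible with mlx_lm.generate.
    return generate(
        model,
        tokenizer,
        prompt=prompt,
        max_tokens=max_tokens,
        sampler=make_sampler(temp=THINKING_TEMPERATURE),
    )
```

Calling `generate_with_thinking("What is 2 + 2? Answer briefly.")` would load the bundle and sample one response; the raw text still needs the deepseek_r1 reasoning parser to separate the thinking block from the answer.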

Credits

Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
Source model: DeepSeek AI
License: MIT, inherited from upstream
