OsaurusAI

DeepSeek-V4-Flash-JANGTQ2

DeepSeek-V4-Flash at 79.6 GB on disk (down from the 149 GB FP4+FP8 source): uniform 2-bit JANGTQ quantization on routed experts, 8-bit affine on the remaining quantized weights, and a preserved MTP head.

  • Source: deepseek-ai/DeepSeek-V4-Flash (43 transformer layers + 1 MTP head, 256 routed experts top-6 + 1 shared expert, 3 hash layers, MLA + mHC residuals, ~284 B total)
  • Quantization: uniform 2-bit MXTQ on routed-expert MLP + 8-bit affine on attention (wq_a/wq_b/wkv/wo_a/wo_b) / shared expert / Compressor / Indexer / embed / lm_head / MTP. RMSNorms, router gate, mHC fn matrices, attn_sink, ape stay fp16/fp32 passthrough.
  • Variant: std (preserves MTP layer 43; one-token-per-forward until a JANG runtime ships the accept/reject speculative-decode loop). The companion DeepSeek-V4-Flash-JANGTQ-K variant drops MTP for a smaller bundle.
  • Routed-expert layout: pre-stacked along axis 0 under ffn.experts.switch_mlp.{gate_proj, up_proj, down_proj} per the JANGTQ-PRESTACK STANDARD. Sidecar jangtq_runtime.safetensors (~24 KB) ships both (in=2048, bits=2) and (in=4096, bits=2) codebooks + sign-flip vectors for Swift runtimes.
  • Bundle size: ~79.6 GB on-disk
  • Runs on: M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
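
The size figures above imply a rough average storage cost per weight. The arithmetic below is a back-of-the-envelope sketch, not a measurement: it assumes 1 GB = 1e9 bytes and treats the "~284 B total" figure as the parameter count behind the bundle.

```python
# Back-of-the-envelope sizing from the figures quoted in this card.
# Assumptions (not stated by the source): 1 GB = 1e9 bytes, and
# "~284 B total" is the parameter count stored in the bundle.
GB = 1e9

bundle_bytes = 79.6 * GB
source_bytes = 149 * GB
total_params = 284e9

size_ratio = bundle_bytes / source_bytes          # fraction of the source size
bits_per_param = bundle_bytes * 8 / total_params  # average stored bits per weight
headroom_gb = 128 - 79.6                          # memory left on a 128 GB machine

print(f"bundle / source : {size_ratio:.1%}")
print(f"avg bits/param  : {bits_per_param:.2f}")
print(f"headroom @128 GB: {headroom_gb:.1f} GB")
```

Under these assumptions the bundle is a little over half the source size and averages slightly above 2 bits per parameter, which fits a layout where the 2-bit routed experts dominate the weight count. The headroom figure has to cover the KV cache, activations, and the OS.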

Why top-6 + 2-bit holds

DSV4-Flash routes each token through 6 of 256 experts, plus 1 always-on shared expert and 3 hash layers, so the per-token output averages codebook noise across 7+ pathways. That is a much weaker quality constraint than in top-1 architectures, where every token rides a single expert's quantization error. MiniMax (top-2) and Hy3-preview (top-8) both ship coherent uniform JANGTQ2 builds; DSV4 sits between them.
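
The averaging argument can be illustrated with a toy simulation. This is an idealized sketch, not a model of the real network: it assumes each pathway contributes an independent, zero-mean, unit-variance error and that pathways are combined with equal weights, whereas real router weights are non-uniform and quantization errors may be correlated.

```python
import math
import random
import statistics

def combined_error_std(num_paths, trials=20000, seed=0):
    """Empirical std of the mean of `num_paths` independent unit-variance errors."""
    rng = random.Random(seed)
    samples = [
        sum(rng.gauss(0.0, 1.0) for _ in range(num_paths)) / num_paths
        for _ in range(trials)
    ]
    return statistics.pstdev(samples)

top1 = combined_error_std(1)  # a single expert carries the whole error
top7 = combined_error_std(7)  # 6 routed experts + 1 shared expert

print(f"1 pathway : {top1:.3f}")
print(f"7 pathways: {top7:.3f} (ideal 1/sqrt(7) = {1 / math.sqrt(7):.3f})")
```

With independent errors the combined noise shrinks roughly as 1/sqrt(k), so seven pathways carry well under half the error of one. Correlated errors or a dominant routing weight would reduce that benefit.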

Loading (Python)

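# Shell: install the loader and the MLX runtime
pip install jang-tools mlx-lm

# Python: load the bundle and build a chat prompt
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ2")

# Render the chat template to a prompt string (no tokenization yet)
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
)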

load_jangtq_model auto-registers model_type=deepseek_v4 via jang_tools.dsv4 before building the MLX skeleton. The loader applies the DSV4-specific MLA absorb + fp32 SDPA + mHC + Compressor + Indexer patches automatically.

Runtime support matrix

Surface                                Status
jang-tools Python (load_jangtq_model)  ✅ working
vmlx-swift-lm Swift                    ✅ working (DeepseekV4JANGTQ family path)
MTP speculative decode                 preserved but disabled: weights present (variant=std); accept/reject loop not yet in any JANG runtime
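
For readers unfamiliar with the accept/reject loop mentioned in the last row, here is a generic greedy speculative-decoding sketch. It is not JANG runtime code: `target_next` and `draft_next` are hypothetical callables standing in for the main model and a draft source such as an MTP head, and the toy models at the bottom exist only to exercise the loop.

```python
def greedy(target_next, prompt, max_new_tokens):
    """Plain greedy decoding: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens

def speculative_greedy(target_next, draft_next, prompt, max_new_tokens, k=1):
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    tokens = list(prompt)
    verify_rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft k candidate tokens autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify. A real runtime scores all positions in one batched
        #    forward pass, so each round is counted as a single target call.
        verify_rounds += 1
        accepted, ctx = [], list(tokens)
        for t in draft:
            expected = target_next(ctx)
            if expected != t:
                accepted.append(expected)  # reject: keep the target's token, stop
                break
            accepted.append(t)             # accept the drafted token
            ctx.append(t)
        else:
            accepted.append(target_next(ctx))  # all accepted: bonus target token
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], verify_rounds

# Toy stand-ins (hypothetical): the target counts upward modulo 10.
toy_target = lambda ctx: (ctx[-1] + 1) % 10
perfect_draft = toy_target
bad_draft = lambda ctx: 0

baseline = greedy(toy_target, [0], 6)
out_good, calls_good = speculative_greedy(toy_target, perfect_draft, [0], 6)
out_bad, calls_bad = speculative_greedy(toy_target, bad_draft, [0], 6)
print(baseline, calls_good, calls_bad)
```

The property that matters is that the output always matches plain greedy decoding from the target; the draft only changes how many verification rounds are needed. Until a runtime implements such a loop, the preserved MTP weights add no decoding speedup.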

Validated runtime contract

  • 43 transformer layers + 1 MTP head materialize; 40 sparse-MoE layers hydrate routed experts via TurboQuantLinear (2-bit MXTQ).
  • 33,792 MXTQ tensors / 522 affine / 706 passthrough.
  • Capabilities: family=deepseek_v4, reasoning_parser=deepseek_r1, tool_parser=dsml, think_in_template=True, cache_type=mla.
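
The tensor counts above lend themselves to a quick arithmetic check. The sketch below sums the three categories and tests one possible decomposition of the MXTQ count. The decomposition is an assumption, not something the source states: it supposes each routed-expert projection is counted once per expert across 43 transformer layers plus the MTP head.

```python
# Figures quoted in the runtime contract.
mxtq, affine, passthrough = 33_792, 522, 706
total_tensors = mxtq + affine + passthrough

# Assumed decomposition (not confirmed by the source):
# (43 layers + 1 MTP head) x 256 routed experts x 3 projections.
layers_with_experts = 43 + 1
experts_per_layer = 256
projections = 3  # gate_proj, up_proj, down_proj
expected_mxtq = layers_with_experts * experts_per_layer * projections

print("total tensors:", total_tensors)
print("decomposition matches MXTQ count:", expected_mxtq == mxtq)
```

The product matches the quoted MXTQ count exactly, but that is only one reading; the card does not say how the count was derived or whether it was taken before or after pre-stacking.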

Reasoning + tools

  • Reasoning parser: deepseek_r1
  • Tool parser: dsml (DeepSeek Markup Language — distinct from deepseek_tool_parser; see ~/jang/research/DSV4-EVAL-NUANCES.md)
  • Reasoning template: <|thinking_begin|>...<|thinking_end|> blocks, enabled with enable_thinking=True (off by default, which gives pass-through chat mode). Greedy decoding (T=0) with enable_thinking=True collapses into repetition on DSV4; sample at T=0.6 for pass@1, as in the original DeepSeek release.
  • Cache: mla (Multi-head Latent Attention with kv_lora_rank=512)
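
To show what separating reasoning from the final answer involves, here is a minimal sketch that splits decoded text on the thinking tags listed above. It is not the deepseek_r1 parser: `split_reasoning` is a hypothetical helper, and it assumes the tags appear literally in the decoded output.

```python
import re

# Matches one thinking block; DOTALL lets the block span multiple lines.
THINK_RE = re.compile(r"<\|thinking_begin\|>(.*?)<\|thinking_end\|>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer) from raw decoded model output."""
    blocks = THINK_RE.findall(text)
    reasoning = "\n".join(block.strip() for block in blocks)
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

raw = "<|thinking_begin|>2 + 2 = 4<|thinking_end|>4"
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer   :", answer)
```

Output without thinking tags passes through unchanged as the answer, which mirrors the default pass-through chat mode.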

Credits

  • Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
  • Source model: DeepSeek AI
  • License: MIT, inherited from upstream