OsaurusAI

DeepSeek-V4-Flash-JANGTQ-K

JANGTQ-K quantization of deepseek-ai/DeepSeek-V4-Flash for Apple Silicon MLX / vMLX-Swift runtimes.

Source: deepseek-ai/DeepSeek-V4-Flash
License: MIT, inherited from upstream
Format: JANGTQ (MXTQ routed + affine non-routed)
Profile: JANGTQ_K
Bundle size: ~80 GiB (85.9 GB) across 80 shards
Tensor keys: 2610
Routed expert layout: Pre-stacked switch_mlp
Context length: 1,048,576 (1M)
Modality: text
Native cache schema: deepseek_v4_v7
Runtime status: Coherent on the relaxed 4-turn reasoning_effort=max quality-chat gate; NOT strict exact-copy cleared. See "Live Quality Boundary" below.

Important Runtime Note

This bundle requires a DSV4-aware runtime that implements:

  • DSV4 composite SWA + CSA/HCA cache (schema deepseek_v4_v7).
  • Mixed cache layout: layers 0,1 = KVCache, layers 2..42 = DeepseekV4Cache.
  • Sparse Indexer + Compressor with per-layer compression ratios.
  • mHC (multi-Head Composition) controls preserved as float32.
  • DSML tool-calling parser and the DSV4 chat encoder shim.

DSV4 is not a stock mlx_lm architecture.
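
The mixed cache layout listed above can be written down as a per-layer plan. This is a minimal sketch, assuming the 43 decoder layers stated in the architecture summary. The function name `cache_plan` is hypothetical, and the strings only label the cache classes named above; they do not instantiate them.

```python
NUM_LAYERS = 43  # decoder layers, per the architecture summary


def cache_plan(num_layers: int = NUM_LAYERS) -> list[str]:
    """Return the cache type name expected for each decoder layer."""
    # Layers 0 and 1 use a plain KVCache; layers 2..42 use DeepseekV4Cache.
    return ["KVCache" if i < 2 else "DeepseekV4Cache" for i in range(num_layers)]


plan = cache_plan()
print(plan[:3], plan.count("DeepseekV4Cache"))
```

A runtime could compare a plan like this against the caches it actually allocates before loading the weights.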

Runtime Pin Required

Use a vMLX runtime build with DSV4 JANGTQ support:

  • Python: vMLX 7cc4927c or newer (composite deepseek_v4_v7, generic TQ-KV off, paged block 256, single-sequence forced).
  • Swift: vmlx-swift-lm build that includes DeepseekV4JANGTQ model + DSV4LayerCache pool accumulation.

Recommended runtime defaults baked into this bundle's jang_config.json:

  • DSV4_LONG_CTX=1
  • DSV4_POOL_QUANT=0 (opt-in only)
  • Paged cache block size = 256 (the loader upgrades stale 64-token settings).
  • Block disk L2 = on.
  • Chunked prefill = off by default.

Do not re-enable generic KV quantization, 64-token DSV4 pages, or pool quant for this bundle.
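
A pre-flight check for these defaults might look like the sketch below. The dictionary key names other than DSV4_LONG_CTX and DSV4_POOL_QUANT are hypothetical; the actual jang_config.json schema is not shown here and may differ.

```python
# Recommended values from this section. Key names such as
# "paged_cache_block_size" are illustrative, not the real schema.
RECOMMENDED = {
    "DSV4_LONG_CTX": 1,
    "DSV4_POOL_QUANT": 0,
    "paged_cache_block_size": 256,
    "block_disk_l2": True,
    "chunked_prefill": False,
    "generic_kv_quant": False,
}


def check_runtime_settings(settings: dict) -> list[str]:
    """List every setting that deviates from the recommended defaults."""
    problems = []
    for key, expected in RECOMMENDED.items():
        actual = settings.get(key, expected)  # a missing key counts as default
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    return problems


# Example: a stale 64-token page size plus pool quant switched on.
stale = dict(RECOMMENDED, paged_cache_block_size=64, DSV4_POOL_QUANT=1)
for line in check_runtime_settings(stale):
    print(line)
```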

Architecture Summary

  • 43 decoder layers + sparse indexer + compressor (no MTP head).
  • Hidden size 4096, 64 attention heads, 64 indexer heads, head dim 512, index head dim 128, indexer top-k 512.
  • MLA with q_lora_rank=1024, o_lora_rank=1024, qk_rope_head_dim=64.
  • 256 routed experts + 1 shared expert per MoE layer, top-6 routing, routed_scaling_factor=1.5, topk_method=noaux_tc.
  • 3 hash layers at L0/L1/L2.
  • Sliding window 128, rope_theta=10000, indexer compress_rope_theta=160000.
  • mHC (multi-Head Composition): hc_mult=4, hc_sinkhorn_iters=20, hc_eps=1e-6.
  • SwiGLU clip swiglu_limit=10.0.
  • Context length 1,048,576 with YaRN scaling.
  • Reasoning: dual mode (chat / thinking) with enable_thinking + reasoning_effort ∈ {max, high, null}.
  • Tool calling: DSML parser (<|DSML|> blocks).
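
To make the top-6 routing and routed_scaling_factor concrete, here is a simplified sketch. It assumes a sigmoid gate with normalisation over the selected experts, modelled on earlier DeepSeek MoE releases, and it omits the bias correction and any grouping that topk_method=noaux_tc implies. The function `route` is hypothetical and is not the runtime's router.

```python
import math

NUM_ROUTED = 256       # routed experts per MoE layer
TOP_K = 6              # experts selected per token
ROUTED_SCALING = 1.5   # routed_scaling_factor


def route(logits: list[float], top_k: int = TOP_K, scale: float = ROUTED_SCALING):
    """Pick top_k experts from router logits and return (index, weight) pairs."""
    # Sigmoid gate: an assumption, not confirmed by this model card.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    # Normalise the selected scores, then apply the routed scaling factor.
    return [(i, scale * scores[i] / total) for i in chosen]


logits = [math.sin(i) for i in range(NUM_ROUTED)]  # toy logits for illustration
picked = route(logits)
print(len(picked), round(sum(w for _, w in picked), 6))
```

Under these assumptions the routed weights for a token sum to the scaling factor, and the shared expert's output is added separately.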

Quantization Recipe

Category | Bits | Codec | Notes
Routed experts (default, 38 layers) | 2 | MXTQ | L0–L22, L24, L26, L27, L29–L33, L35, L37–L42
Routed experts (5 lifted layers) | 4 | MXTQ | L23, L25, L28, L34, L36 (top-5 most-damaged-at-2-bit per real-activation probe)
Attention (wq_a/wq_b/wkv/wo_a/wo_b) | 8 | affine, gsz=32 |
Shared expert | 8 | affine, gsz=32 |
Compressor + Indexer + Indexer.Compressor | 8 | affine, gsz=32 |
embed_tokens + lm_head | 8 | affine, gsz=32 |
Norms / router gate / hc_* fn matrices | 16 | passthrough |
hc_*_base, hc_*_scale, attn_sink, ape | 32 | source-f32 | "F32" critical controls preserved
MTP head | - | dropped | drop_mtp=true; num_nextn_predict_layers=0
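
The routed-expert bit plan can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `routed_bits` is a hypothetical helper, and the average treats every layer as holding the same number of routed-expert parameters and ignores codebook and scale overhead.

```python
NUM_LAYERS = 43
LIFTED_LAYERS = {23, 25, 28, 34, 36}  # routed experts kept at 4-bit


def routed_bits(layer: int) -> int:
    """Bit width of the routed-expert weights in a given decoder layer."""
    if not 0 <= layer < NUM_LAYERS:
        raise ValueError(f"layer {layer} out of range")
    return 4 if layer in LIFTED_LAYERS else 2


bit_plan = [routed_bits(i) for i in range(NUM_LAYERS)]
two_bit = bit_plan.count(2)
avg_bits = sum(bit_plan) / NUM_LAYERS  # (38 * 2 + 5 * 4) / 43
print(two_bit, len(LIFTED_LAYERS), round(avg_bits, 3))
```

The count confirms the 38 + 5 split, and the nominal average comes to roughly 2.23 bits per routed weight.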

JANGTQ Runtime Sidecar

jangtq_runtime.safetensors ships the Swift JANGTQ MXTQ runtime contract:

codebook.2048.2   codebook.2048.4
codebook.4096.2   codebook.4096.4
signs.2048.42     signs.4096.42

The Swift loader will hard-error without this sidecar.
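
Before handing the bundle to a loader, the sidecar can be inspected by reading only the safetensors header (an 8-byte little-endian length followed by a JSON table). The sketch below assumes the six names listed above are literal tensor keys; both helper functions are hypothetical.

```python
import json
import struct

REQUIRED_KEYS = {
    "codebook.2048.2", "codebook.2048.4",
    "codebook.4096.2", "codebook.4096.4",
    "signs.2048.42", "signs.4096.42",
}


def safetensors_keys(path: str) -> set[str]:
    """Read tensor names from a safetensors header without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return set(header)


def missing_sidecar_keys(path: str) -> set[str]:
    """Return the required sidecar keys that are absent from the file."""
    return REQUIRED_KEYS - safetensors_keys(path)
```

Calling `missing_sidecar_keys("jangtq_runtime.safetensors")` should return an empty set for a complete bundle, if the assumption about literal key names holds.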

Live Quality Boundary

The kept JANGTQ-K candidate passed the relaxed quality-chat bar:

  • 4-turn Chat Completions transcript with reasoning_effort=max.
  • Visible answers correct for arithmetic and memory recall (remembered HARBOR-17 and CERULEAN).
  • Final answer followed the requested 3-bullet format.
  • No loop, no reasoning-only empty visible answer, no API error.
  • Speed: 14.6 – 17.6 tok/s on M3 Ultra Mac Studio.
  • Paged DSV4 cache hit on turns 2–4 with 399 cached tokens saved.
  • Block-L2 wrote DSV4 composite blocks.

NOT strict exact-copy cleared. Short exact-marker rows can flip to neighboring BPE tokens; this reproduces with prefix-cache bypass and is not a UI, gateway, streaming, or disk-cache bug.

Raw max (VMLINUX_DSV4_RAW_MAX=1) had only a 2-turn smoke and is not advertised as fully cleared.

Chat Template

This bundle does not ship a standalone chat_template.jinja. Prompt rendering routes through the bundled DSV4 chat encoder (encoding/encoding_dsv4.py), driven by the DSV4 chat encoder shim installed by vmlx_engine.loaders.load_jangtq_dsv4.load_jangtq_dsv4_model. The shim preserves enable_thinking and reasoning_effort kwargs.
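
The two kwargs the shim preserves can be validated on the caller's side before a request is rendered. This is only a sketch: `chat_kwargs` is a hypothetical helper, the mapping of `null` to Python `None` is an assumption, and how the encoder actually consumes these values is not shown in this card.

```python
ALLOWED_EFFORT = {"max", "high", None}  # reasoning_effort values listed above


def chat_kwargs(enable_thinking: bool, reasoning_effort=None) -> dict:
    """Validate and package the two kwargs the shim is said to preserve."""
    if reasoning_effort not in ALLOWED_EFFORT:
        raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}")
    return {
        "enable_thinking": bool(enable_thinking),
        "reasoning_effort": reasoning_effort,
    }


print(chat_kwargs(True, "max"))
```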

Sampling Defaults

temperature=0.6, top_p=0.95
repetition_penalty=1.0 (1.0 thinking, 1.05 chat)
max_new_tokens=4096

generation_config.json ships the HF defaults (temperature=1.0, top_p=1.0, do_sample=true, EOS=[1, 128803]); the JANG chat sampling defaults above are the ones used by vMLX's panel.
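
One way to read the defaults above is as a per-mode parameter set. The sketch below interprets the parenthetical as a repetition penalty of 1.0 in thinking mode and 1.05 in chat mode, which is an interpretation rather than something this card states outright. `sampling_defaults` is a hypothetical helper.

```python
def sampling_defaults(thinking: bool) -> dict:
    """Return the JANG chat sampling defaults for the given mode."""
    return {
        "temperature": 0.6,
        "top_p": 0.95,
        # Assumed per-mode split: 1.0 when thinking, 1.05 in plain chat.
        "repetition_penalty": 1.0 if thinking else 1.05,
        "max_new_tokens": 4096,
    }


print(sampling_defaults(thinking=False))
```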

Runtime Smoke Tests

Before production use, run short deterministic prompts through the exact target runtime:

  • What is 2+2? Answer with only the number.
  • What is the capital of France? Answer with one word.
  • One chat-template prompt with thinking disabled.
  • One chat-template prompt with reasoning_effort=max and enough output budget for the final answer.
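
The two deterministic prompts above can be wrapped in a small harness. This sketch assumes the target runtime can be exposed as a `generate(prompt) -> str` callable, which is hypothetical; `fake_generate` is a stand-in used only to show the shape of the harness.

```python
from typing import Callable

SMOKE_PROMPTS = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]


def run_smoke(generate: Callable[[str], str]) -> list[tuple[str, bool]]:
    """Run each prompt and report whether the expected answer appears."""
    results = []
    for prompt, expected in SMOKE_PROMPTS:
        reply = generate(prompt).strip()
        # Relaxed check: case-insensitive substring match, not exact copy.
        results.append((prompt, expected.lower() in reply.lower()))
    return results


def fake_generate(prompt: str) -> str:
    """Stand-in for the real runtime; replace with an actual model call."""
    return "4" if "2+2" in prompt else "Paris."


for prompt, ok in run_smoke(fake_generate):
    print("PASS" if ok else "FAIL", prompt)
```

The substring match mirrors the relaxed bar described earlier; a strict exact-copy gate would compare the stripped reply for equality instead.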

Lineage

Derived from converter variant V3 with plan routed_only_worst5_23_25_28_34_36 (sha256 3db0b31fe6f1b19d3e00cfdd15572ebf3af950ef25e1e8622e1f2791b1977619). Originally staged as DeepSeek-V4-Flash-JANGTQ-V3-WORST5-F32; renamed to JANGTQ-K on 2026-05-11 as the canonical DSV4 max-quality tier.

Files

  • config.json — DSV4 HF config (model_type=deepseek_v4, expert_dtype=fp4).
  • jang_config.json — JANG profile, recipe, bit plan, runtime requirements, chat encoder, sampling defaults, lineage.
  • model-00001-of-00080.safetensors … model-00080-of-00080.safetensors — sharded weights.
  • model.safetensors.index.json — tensor → shard map (2610 keys).
  • jangtq_runtime.safetensors — JANGTQ MXTQ runtime sidecar.
  • tokenizer.json, tokenizer_config.json — preserved upstream tokenizer (model_max_length=1048576).
  • generation_config.json — HF defaults.
  • encoding/encoding_dsv4.py — DSV4 chat encoder (Python).
  • LICENSE — MIT, upstream.
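
A downloaded bundle can be checked against the counts above by reading model.safetensors.index.json. This sketch assumes the index follows the usual sharded-checkpoint layout with a top-level `weight_map` object; both function names are hypothetical.

```python
import json


def summarize_index(index: dict) -> tuple[int, int]:
    """Return (tensor key count, distinct shard count) from a parsed index."""
    weight_map = index["weight_map"]  # tensor name -> shard filename
    return len(weight_map), len(set(weight_map.values()))


def check_index(path: str, keys: int = 2610, shards: int = 80) -> bool:
    """Compare an index file on disk with the expected key and shard counts."""
    with open(path, encoding="utf-8") as f:
        return summarize_index(json.load(f)) == (keys, shards)


# Toy index used only to demonstrate the summary.
toy = {"weight_map": {"a": "s1.safetensors", "b": "s1.safetensors", "c": "s2.safetensors"}}
print(summarize_index(toy))
```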

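Summary

This bundle is a quantization of deepseek-ai/DeepSeek-V4-Flash with the JANGTQ-K profile for Apple Silicon MLX / vMLX-Swift runtimes. It should only be used on a runtime that correctly implements the DSV4 composite SWA + CSA/HCA cache, the Sparse Indexer + Compressor, and the mHC controls. Routed experts are 2-bit by default, with 5 layers (23/25/28/34/36) kept at 4-bit; non-routed weights are 8-bit affine, and critical control tensors are preserved in source float32. The MTP head has been removed for inference.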

Contact

eric@osaurus.ai
