DeepSeek-V4-Flash-JANGTQ-K

JANGTQ-K quantization of deepseek-ai/DeepSeek-V4-Flash for Apple Silicon MLX / vMLX-Swift runtimes.


Source	deepseek-ai/DeepSeek-V4-Flash
License	MIT, inherited from upstream
Format	JANGTQ (MXTQ routed + affine non-routed)
Profile	JANGTQ_K
Bundle size	~80 GiB (85.9 GB) across 80 shards
Tensor keys	2610
Routed expert layout	Pre-stacked `switch_mlp`
Context length	1,048,576 (1M)
Modality	text
Native cache schema	`deepseek_v4_v7`
Runtime status	Coherent on the relaxed 4-turn `reasoning_effort=max` quality-chat gate; NOT strict exact-copy cleared. See "Live Quality Boundary" below.

Important Runtime Note

This bundle requires a DSV4-aware runtime that implements:

DSV4 composite SWA + CSA/HCA cache (schema deepseek_v4_v7).
Mixed cache layout: layers 0,1 = KVCache, layers 2..42 = DeepseekV4Cache.
Sparse Indexer + Compressor with per-layer compression ratios.
mHC (multi-Head Composition) controls preserved as float32.
DSML tool-calling parser and the DSV4 chat encoder shim.

DSV4 is not a stock mlx_lm architecture.
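
As an illustration of the mixed cache layout listed above, the per-layer assignment can be sketched as follows. This is a minimal sketch: the strings are stand-ins for the runtime's cache classes, and the function name `cache_kind_for_layer` is hypothetical, not part of vMLX.

```python
# Hypothetical sketch: the returned strings stand in for the runtime's real cache classes.
NUM_LAYERS = 43  # decoder layers 0..42


def cache_kind_for_layer(layer_idx: int) -> str:
    """Return the cache type name this card assigns to a decoder layer."""
    if not 0 <= layer_idx < NUM_LAYERS:
        raise ValueError(f"layer index {layer_idx} outside 0..{NUM_LAYERS - 1}")
    # Layers 0 and 1 use a plain KV cache; layers 2..42 use the DSV4 composite cache.
    return "KVCache" if layer_idx < 2 else "DeepseekV4Cache"


layout = [cache_kind_for_layer(i) for i in range(NUM_LAYERS)]
```

A runtime that builds one cache type for every layer would not match this layout.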

Runtime Pin Required

Use a vMLX runtime build with DSV4 JANGTQ support:

Python: vMLX 7cc4927c or newer (composite deepseek_v4_v7, generic TQ-KV off, paged block 256, single-sequence forced).
Swift: vmlx-swift-lm build that includes DeepseekV4JANGTQ model + DSV4LayerCache pool accumulation.

Recommended runtime defaults baked into this bundle's jang_config.json:

DSV4_LONG_CTX=1
DSV4_POOL_QUANT=0 (opt-in only)
Paged cache block size = 256 (the loader upgrades stale 64-token settings).
Block disk L2 = on.
Chunked prefill = off by default.

Do not re-enable generic KV quantization, 64-token DSV4 pages, or pool quant for this bundle.
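
The defaults and restrictions above could be enforced with a small settings check like the one below. It is a sketch under assumptions: the dictionary keys `paged_block_size`, `block_disk_l2`, `chunked_prefill`, and `kv_quant` are hypothetical names, and the real keys in jang_config.json may differ.

```python
# Hypothetical setting names; the real keys in jang_config.json may differ.
RECOMMENDED = {
    "DSV4_LONG_CTX": 1,
    "DSV4_POOL_QUANT": 0,
    "paged_block_size": 256,
    "block_disk_l2": True,
    "chunked_prefill": False,
}


def normalize_settings(user: dict) -> dict:
    """Overlay user settings on the recommended defaults and enforce the card's limits."""
    merged = {**RECOMMENDED, **user}
    # Stale 64-token pages are upgraded to 256, mirroring what the card says the loader does.
    if merged["paged_block_size"] == 64:
        merged["paged_block_size"] = 256
    if merged["DSV4_POOL_QUANT"]:
        raise ValueError("pool quant should stay off for this bundle")
    if merged.get("kv_quant"):
        raise ValueError("generic KV quantization should stay off for this bundle")
    return merged
```

For example, `normalize_settings({"paged_block_size": 64})` returns the defaults with the block size raised to 256.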

Architecture Summary

43 decoder layers + sparse indexer + compressor (no MTP head).
Hidden size 4096, 64 attention heads, 64 indexer heads, head dim 512, index head dim 128, indexer top-k 512.
MLA with q_lora_rank=1024, o_lora_rank=1024, qk_rope_head_dim=64.
256 routed experts + 1 shared expert per MoE layer, top-6 routing, routed_scaling_factor=1.5, topk_method=noaux_tc.
3 hash layers at L0/L1/L2.
Sliding window 128, rope_theta=10000, indexer compress_rope_theta=160000.
mHC (multi-Head Composition): hc_mult=4, hc_sinkhorn_iters=20, hc_eps=1e-6.
SwiGLU clip swiglu_limit=10.0.
Context length 1,048,576 with YaRN scaling.
Reasoning: dual mode (chat / thinking) with enable_thinking + reasoning_effort ∈ {max, high, null}.
Tool calling: DSML parser (<｜DSML｜> blocks).
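
The dual reasoning mode takes two request kwargs, `enable_thinking` and `reasoning_effort`. A caller might validate them before rendering a prompt, roughly as below. The helper `reasoning_kwargs` is hypothetical and is not the encoder's own API; `None` here stands for the `null` value listed above.

```python
from typing import Optional

# Allowed values per the card: max, high, or null (None in Python).
ALLOWED_EFFORT = ("max", "high", None)


def reasoning_kwargs(enable_thinking: bool, reasoning_effort: Optional[str] = None) -> dict:
    """Validate and package the two reasoning kwargs (hypothetical helper)."""
    if reasoning_effort not in ALLOWED_EFFORT:
        raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}")
    return {"enable_thinking": enable_thinking, "reasoning_effort": reasoning_effort}
```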

Quantization Recipe

Category	Bits	Codec	Notes
Routed experts (default, 38 layers)	2	MXTQ	L0–L22, L24, L26, L27, L29–L33, L35, L37–L42
Routed experts (5 lifted layers)	4	MXTQ	L23, L25, L28, L34, L36 (top-5 most-damaged-at-2-bit per real-activation probe)
Attention (`wq_a/wq_b/wkv/wo_a/wo_b`)	8	affine, gsz=32
Shared expert	8	affine, gsz=32
Compressor + Indexer + Indexer.Compressor	8	affine, gsz=32
`embed_tokens` + `lm_head`	8	affine, gsz=32
Norms / router gate / `hc_*` fn matrices	16	passthrough
`hc__base`, `hc__scale`, `attn_sink`, `ape`	32	source-f32	"F32" critical controls preserved
MTP head	—	dropped	`drop_mtp=true`; `num_nextn_predict_layers=0`
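
The routed-expert bit plan in the table can be restated as a short lookup, which also confirms the 38 + 5 layer split. This is only an illustration of the table; the names `LIFTED_LAYERS` and `routed_expert_bits` are hypothetical and do not come from the converter.

```python
NUM_LAYERS = 43  # decoder layers 0..42
# Layers lifted to 4-bit per the recipe table.
LIFTED_LAYERS = frozenset({23, 25, 28, 34, 36})


def routed_expert_bits(layer_idx: int) -> int:
    """Return the MXTQ bit width for the routed experts of one layer."""
    if not 0 <= layer_idx < NUM_LAYERS:
        raise ValueError(f"layer index {layer_idx} outside 0..{NUM_LAYERS - 1}")
    return 4 if layer_idx in LIFTED_LAYERS else 2


bit_plan = {i: routed_expert_bits(i) for i in range(NUM_LAYERS)}
two_bit_layers = [i for i, bits in bit_plan.items() if bits == 2]
```

Here `two_bit_layers` holds the 38 default layers and the remaining 5 are the lifted ones.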

JANGTQ Runtime Sidecar

jangtq_runtime.safetensors ships the Swift JANGTQ MXTQ runtime contract:

codebook.2048.2   codebook.2048.4
codebook.4096.2   codebook.4096.4
signs.2048.42     signs.4096.42

The Swift loader will hard-error without this sidecar.
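
A pre-flight check for the sidecar could compare the tensor names it contains against the six listed above. The sketch below assumes the names in the listing are the exact tensor keys; `missing_sidecar_keys` is a hypothetical helper, not part of either loader.

```python
from typing import Iterable, List

# Tensor names as listed in this card; assumed here to be the exact sidecar keys.
REQUIRED_SIDECAR_KEYS = (
    "codebook.2048.2", "codebook.2048.4",
    "codebook.4096.2", "codebook.4096.4",
    "signs.2048.42", "signs.4096.42",
)


def missing_sidecar_keys(present: Iterable[str]) -> List[str]:
    """Return the required sidecar keys that are absent from `present`."""
    have = set(present)
    return [key for key in REQUIRED_SIDECAR_KEYS if key not in have]
```

The key names can be read from the file with `safetensors.safe_open("jangtq_runtime.safetensors", framework="numpy")` and its `keys()` method, then passed to this function; an empty result means all six are present.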

Live Quality Boundary

The kept JANGTQ-K candidate passed the relaxed quality-chat bar:

4-turn Chat Completions transcript with reasoning_effort=max.
Visible answers correct for arithmetic and memory recall (remembered HARBOR-17 and CERULEAN).
Final answer followed the requested 3-bullet format.
No loop, no reasoning-only empty visible answer, no API error.
Speed: 14.6 – 17.6 tok/s on M3 Ultra Mac Studio.
Paged DSV4 cache hit on turns 2–4 with 399 cached tokens saved.
Block-L2 wrote DSV4 composite blocks.

NOT strict exact-copy cleared. Short exact-marker rows can flip to neighboring BPE tokens; this reproduces with prefix-cache bypass and is not a UI, gateway, streaming, or disk-cache bug.

Raw max (VMLINUX_DSV4_RAW_MAX=1) has only been through a 2-turn smoke test and is not advertised as fully cleared.

Chat Template

This bundle does not ship a standalone chat_template.jinja. Prompt rendering routes through the bundled DSV4 chat encoder (encoding/encoding_dsv4.py), driven by the DSV4 chat encoder shim installed by vmlx_engine.loaders.load_jangtq_dsv4.load_jangtq_dsv4_model. The shim preserves enable_thinking and reasoning_effort kwargs.

Sampling Defaults

temperature=0.6, top_p=0.95
repetition_penalty=1.0 in thinking mode, 1.05 in chat mode
max_new_tokens=4096

generation_config.json ships the HF defaults (temperature=1.0, top_p=1.0, do_sample=true, EOS=[1, 128803]); the JANG chat sampling defaults above are the ones vMLX's panel uses.
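
A client that does not go through vMLX's panel would need to pass the JANG defaults itself. The sketch below assumes the repetition penalty is mode-specific (1.0 for thinking, 1.05 for chat); `sampling_defaults` is a hypothetical helper, and the parameter names simply mirror the list above.

```python
def sampling_defaults(thinking: bool) -> dict:
    """Return the JANG chat sampling defaults for the given mode (hypothetical helper)."""
    return {
        "temperature": 0.6,
        "top_p": 0.95,
        # Assumption: 1.0 applies in thinking mode and 1.05 in chat mode.
        "repetition_penalty": 1.0 if thinking else 1.05,
        "max_new_tokens": 4096,
    }
```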

Runtime Smoke Tests

Before production use, run short deterministic prompts through the exact target runtime:

What is 2+2? Answer with only the number.
What is the capital of France? Answer with one word.
One chat-template prompt with thinking disabled.
One chat-template prompt with reasoning_effort=max and enough output budget for the final answer.
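
The two fixed-answer prompts above lend themselves to a tiny harness. This sketch assumes a caller-supplied `generate(prompt) -> str` function that wraps the target runtime and returns only the visible answer; that callable and the name `run_smoke` are hypothetical.

```python
from typing import Callable, List, Tuple

# The two deterministic prompts from this card with their expected answers.
SMOKE_PROMPTS: List[Tuple[str, str]] = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]


def run_smoke(generate: Callable[[str], str]) -> List[str]:
    """Run both prompts through `generate`; return failure descriptions (empty if all pass)."""
    failures = []
    for prompt, expected in SMOKE_PROMPTS:
        # Tolerate surrounding whitespace, a trailing period, and letter case.
        answer = generate(prompt).strip().rstrip(".")
        if answer.lower() != expected.lower():
            failures.append(f"{prompt!r}: expected {expected!r}, got {answer!r}")
    return failures
```

The two chat-template prompts are not covered here, since they need the DSV4 chat encoder and a judgment about the reasoning output.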

Lineage

Derived from converter variant V3 with plan routed_only_worst5_23_25_28_34_36 (sha256 3db0b31fe6f1b19d3e00cfdd15572ebf3af950ef25e1e8622e1f2791b1977619). Originally staged as DeepSeek-V4-Flash-JANGTQ-V3-WORST5-F32; renamed to JANGTQ-K on 2026-05-11 as the canonical DSV4 max-quality tier.

Files

config.json — DSV4 HF config (model_type=deepseek_v4, expert_dtype=fp4).
jang_config.json — JANG profile, recipe, bit plan, runtime requirements, chat encoder, sampling defaults, lineage.
model-00001-of-00080.safetensors … model-00080-of-00080.safetensors — sharded weights.
model.safetensors.index.json — tensor → shard map (2610 keys).
jangtq_runtime.safetensors — JANGTQ MXTQ runtime sidecar.
tokenizer.json, tokenizer_config.json — preserved upstream tokenizer (model_max_length=1048576).
generation_config.json — HF defaults.
encoding/encoding_dsv4.py — DSV4 chat encoder (Python).
LICENSE — MIT, upstream.
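
After downloading, the shard names and key count listed above can be checked against the index file. The sketch below relies on the standard `weight_map` field of a Hugging Face sharded-checkpoint index; the function names are hypothetical.

```python
from typing import Dict, List

NUM_SHARDS = 80
EXPECTED_KEYS = 2610


def expected_shard_names(num_shards: int = NUM_SHARDS) -> List[str]:
    """Build the shard file names model-00001-of-00080.safetensors and so on."""
    return [
        f"model-{i:05d}-of-{num_shards:05d}.safetensors"
        for i in range(1, num_shards + 1)
    ]


def check_index(index: Dict) -> List[str]:
    """Check a parsed model.safetensors.index.json; return a list of problems."""
    problems = []
    weight_map = index.get("weight_map", {})
    if len(weight_map) != EXPECTED_KEYS:
        problems.append(f"expected {EXPECTED_KEYS} tensor keys, found {len(weight_map)}")
    unknown = sorted(set(weight_map.values()) - set(expected_shard_names()))
    if unknown:
        problems.append(f"unexpected shard names: {unknown}")
    return problems
```

The index can be loaded with `json.load` and passed in directly; an empty list means the key count and shard names match this card.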

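Summary

This bundle is a JANGTQ-K-profile quantization of deepseek-ai/DeepSeek-V4-Flash for Apple Silicon MLX / vMLX-Swift runtimes. It should be used only on runtimes that correctly implement DSV4's composite SWA + CSA/HCA cache, the Sparse Indexer + Compressor, and the mHC controls. Routed experts default to 2-bit, with five layers (23/25/28/34/36) kept at 4-bit; non-routed weights use 8-bit affine, and the critical control tensors are preserved in source float32. The MTP head is removed for inference.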

Contact

eric@osaurus.ai
