
DeepSeek-V4-Flash-JANGTQ-K
JANGTQ-K quantization of deepseek-ai/DeepSeek-V4-Flash for Apple Silicon MLX / vMLX-Swift runtimes.
| Field | Value |
|---|---|
| Source | deepseek-ai/DeepSeek-V4-Flash |
| License | MIT, inherited from upstream |
| Format | JANGTQ (MXTQ routed + affine non-routed) |
| Profile | JANGTQ_K |
| Bundle size | ~80 GiB (85.9 GB) across 80 shards |
| Tensor keys | 2610 |
| Routed expert layout | Pre-stacked switch_mlp |
| Context length | 1,048,576 (1M) |
| Modality | text |
| Native cache schema | deepseek_v4_v7 |
| Runtime status | Coherent on the relaxed 4-turn reasoning_effort=max quality-chat gate; NOT strict exact-copy cleared. See "Live Quality Boundary" below. |
Important Runtime Note
This bundle requires a DSV4-aware runtime that implements:
- DSV4 composite SWA + CSA/HCA cache (schema `deepseek_v4_v7`).
- Mixed cache layout: layers 0,1 = `KVCache`, layers 2..42 = `DeepseekV4Cache`.
- Sparse Indexer + Compressor with per-layer compression ratios.
- mHC (multi-Head Composition) controls preserved as float32.
- DSML tool-calling parser and the DSV4 chat encoder shim.
DSV4 is not a stock mlx_lm architecture.
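The mixed cache layout above can be written down as a small lookup. This is only an illustrative sketch: `cache_kind` and `NUM_LAYERS` are hypothetical names, the strings are the cache class names as this card spells them, and the layer count of 43 is the figure this card gives for the decoder stack. It does not reproduce how any vMLX runtime actually allocates caches.

```python
NUM_LAYERS = 43  # decoder layer count stated in this card


def cache_kind(layer_idx: int) -> str:
    """Return the cache class name this card lists for a decoder layer."""
    if not 0 <= layer_idx < NUM_LAYERS:
        raise ValueError(f"layer index {layer_idx} is outside 0..{NUM_LAYERS - 1}")
    # Layers 0 and 1 use a plain KV cache; layers 2..42 use the DSV4 composite cache.
    return "KVCache" if layer_idx < 2 else "DeepseekV4Cache"


layout = [cache_kind(i) for i in range(NUM_LAYERS)]
print(layout.count("KVCache"), layout.count("DeepseekV4Cache"))  # 2 41
```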
Runtime Pin Required
Use a vMLX runtime build with DSV4 JANGTQ support:
- Python: vMLX `7cc4927c` or newer (composite `deepseek_v4_v7`, generic TQ-KV off, paged block 256, single-sequence forced).
- Swift: `vmlx-swift-lm` build that includes the `DeepseekV4JANGTQ` model + `DSV4LayerCache` pool accumulation.
Recommended runtime defaults baked into this bundle's jang_config.json:
- `DSV4_LONG_CTX=1`
- `DSV4_POOL_QUANT=0` (opt-in only)
- Paged cache block size = 256 (the loader upgrades stale 64-token settings).
- Block disk L2 = on.
- Chunked prefill = off by default.
Do not re-enable generic KV quantization, 64-token DSV4 pages, or pool quant for this bundle.
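A launcher script could guard against the three overrides warned about above before starting the runtime. The sketch below is an assumption-heavy illustration: the dictionary keys `generic_kv_quant` and `paged_block_size` are hypothetical names, not the actual schema of `jang_config.json`, and treating any block size other than 256 as a problem is slightly stricter than the card, which only rules out 64-token pages.

```python
def forbidden_overrides(settings: dict) -> list[str]:
    """List settings that contradict this card's runtime guidance.

    Keys are hypothetical; a missing key is treated as the recommended default.
    """
    problems = []
    if settings.get("generic_kv_quant", False):
        problems.append("generic KV quantization is enabled")
    block = settings.get("paged_block_size", 256)
    if block != 256:
        problems.append(f"paged block size is {block}, expected 256")
    if settings.get("DSV4_POOL_QUANT", 0) != 0:
        problems.append("DSV4_POOL_QUANT is set; pool quant should stay off")
    return problems


print(forbidden_overrides({"paged_block_size": 64}))
```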
Architecture Summary
- 43 decoder layers + sparse indexer + compressor (no MTP head).
- Hidden size 4096, 64 attention heads, 64 indexer heads, head dim 512, index head dim 128, indexer top-k 512.
- MLA with `q_lora_rank=1024`, `o_lora_rank=1024`, `qk_rope_head_dim=64`.
- 256 routed experts + 1 shared expert per MoE layer, top-6 routing, `routed_scaling_factor=1.5`, `topk_method=noaux_tc`.
- 3 hash layers at L0/L1/L2.
- Sliding window 128, `rope_theta=10000`, indexer `compress_rope_theta=160000`.
- HC (multi-Head Composition): `hc_mult=4`, `hc_sinkhorn_iters=20`, `hc_eps=1e-6`.
- SwiGLU clip `swiglu_limit=10.0`.
- Context length 1,048,576 with YaRN scaling.
- Reasoning: dual mode (chat / thinking) with `enable_thinking` + `reasoning_effort` ∈ {`max`, `high`, `null`}.
- Tool calling: DSML parser (`<|DSML|>` blocks).
Quantization Recipe
| Category | Bits | Codec | Notes |
|---|---|---|---|
| Routed experts (default, 38 layers) | 2 | MXTQ | L0–L22, L24, L26, L27, L29–L33, L35, L37–L42 |
| Routed experts (5 lifted layers) | 4 | MXTQ | L23, L25, L28, L34, L36 (top-5 most-damaged-at-2-bit per real-activation probe) |
| Attention (`wq_a`/`wq_b`/`wkv`/`wo_a`/`wo_b`) | 8 | affine, gsz=32 | |
| Shared expert | 8 | affine, gsz=32 | |
| Compressor + Indexer + Indexer.Compressor | 8 | affine, gsz=32 | |
| `embed_tokens` + `lm_head` | 8 | affine, gsz=32 | |
| Norms / router gate / `hc_*` fn matrices | 16 | passthrough | |
| `hc_*_base`, `hc_*_scale`, `attn_sink`, `ape` | 32 | source-f32 | "F32" critical controls preserved |
| MTP head | n/a | dropped | `drop_mtp=true`; `num_nextn_predict_layers=0` |
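The routed-expert rows of the recipe reduce to a per-layer bit lookup, which also makes it easy to confirm that the 2-bit and 4-bit groups together cover all 43 decoder layers. This is a minimal sketch; `routed_expert_bits` and the constant names are hypothetical and are not taken from the converter.

```python
NUM_LAYERS = 43
LIFTED_LAYERS = frozenset({23, 25, 28, 34, 36})  # kept at 4-bit per the table


def routed_expert_bits(layer_idx: int) -> int:
    """Return the MXTQ bit width of the routed experts in a given layer."""
    if not 0 <= layer_idx < NUM_LAYERS:
        raise ValueError(f"layer index {layer_idx} is outside 0..{NUM_LAYERS - 1}")
    return 4 if layer_idx in LIFTED_LAYERS else 2


plan = {i: routed_expert_bits(i) for i in range(NUM_LAYERS)}
two_bit = [i for i, bits in plan.items() if bits == 2]
four_bit = [i for i, bits in plan.items() if bits == 4]
print(len(two_bit), len(four_bit))  # 38 5
```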
JANGTQ Runtime Sidecar
jangtq_runtime.safetensors ships the Swift JANGTQ MXTQ runtime contract:
codebook.2048.2 codebook.2048.4
codebook.4096.2 codebook.4096.4
signs.2048.42 signs.4096.42
The Swift loader will hard-error without this sidecar.
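Because a missing sidecar is a hard error, a pre-flight check on its key names can fail earlier and with a clearer message. The sketch below only compares names against the six keys listed above; `missing_sidecar_keys` is a hypothetical helper, and it does not validate tensor shapes or contents.

```python
EXPECTED_SIDECAR_KEYS = frozenset({
    "codebook.2048.2", "codebook.2048.4",
    "codebook.4096.2", "codebook.4096.4",
    "signs.2048.42", "signs.4096.42",
})


def missing_sidecar_keys(present_keys) -> list[str]:
    """Return the expected sidecar keys absent from `present_keys`, sorted."""
    return sorted(EXPECTED_SIDECAR_KEYS - set(present_keys))


print(missing_sidecar_keys(["codebook.2048.2", "codebook.2048.4"]))
```

In practice the key names could be read from `jangtq_runtime.safetensors` with the `safetensors` library's `safe_open(...).keys()` and passed to this function.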
Live Quality Boundary
The kept JANGTQ-K candidate passed the relaxed quality-chat bar:
- 4-turn Chat Completions transcript with `reasoning_effort=max`.
- Visible answers correct for arithmetic and memory recall (remembered `HARBOR-17` and `CERULEAN`).
- Final answer followed the requested 3-bullet format.
- No loop, no reasoning-only empty visible answer, no API error.
- Speed: 14.6 – 17.6 tok/s on M3 Ultra Mac Studio.
- Paged DSV4 cache hit on turns 2–4 with 399 cached tokens saved.
- Block-L2 wrote DSV4 composite blocks.
NOT strict exact-copy cleared. Short exact-marker rows can flip to neighboring BPE tokens; this reproduces with prefix-cache bypass and is not a UI, gateway, streaming, or disk-cache bug.
Raw max (VMLINUX_DSV4_RAW_MAX=1) had only a 2-turn smoke and is not advertised as fully cleared.
Chat Template
This bundle does not ship a standalone chat_template.jinja. Prompt rendering routes through the bundled DSV4 chat encoder (encoding/encoding_dsv4.py), driven by the DSV4 chat encoder shim installed by vmlx_engine.loaders.load_jangtq_dsv4.load_jangtq_dsv4_model. The shim preserves enable_thinking and reasoning_effort kwargs.
Sampling Defaults
temperature=0.6, top_p=0.95
repetition_penalty=1.0 (1.0 thinking, 1.05 chat)
max_new_tokens=4096
generation_config.json ships HF-defaults (temperature=1.0, top_p=1.0, do_sample=true, EOS=[1, 128803]); the JANG chat sampling defaults above are the ones used by vMLX's panel.
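Since `generation_config.json` carries the HF defaults rather than the JANG ones, a client that wants the JANG chat sampling defaults has to set them explicitly. A minimal sketch, assuming the per-mode repetition penalties listed above; the function name and the dictionary keys are illustrative and may not match the parameter names a given runtime expects.

```python
def jang_sampling_defaults(thinking: bool) -> dict:
    """Return the JANG chat sampling defaults from this card for one mode."""
    return {
        "temperature": 0.6,
        "top_p": 0.95,
        # 1.0 in thinking mode, 1.05 in plain chat mode, per the card.
        "repetition_penalty": 1.0 if thinking else 1.05,
        "max_new_tokens": 4096,
    }


print(jang_sampling_defaults(thinking=False))
```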
Runtime Smoke Tests
Before production use, run short deterministic prompts through the exact target runtime:
- `What is 2+2? Answer with only the number.`
- `What is the capital of France? Answer with one word.`
- One chat-template prompt with thinking disabled.
- One chat-template prompt with `reasoning_effort=max` and enough output budget for the final answer.
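The two deterministic prompts can be wrapped in a tiny harness. The sketch below assumes a caller-supplied `generate` callable that takes a prompt string and returns the visible answer text; that callable, `SMOKE_CASES`, and `run_smoke` are hypothetical and stand in for whatever API the target runtime exposes. The comparison is deliberately lenient about surrounding whitespace, a trailing period, and letter case.

```python
from typing import Callable

SMOKE_CASES = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]


def normalize(text: str) -> str:
    """Strip whitespace and a trailing period, then case-fold."""
    return text.strip().rstrip(".").casefold()


def run_smoke(generate: Callable[[str], str]) -> list[str]:
    """Run each smoke prompt through `generate` and describe any mismatches."""
    failures = []
    for prompt, expected in SMOKE_CASES:
        answer = generate(prompt)
        if normalize(answer) != normalize(expected):
            failures.append(f"{prompt!r}: expected {expected!r}, got {answer!r}")
    return failures
```

An empty list from `run_smoke` means both prompts matched; the two chat-template prompts still need a manual read of the output.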
Lineage
Derived from converter variant V3 with plan routed_only_worst5_23_25_28_34_36 (sha256 3db0b31fe6f1b19d3e00cfdd15572ebf3af950ef25e1e8622e1f2791b1977619). Originally staged as DeepSeek-V4-Flash-JANGTQ-V3-WORST5-F32; renamed to JANGTQ-K on 2026-05-11 as the canonical DSV4 max-quality tier.
Files
- `config.json`: DSV4 HF config (`model_type=deepseek_v4`, `expert_dtype=fp4`).
- `jang_config.json`: JANG profile, recipe, bit plan, runtime requirements, chat encoder, sampling defaults, lineage.
- `model-00001-of-00080.safetensors` … `model-00080-of-00080.safetensors`: sharded weights.
- `model.safetensors.index.json`: tensor → shard map (2610 keys).
- `jangtq_runtime.safetensors`: JANGTQ MXTQ runtime sidecar.
- `tokenizer.json`, `tokenizer_config.json`: preserved upstream tokenizer (`model_max_length=1048576`).
- `generation_config.json`: HF defaults.
- `encoding/encoding_dsv4.py`: DSV4 chat encoder (Python).
- `LICENSE`: MIT, upstream.
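After a large download it can help to confirm that every file named above is present before loading. This is a rough sketch using only the file names given in this card; `expected_bundle_files` and `missing_files` are hypothetical helpers, and the check covers file presence only, not sizes or checksums.

```python
from pathlib import Path

NUM_SHARDS = 80

FIXED_FILES = [
    "config.json",
    "jang_config.json",
    "model.safetensors.index.json",
    "jangtq_runtime.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
    "generation_config.json",
    "encoding/encoding_dsv4.py",
    "LICENSE",
]


def expected_bundle_files() -> list[str]:
    """Return the 80 shard names followed by the fixed files from this card."""
    shards = [
        f"model-{i:05d}-of-{NUM_SHARDS:05d}.safetensors"
        for i in range(1, NUM_SHARDS + 1)
    ]
    return shards + FIXED_FILES


def missing_files(bundle_dir: str) -> list[str]:
    """Return the expected files that do not exist under `bundle_dir`."""
    root = Path(bundle_dir)
    return [name for name in expected_bundle_files() if not (root / name).is_file()]
```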
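Summary (translated from Korean)
This bundle is a quantization of deepseek-ai/DeepSeek-V4-Flash with the JANGTQ-K profile for Apple Silicon MLX / vMLX-Swift runtimes. It must only be used with a runtime that correctly implements DSV4's composite SWA + CSA/HCA cache, Sparse Indexer + Compressor, and mHC controls. Routed experts default to 2-bit, with five layers (23/25/28/34/36) kept at 4-bit; non-routed weights use 8-bit affine, and critical control tensors are preserved in source float32. The MTP head has been removed for inference.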