DeepSeek-V4-Flash-JANGTQ2
DeepSeek-V4-Flash — 79.6 GB on disk (down from 149 GB FP4+FP8 source) — uniform 2-bit JANGTQ quantization on routed experts + 8-bit affine on everything else + preserved MTP head.
- Source: deepseek-ai/DeepSeek-V4-Flash (43 transformer layers + 1 MTP head, 256 routed experts top-6 + 1 shared expert, 3 hash layers, MLA + mHC residuals, ~284 B total)
- Quantization: uniform 2-bit MXTQ on routed-expert MLP + 8-bit affine on attention (`wq_a`/`wq_b`/`wkv`/`wo_a`/`wo_b`) / shared expert / Compressor / Indexer / embed / lm_head / MTP. RMSNorms, router gate, mHC fn matrices, attn_sink, ape stay fp16/fp32 passthrough.
- Variant: `std` (preserves MTP layer 43; one-token-per-forward until a JANG runtime ships the accept/reject speculative-decode loop). The companion `DeepSeek-V4-Flash-JANGTQ-K` variant drops MTP for a smaller bundle.
- Routed-expert layout: pre-stacked along axis 0 under `ffn.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per the JANGTQ-PRESTACK STANDARD. Sidecar `jangtq_runtime.safetensors` (~24 KB) ships both `(in=2048, bits=2)` and `(in=4096, bits=2)` codebooks + sign-flip vectors for Swift runtimes.
- Bundle size: ~79.6 GB on-disk
- Runs on: M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
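As a rough sanity check on the figures above, the bundle size and parameter count imply an average storage cost per parameter. The sketch below assumes 1 GB = 10^9 bytes and takes the approximate ~284 B parameter count at face value, so the results are estimates only; the function name is illustrative.

```python
def bits_per_param(size_gb: float, params_billion: float) -> float:
    """Average stored bits per parameter, assuming 1 GB = 1e9 bytes."""
    return (size_gb * 1e9 * 8) / (params_billion * 1e9)


quantized = bits_per_param(79.6, 284)  # this bundle
source = bits_per_param(149, 284)      # FP4+FP8 source checkpoint

print(f"JANGTQ2 bundle: {quantized:.2f} bits/param")
print(f"source:         {source:.2f} bits/param")
print(f"size ratio:     {149 / 79.6:.2f}x")
```

The bundle average lands somewhat above 2 bits, which is consistent with the attention, shared-expert, embedding, and MTP tensors being stored at 8 bits and the passthrough tensors at fp16/fp32.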
Why top-6 + 2-bit holds
DSV4-Flash routes through 6 of 256 experts per token plus 1 always-on shared expert and 3 hash layers — so per-token output averages codebook noise across 7+ pathways. That's a much weaker quality constraint than top-1 architectures (where every token rides a single expert's quant error). MiniMax (top-2) and Hy3-preview (top-8) both ship coherent uniform JANGTQ2; DSV4 sits between them.
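The averaging argument can be made concrete under a simplifying assumption: if each pathway contributes independent, zero-mean quantization error of equal variance, and the outputs are combined with weights w_i, the standard deviation of the combined error scales by sqrt(sum w_i^2) / sum w_i. The sketch below computes that factor. Independence and equal variance are idealizations, and the skewed weights are hypothetical, so this illustrates the trend rather than measuring DSV4.

```python
import math


def noise_scale(weights):
    """Std-dev factor of a weighted average of independent, equal-variance errors."""
    total = sum(weights)
    return math.sqrt(sum(w * w for w in weights)) / total


top1 = noise_scale([1.0])        # a single expert carries all of the error
top6 = noise_scale([1.0] * 6)    # six equally weighted routed experts
skewed = noise_scale([0.5, 0.2, 0.1, 0.1, 0.05, 0.05])  # hypothetical router weights

print(f"top-1: {top1:.3f}  top-6 uniform: {top6:.3f}  top-6 skewed: {skewed:.3f}")
```

Uniform top-6 routing gives the best case of 1/sqrt(6); uneven router weights reduce the benefit, but the factor stays below the top-1 value of 1.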
Loading (Python)
```bash
pip install jang-tools mlx-lm
```

```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ2")

# Format a single user turn with the model's chat template.
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
)
```
`load_jangtq_model` auto-registers `model_type=deepseek_v4` via
`jang_tools.dsv4` before building the MLX skeleton. The loader applies
the DSV4-specific MLA absorb + fp32 SDPA + mHC + Compressor + Indexer
patches automatically.
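To go from the loaded model to generated text, one option is to pass the formatted prompt to `mlx_lm.generate`. This is a sketch that assumes the model and tokenizer returned by `load_jangtq_model` are compatible with the standard `mlx_lm` generation API, which this card does not state explicitly. The helper names `build_messages` and `answer` and the `max_tokens` default are illustrative.

```python
MODEL_ID = "OsaurusAI/DeepSeek-V4-Flash-JANGTQ2"


def build_messages(question: str) -> list[dict]:
    """Wrap a single user turn in the chat-message format."""
    return [{"role": "user", "content": question}]


def answer(question: str, max_tokens: int = 128) -> str:
    """Load the model, format the prompt, and generate a reply."""
    # Imports are local so build_messages can be used without MLX installed.
    from jang_tools.load_jangtq import load_jangtq_model
    from mlx_lm import generate

    model, tokenizer = load_jangtq_model(MODEL_ID)
    prompt = tokenizer.apply_chat_template(
        build_messages(question),
        tokenize=False,
        add_generation_prompt=True,
    )
    # Assumption: the JANGTQ-loaded model works with mlx_lm's generate().
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
```

Calling `answer("What is 2 + 2? Answer briefly.")` loads the full ~79.6 GB bundle, so in practice the model would be loaded once and reused across prompts.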
Runtime support matrix
| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working |
| `vmlx-swift-lm` Swift | ✅ working (`DeepseekV4JANGTQ` family path) |
| MTP speculative decode | preserved-disabled: weights present (`variant=std`); accept/reject loop not yet in any JANG runtime |
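
For context on what the missing accept/reject loop would do, here is a generic greedy verification step for speculative decoding. It is not the JANG design: the function name and token values are hypothetical, and it assumes the main model's greedy predictions for every drafted position are already available from a single verification pass.

```python
def accept_draft(draft: list[int], target: list[int]) -> list[int]:
    """Greedy speculative-decode verification.

    draft:  tokens proposed by the draft head (for example an MTP head).
    target: the main model's greedy token at each drafted position, plus
            one extra prediction for the position after the last draft.
    Returns the tokens to emit: the matching draft prefix, followed by the
    main model's own token at the first mismatch (or the bonus token).
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")
    accepted = []
    for proposed, verified in zip(draft, target):
        if proposed != verified:
            break
        accepted.append(proposed)
    # The main model's token at this position is always safe to emit.
    accepted.append(target[len(accepted)])
    return accepted


print(accept_draft([11, 42, 7], [11, 42, 9, 3]))  # two drafts accepted, third corrected
print(accept_draft([11, 42, 7], [11, 42, 7, 3]))  # all accepted, plus one bonus token
```

With a single MTP head the draft would hold one token per step, so each forward pass would emit either one or two tokens under this scheme.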
Validated runtime contract
- 43 transformer layers + 1 MTP head materialize; 40 sparse-MoE layers hydrate routed experts via TurboQuantLinear (2-bit MXTQ).
- 33,792 MXTQ tensors / 522 affine / 706 passthrough.
- Capabilities: `family=deepseek_v4`, `reasoning_parser=deepseek_r1`, `tool_parser=dsml`, `think_in_template=True`, `cache_type=mla`.
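
To give a feel for what 2-bit storage means, the sketch below packs four 2-bit indices into each byte and decodes them through a four-entry codebook. This is a generic codebook-quantization illustration only: the actual MXTQ format, its codebook values, and its sign-flip handling are not described in this card, and the codebook below is made up.

```python
CODEBOOK = [-1.5, -0.5, 0.5, 1.5]  # hypothetical 4-entry codebook (2 bits -> 4 levels)


def quantize(values):
    """Map each value to the index of its nearest codebook entry."""
    return [min(range(4), key=lambda i: abs(CODEBOOK[i] - v)) for v in values]


def pack(indices):
    """Pack 2-bit indices, four per byte, lowest bits first."""
    if len(indices) % 4:
        raise ValueError("length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(indices), 4):
        a, b, c, d = indices[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def unpack(packed):
    """Recover the 2-bit indices from packed bytes."""
    return [(byte >> shift) & 0b11 for byte in packed for shift in (0, 2, 4, 6)]


def dequantize(packed):
    """Decode packed indices back to codebook values."""
    return [CODEBOOK[i] for i in unpack(packed)]


weights = [0.9, -1.2, 0.1, -0.4, 1.7, 0.6, -2.0, -0.1]
packed = pack(quantize(weights))
print(len(packed), "bytes for", len(weights), "weights")
print(dequantize(packed))
```

Eight weights fit in two bytes, and each decoded value is one of only four levels, which is the per-expert error that the top-6 routing discussion is concerned with.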
Reasoning + tools
- Reasoning parser: `deepseek_r1`
- Tool parser: `dsml` (DeepSeek Markup Language, distinct from `deepseek_tool_parser`; see `~/jang/research/DSV4-EVAL-NUANCES.md`)
- Reasoning template: `<|thinking_begin|>...<|thinking_end|>` blocks via `enable_thinking=True` (default off, which gives pass-through chat mode). Greedy `T=0` with `enable_thinking=True` collapses into repetition on DSV4; use `T=0.6` for pass@1 like the original DeepSeek release.
- Cache: `mla` (Multi-head Latent Attention with `kv_lora_rank=512`)
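
The sampling guidance and the thinking-block format above can be sketched as two small helpers. Both are illustrative: `check_sampling` only encodes the `T=0` warning from this card, and `split_reasoning` is a plain regex over the `<|thinking_begin|>` and `<|thinking_end|>` tags, not the actual `deepseek_r1` parser.

```python
import re

THINK_RE = re.compile(r"<\|thinking_begin\|>(.*?)<\|thinking_end\|>", re.DOTALL)


def check_sampling(enable_thinking: bool, temperature: float) -> float:
    """Reject greedy decoding when thinking mode is on."""
    if enable_thinking and temperature == 0:
        raise ValueError("T=0 with enable_thinking=True tends to loop; try T=0.6")
    return temperature


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no block is present."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<|thinking_begin|>2 + 2 is 4.<|thinking_end|>4"
print(split_reasoning(sample))
```

With `mlx_lm`, the checked temperature would typically be applied through a sampler such as `make_sampler(temp=0.6)` from `mlx_lm.sample_utils`, assuming a recent `mlx-lm` release.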
Credits
- Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
- Source model: DeepSeek AI
- License: MIT, inherited from upstream