DeepSeek-V4-Flash-Base INT4

A real INT4 packed-storage quantization of deepseek-ai/DeepSeek-V4-Flash-Base — a 284 B-parameter Mixture-of-Experts model.

Hero numbers

Metric	This release	Community Q4_K_M norm
MMLU (5 subjects × 50 q, 5-shot)	88.0 %	—
WikiText-2 PPL	3.76	—
WikiText-2 PPL ratio vs FP8 baseline	1.020	1.025 – 1.040
KL mean vs FP8 logits (500-prompt cache)	0.064	typical floor 0.10
KL p95	0.194	typical floor 0.30
Top-1 token agreement	0.922	typical floor 0.90
Disk size (TP = 4, all 4 shards)	156.6 GiB	(FP8 base ≈ 283 GiB → 45 % smaller)
Bit-exact reproducibility vs `apply_recipe(fp8_base, recipe)`	0 / 18,455 params drift on every rank	—

Per-subject MMLU breakdown:

Subject	Score
high_school_macroeconomics	92 %
high_school_us_history	98 %
college_mathematics	90 %
professional_law	68 %
miscellaneous	92 %
mean	88.0 %

This release comfortably beats community Q4_K_M norms on every quality metric, while halving on-disk size relative to the FP8 baseline.

What's in the box

4 TP-sharded safetensors files (model{0..3}-mp4.safetensors), self-contained
per_linear_quant.json — per-linear quantization metadata
quant_metadata.json — release info, byte counts, compression ratios
recipe.yaml — the exact quantization recipe used
config.json, tokenizer.json, tokenizer_config.json
This README

No separate FP8 base download is required. The shards include every tensor the loader needs.

Format

Mixed-method INT4 with per-linear dispatch. 8,552 linear layers quantized across 43 transformer blocks.

Component	Storage
`routed_experts.{w1,w2,w3}` (L0–L42)	INT4 g32 asym (Q4_K_M-style)
`attn_compress.*`	INT4 g32 asym
`shared_expert.{w1,w2,w3}`	INT4 g32 asym
`ffn.gate`	BF16 (with INT4 fakequant noise baked in)
`attn.wo_a`	BF16 (with INT4 fakequant noise baked in)
`attn.{q,k,v,wo_b,wkv}`	FP8 native (unchanged)
Embeddings, norms, mtp head	FP8 / BF16 native (unchanged)

Attention QKVO is intentionally kept at native FP8 — it's a small fraction of total parameters but quality-critical, so the bit-savings cost of leaving it native is small while the quality gain is large.

Loading

TP = 4 sharded. Use the harness loader:

from harness.lib.model_adapter import load_model_sharded, load_tokenizer
from harness.lib.int4_overlay_loader import apply_overlay

CKPT = "<path-to-this-snapshot>"
model = load_model_sharded(CKPT, "config-flash-base.json", max_seq_len=512, max_batch_size=1)
tokenizer = load_tokenizer(CKPT)
apply_overlay(model, CKPT, native_int4=False)   # FP8RT mode (default)

Runtime modes

Mode	Storage in VRAM	Decode tok/s (4 × H100)	When to use
FP8RT (default)	FP8 (re-encoded at load)	5.22	Default. Same speed as the FP8 baseline. INT4-quality weights with FP8-speed kernels.
`native_int4`	uint8 packed (~50 % VRAM saved on quantized portion)	2.87	Only if VRAM-bound. Marlin GEMM is currently 1.7 – 2.3 × slower than FP8 GEMM at V4's MoE shapes.

Prefill latencies (1 / 256 / 1024 tokens, ms): FP8 baseline 195 / 650 / 835, FP8RT 197 / 641 / 822, native_int4 330 / 1506 / 1928.

Reproducibility & verification

This release is bit-exact reproducible from the FP8 base + recipe. Verification was done at three levels:

State equality: GPU-side per-tensor fingerprint diff between apply_recipe(fp8_base, recipe.yaml) and this checkpoint loaded via apply_overlay. Result: 0 / 18,455 parameters and 0 / 191 buffers drift, on every TP rank.
Forward equality: tier-0 KL evaluated on a 500-prompt calibration cache — kl 0.064 / kl_p95 0.194 / agreement 0.922.
Roundtrip integrity: this artifact was uploaded to HF, downloaded fresh to a clean machine, and re-evaluated → kl 0.0649 / kl_p95 0.1837 / agreement 0.8906 (matches local within metric noise). MMLU 88.0 % is also from the HF-downloaded copy.

Quantization is deterministic: same recipe + same FP8 base → byte-equal output.

Caveats

Format: safetensors (PyTorch / transformers ecosystem), TP = 4 sharded — not GGUF. V4-Flash-Base does not yet have llama.cpp / GGUF support; once it lands upstream, this checkpoint can be repacked.
native_int4 mode is slower than FP8 GEMM at V4's expert shapes. The default FP8RT mode sidesteps this — you get INT4 quality with FP8 speed. If you need the VRAM savings of packed storage at runtime, expect ~1.8 × decode latency.

License

This quantization inherits the upstream model license. See the base model card for terms.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for EnsueAI/DeepSeek-V4-Flash-Base-INT4

Base model

deepseek-ai/DeepSeek-V4-Flash-Base

Finetuned

(1)

this model