# DeepSeek-V4-Flash-Base INT4
A real INT4 packed-storage quantization of deepseek-ai/DeepSeek-V4-Flash-Base — a 284 B-parameter Mixture-of-Experts model.
## Hero numbers
| Metric | This release | Community Q4_K_M norm |
|---|---|---|
| MMLU (5 subjects × 50 q, 5-shot) | 88.0 % | — |
| WikiText-2 PPL | 3.76 | — |
| WikiText-2 PPL ratio vs FP8 baseline | 1.020 | 1.025 – 1.040 |
| KL mean vs FP8 logits (500-prompt cache) | 0.064 | typical floor 0.10 |
| KL p95 | 0.194 | typical floor 0.30 |
| Top-1 token agreement | 0.922 | typical floor 0.90 |
| Disk size (TP = 4, all 4 shards) | 156.6 GiB | (FP8 base ≈ 283 GiB → 45 % smaller) |
| Bit-exact reproducibility vs `apply_recipe(fp8_base, recipe)` | 0 / 18,455 params drift on every rank | — |
Per-subject MMLU breakdown:
| Subject | Score |
|---|---|
| high_school_macroeconomics | 92 % |
| high_school_us_history | 98 % |
| college_mathematics | 90 % |
| professional_law | 68 % |
| miscellaneous | 92 % |
| mean | 88.0 % |
This release comfortably beats community Q4_K_M norms on every quality metric, while cutting on-disk size by roughly 45 % relative to the FP8 baseline.
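For reference, the KL and top-1 agreement rows follow the usual next-token definitions. Below is a minimal sketch, assuming you have cached next-token logits from the FP8 baseline and from this INT4 checkpoint for the same prompt set; the function name and the exact masking/averaging are illustrative, not the harness's tier-0 evaluator.

```python
import torch
import torch.nn.functional as F

def kl_and_agreement(fp8_logits: torch.Tensor, int4_logits: torch.Tensor) -> dict:
    """KL(FP8 || INT4) over next-token distributions plus top-1 agreement.

    Both inputs: [num_positions, vocab_size] logits for the same prompts.
    """
    p = F.log_softmax(fp8_logits.float(), dim=-1)   # reference (FP8 baseline) log-probs
    q = F.log_softmax(int4_logits.float(), dim=-1)  # quantized-model log-probs
    kl = (p.exp() * (p - q)).sum(dim=-1)            # per-position KL divergence
    agree = (fp8_logits.argmax(dim=-1) == int4_logits.argmax(dim=-1)).float()
    return {
        "kl_mean": kl.mean().item(),
        "kl_p95": kl.quantile(0.95).item(),
        "top1_agreement": agree.mean().item(),
    }
```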
## What's in the box
- 4 TP-sharded safetensors files (`model{0..3}-mp4.safetensors`), self-contained
- `per_linear_quant.json` — per-linear quantization metadata
- `quant_metadata.json` — release info, byte counts, compression ratios
- `recipe.yaml` — the exact quantization recipe used
- `config.json`, `tokenizer.json`, `tokenizer_config.json`
- This README
No separate FP8 base download is required. The shards include every tensor the loader needs.
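If you are pulling the snapshot straight from the Hub, a minimal sketch using `huggingface_hub` (the repo id is the one this card is published under; `local_dir` is your choice):

```python
from huggingface_hub import snapshot_download

# Fetches all four TP shards plus the JSON/YAML sidecar files listed above.
ckpt_dir = snapshot_download(
    repo_id="EnsueAI/DeepSeek-V4-Flash-Base-INT4",
    local_dir="DeepSeek-V4-Flash-Base-INT4",
)
```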
## Format
Mixed-method INT4 with per-linear dispatch. 8,552 linear layers quantized across 43 transformer blocks.
| Component | Storage |
|---|---|
| `routed_experts.{w1,w2,w3}` (L0–L42) | INT4 g32 asym (Q4_K_M-style) |
| `attn_compress.*` | INT4 g32 asym |
| `shared_expert.{w1,w2,w3}` | INT4 g32 asym |
| `ffn.gate` | BF16 (with INT4 fakequant noise baked in) |
| `attn.wo_a` | BF16 (with INT4 fakequant noise baked in) |
| `attn.{q,k,v,wo_b,wkv}` | FP8 native (unchanged) |
| Embeddings, norms, MTP head | FP8 / BF16 native (unchanged) |
The core attention projections (`attn.{q,k,v,wo_b,wkv}`) are intentionally kept at native FP8: they account for a small fraction of total parameters but are quality-critical, so leaving them native costs little in size while protecting quality.
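For intuition about the "INT4 g32 asym" storage above, here is a minimal sketch of group-size-32 asymmetric quantization with two 4-bit codes packed per byte. It is illustrative only; the exact rounding, packing layout, and metadata used by `recipe.yaml` may differ.

```python
import torch

def quantize_int4_g32_asym(w: torch.Tensor, group_size: int = 32):
    """Group-wise asymmetric INT4: per-group scale + zero-point, codes in [0, 15]."""
    out_f, in_f = w.shape
    g = w.float().reshape(out_f, in_f // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0
    zero = (-w_min / scale).round().clamp(0, 15)
    q = ((g / scale) + zero).round().clamp(0, 15).to(torch.uint8)
    q = q.reshape(out_f, in_f)
    packed = q[:, 0::2] | (q[:, 1::2] << 4)      # two 4-bit codes per byte, low nibble first
    return packed, scale.squeeze(-1), zero.squeeze(-1)

def dequantize_int4_g32_asym(packed, scale, zero, group_size: int = 32):
    """Unpack the INT4 codes and reconstruct the approximate weights."""
    lo, hi = packed & 0x0F, packed >> 4
    q = torch.stack([lo, hi], dim=-1).reshape(packed.shape[0], -1).float()
    g = q.reshape(q.shape[0], -1, group_size)
    w_hat = (g - zero.unsqueeze(-1)) * scale.unsqueeze(-1)
    return w_hat.reshape(q.shape)
```

Packed storage works out to roughly half a byte per weight plus per-group scale/zero-point overhead, which is where the ~50 % VRAM saving on the quantized portion (see Runtime modes below) comes from.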
## Loading
TP = 4 sharded. Use the harness loader:
```python
from harness.lib.model_adapter import load_model_sharded, load_tokenizer
from harness.lib.int4_overlay_loader import apply_overlay

CKPT = "<path-to-this-snapshot>"

model = load_model_sharded(CKPT, "config-flash-base.json", max_seq_len=512, max_batch_size=1)
tokenizer = load_tokenizer(CKPT)
apply_overlay(model, CKPT, native_int4=False)  # FP8RT mode (default)
```
## Runtime modes
| Mode | Storage in VRAM | Decode tok/s (4 × H100) | When to use |
|---|---|---|---|
| FP8RT (default) | FP8 (re-encoded at load) | 5.22 | Default. Same speed as the FP8 baseline: INT4-quality weights with FP8-speed kernels. |
| `native_int4` | uint8 packed (~50 % VRAM saved on quantized portion) | 2.87 | Only if VRAM-bound. Marlin GEMM is currently 1.7 – 2.3 × slower than FP8 GEMM at V4's MoE shapes. |
Prefill latency (ms):

| Mode | 1 token | 256 tokens | 1024 tokens |
|---|---|---|---|
| FP8 baseline | 195 | 650 | 835 |
| FP8RT | 197 | 641 | 822 |
| `native_int4` | 330 | 1506 | 1928 |
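Switching to `native_int4` mode is a one-line change to the loading example above; everything else stays the same:

```python
# Keep the quantized linears uint8-packed in VRAM instead of re-encoding to FP8 at load.
apply_overlay(model, CKPT, native_int4=True)
```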
## Reproducibility & verification
This release is bit-exact reproducible from the FP8 base + recipe. Verification was done at three levels:
- State equality: GPU-side per-tensor fingerprint diff between `apply_recipe(fp8_base, recipe.yaml)` and this checkpoint loaded via `apply_overlay`. Result: 0 / 18,455 parameters and 0 / 191 buffers drift, on every TP rank (a sketch of this tensor-by-tensor diff appears at the end of this section).
- Forward equality: tier-0 KL evaluated on a 500-prompt calibration cache: kl 0.064 / kl_p95 0.194 / agreement 0.922.
- Roundtrip integrity: this artifact was uploaded to HF, downloaded fresh to a clean machine, and re-evaluated → kl 0.0649 / kl_p95 0.1837 / agreement 0.8906 (matches local within metric noise). MMLU 88.0 % is also from the HF-downloaded copy.
Quantization is deterministic: same recipe + same FP8 base → byte-equal output.
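The state-equality check can be pictured as a per-tensor fingerprint diff. A minimal sketch, assuming SHA-256 over the raw bytes of each tensor; the harness's GPU-side fingerprint routine is not necessarily implemented this way:

```python
import hashlib
import torch

def state_fingerprints(model: torch.nn.Module) -> dict:
    """SHA-256 of the raw bytes of every parameter and buffer, keyed by name."""
    fps = {}
    for name, t in list(model.named_parameters()) + list(model.named_buffers()):
        raw = t.detach().flatten().contiguous().cpu().view(torch.uint8)
        fps[name] = hashlib.sha256(raw.numpy().tobytes()).hexdigest()
    return fps

def fingerprint_diff(a: dict, b: dict) -> list:
    """Names whose bytes differ between two fingerprint maps (expected: empty)."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```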
## Caveats
- Format: safetensors (PyTorch / transformers ecosystem), TP = 4 sharded — not GGUF. V4-Flash-Base does not yet have llama.cpp / GGUF support; once it lands upstream, this checkpoint can be repacked.
- `native_int4` mode is slower than FP8 GEMM at V4's expert shapes. The default FP8RT mode sidesteps this: you get INT4 quality with FP8 speed. If you need the VRAM savings of packed storage at runtime, expect ~1.8 × decode latency.
## License
This quantization inherits the upstream model license. See the base model card for terms.