DeepSeek-V4-Flash-Base INT4

A real INT4 packed-storage quantization of deepseek-ai/DeepSeek-V4-Flash-Base — a 284 B-parameter Mixture-of-Experts model.

Hero numbers

Metric This release Community Q4_K_M norm
MMLU (5 subjects × 50 q, 5-shot) 88.0 %
WikiText-2 PPL 3.76
WikiText-2 PPL ratio vs FP8 baseline 1.020 1.025 – 1.040
KL mean vs FP8 logits (500-prompt cache) 0.064 typical floor 0.10
KL p95 0.194 typical floor 0.30
Top-1 token agreement 0.922 typical floor 0.90
Disk size (TP = 4, all 4 shards) 156.6 GiB (FP8 base ≈ 283 GiB → 45 % smaller)
Bit-exact reproducibility vs apply_recipe(fp8_base, recipe) 0 / 18,455 params drift on every rank

Per-subject MMLU breakdown:

Subject Score
high_school_macroeconomics 92 %
high_school_us_history 98 %
college_mathematics 90 %
professional_law 68 %
miscellaneous 92 %
mean 88.0 %

This release comfortably beats community Q4_K_M norms on every quality metric, while halving on-disk size relative to the FP8 baseline.


What's in the box

  • 4 TP-sharded safetensors files (model{0..3}-mp4.safetensors), self-contained
  • per_linear_quant.json — per-linear quantization metadata
  • quant_metadata.json — release info, byte counts, compression ratios
  • recipe.yaml — the exact quantization recipe used
  • config.json, tokenizer.json, tokenizer_config.json
  • This README

No separate FP8 base download is required. The shards include every tensor the loader needs.

Format

Mixed-method INT4 with per-linear dispatch. 8,552 linear layers quantized across 43 transformer blocks.

Component Storage
routed_experts.{w1,w2,w3} (L0–L42) INT4 g32 asym (Q4_K_M-style)
attn_compress.* INT4 g32 asym
shared_expert.{w1,w2,w3} INT4 g32 asym
ffn.gate BF16 (with INT4 fakequant noise baked in)
attn.wo_a BF16 (with INT4 fakequant noise baked in)
attn.{q,k,v,wo_b,wkv} FP8 native (unchanged)
Embeddings, norms, mtp head FP8 / BF16 native (unchanged)

Attention QKVO is intentionally kept at native FP8 — it's a small fraction of total parameters but quality-critical, so the bit-savings cost of leaving it native is small while the quality gain is large.

Loading

TP = 4 sharded. Use the harness loader:

from harness.lib.model_adapter import load_model_sharded, load_tokenizer
from harness.lib.int4_overlay_loader import apply_overlay

CKPT = "<path-to-this-snapshot>"
model = load_model_sharded(CKPT, "config-flash-base.json", max_seq_len=512, max_batch_size=1)
tokenizer = load_tokenizer(CKPT)
apply_overlay(model, CKPT, native_int4=False)   # FP8RT mode (default)

Runtime modes

Mode Storage in VRAM Decode tok/s (4 × H100) When to use
FP8RT (default) FP8 (re-encoded at load) 5.22 Default. Same speed as the FP8 baseline. INT4-quality weights with FP8-speed kernels.
native_int4 uint8 packed (~50 % VRAM saved on quantized portion) 2.87 Only if VRAM-bound. Marlin GEMM is currently 1.7 – 2.3 × slower than FP8 GEMM at V4's MoE shapes.

Prefill latencies (1 / 256 / 1024 tokens, ms): FP8 baseline 195 / 650 / 835, FP8RT 197 / 641 / 822, native_int4 330 / 1506 / 1928.

Reproducibility & verification

This release is bit-exact reproducible from the FP8 base + recipe. Verification was done at three levels:

  1. State equality: GPU-side per-tensor fingerprint diff between apply_recipe(fp8_base, recipe.yaml) and this checkpoint loaded via apply_overlay. Result: 0 / 18,455 parameters and 0 / 191 buffers drift, on every TP rank.
  2. Forward equality: tier-0 KL evaluated on a 500-prompt calibration cache — kl 0.064 / kl_p95 0.194 / agreement 0.922.
  3. Roundtrip integrity: this artifact was uploaded to HF, downloaded fresh to a clean machine, and re-evaluated → kl 0.0649 / kl_p95 0.1837 / agreement 0.8906 (matches local within metric noise). MMLU 88.0 % is also from the HF-downloaded copy.

Quantization is deterministic: same recipe + same FP8 base → byte-equal output.

Caveats

  • Format: safetensors (PyTorch / transformers ecosystem), TP = 4 sharded — not GGUF. V4-Flash-Base does not yet have llama.cpp / GGUF support; once it lands upstream, this checkpoint can be repacked.
  • native_int4 mode is slower than FP8 GEMM at V4's expert shapes. The default FP8RT mode sidesteps this — you get INT4 quality with FP8 speed. If you need the VRAM savings of packed storage at runtime, expect ~1.8 × decode latency.

License

This quantization inherits the upstream model license. See the base model card for terms.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for EnsueAI/DeepSeek-V4-Flash-Base-INT4

Finetuned
(1)
this model