--- license: other license_name: deepseek-v4-flash-base-license base_model: deepseek-ai/DeepSeek-V4-Flash-Base tags: - quantization - int4 - moe - deepseek --- # DeepSeek-V4-Flash-Base INT4 A real INT4 packed-storage quantization of [`deepseek-ai/DeepSeek-V4-Flash-Base`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base) — a 284 B-parameter Mixture-of-Experts model. ## Hero numbers | Metric | This release | Community Q4_K_M norm | |---|---|---| | **MMLU** (5 subjects × 50 q, 5-shot) | **88.0 %** | — | | **WikiText-2 PPL** | 3.76 | — | | **WikiText-2 PPL ratio** vs FP8 baseline | **1.020** | 1.025 – 1.040 | | **KL mean** vs FP8 logits (500-prompt cache) | **0.064** | typical floor 0.10 | | **KL p95** | **0.194** | typical floor 0.30 | | **Top-1 token agreement** | **0.922** | typical floor 0.90 | | **Disk size** (TP = 4, all 4 shards) | **156.6 GiB** | (FP8 base ≈ 283 GiB → 45 % smaller) | | **Bit-exact reproducibility** vs `apply_recipe(fp8_base, recipe)` | 0 / 18,455 params drift on every rank | — | Per-subject MMLU breakdown: | Subject | Score | |---|---| | high_school_macroeconomics | 92 % | | high_school_us_history | 98 % | | college_mathematics | 90 % | | professional_law | 68 % | | miscellaneous | 92 % | | **mean** | **88.0 %** | This release **comfortably beats community Q4_K_M norms** on every quality metric, while halving on-disk size relative to the FP8 baseline. --- ## What's in the box - 4 TP-sharded safetensors files (`model{0..3}-mp4.safetensors`), self-contained - `per_linear_quant.json` — per-linear quantization metadata - `quant_metadata.json` — release info, byte counts, compression ratios - `recipe.yaml` — the exact quantization recipe used - `config.json`, `tokenizer.json`, `tokenizer_config.json` - This README No separate FP8 base download is required. The shards include every tensor the loader needs. ## Format Mixed-method INT4 with per-linear dispatch. **8,552 linear layers** quantized across 43 transformer blocks. | Component | Storage | |---|---| | `routed_experts.{w1,w2,w3}` (L0–L42) | INT4 g32 asym (Q4_K_M-style) | | `attn_compress.*` | INT4 g32 asym | | `shared_expert.{w1,w2,w3}` | INT4 g32 asym | | `ffn.gate` | BF16 (with INT4 fakequant noise baked in) | | `attn.wo_a` | BF16 (with INT4 fakequant noise baked in) | | `attn.{q,k,v,wo_b,wkv}` | FP8 native (unchanged) | | Embeddings, norms, mtp head | FP8 / BF16 native (unchanged) | Attention QKVO is intentionally kept at native FP8 — it's a small fraction of total parameters but quality-critical, so the bit-savings cost of leaving it native is small while the quality gain is large. ## Loading TP = 4 sharded. Use the harness loader: ```python from harness.lib.model_adapter import load_model_sharded, load_tokenizer from harness.lib.int4_overlay_loader import apply_overlay CKPT = "" model = load_model_sharded(CKPT, "config-flash-base.json", max_seq_len=512, max_batch_size=1) tokenizer = load_tokenizer(CKPT) apply_overlay(model, CKPT, native_int4=False) # FP8RT mode (default) ``` ## Runtime modes | Mode | Storage in VRAM | Decode tok/s (4 × H100) | When to use | |---|---|---|---| | **FP8RT (default)** | FP8 (re-encoded at load) | **5.22** | Default. Same speed as the FP8 baseline. INT4-quality weights with FP8-speed kernels. | | `native_int4` | uint8 packed (~50 % VRAM saved on quantized portion) | 2.87 | Only if VRAM-bound. Marlin GEMM is currently 1.7 – 2.3 × slower than FP8 GEMM at V4's MoE shapes. | Prefill latencies (1 / 256 / 1024 tokens, ms): FP8 baseline 195 / 650 / 835, FP8RT 197 / 641 / 822, native_int4 330 / 1506 / 1928. ## Reproducibility & verification This release is **bit-exact reproducible** from the FP8 base + recipe. Verification was done at three levels: 1. **State equality**: GPU-side per-tensor fingerprint diff between `apply_recipe(fp8_base, recipe.yaml)` and this checkpoint loaded via `apply_overlay`. Result: 0 / 18,455 parameters and 0 / 191 buffers drift, on every TP rank. 2. **Forward equality**: tier-0 KL evaluated on a 500-prompt calibration cache — kl 0.064 / kl_p95 0.194 / agreement 0.922. 3. **Roundtrip integrity**: this artifact was uploaded to HF, downloaded fresh to a clean machine, and re-evaluated → kl 0.0649 / kl_p95 0.1837 / agreement 0.8906 (matches local within metric noise). MMLU 88.0 % is also from the HF-downloaded copy. Quantization is deterministic: same recipe + same FP8 base → byte-equal output. ## Caveats - **Format:** safetensors (PyTorch / transformers ecosystem), TP = 4 sharded — **not GGUF**. V4-Flash-Base does not yet have llama.cpp / GGUF support; once it lands upstream, this checkpoint can be repacked. - **`native_int4` mode is slower** than FP8 GEMM at V4's expert shapes. The default FP8RT mode sidesteps this — you get INT4 quality with FP8 speed. If you need the VRAM savings of packed storage at runtime, expect ~1.8 × decode latency. ## License This quantization inherits the upstream model license. See the [base model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base) for terms.