---
license: other
license_name: deepseek-v4-flash-base-license
base_model: deepseek-ai/DeepSeek-V4-Flash-Base
tags:
- quantization
- int4
- moe
- deepseek
---

# DeepSeek-V4-Flash-Base INT4

A packed-storage INT4 quantization of [`deepseek-ai/DeepSeek-V4-Flash-Base`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base), a 284 B-parameter Mixture-of-Experts model.

## Hero numbers

| Metric | This release | Community Q4_K_M norm |
|---|---|---|
| **MMLU** (5 subjects × 50 q, 5-shot) | **88.0 %** | n/a |
| **WikiText-2 PPL** | 3.76 | n/a |
| **WikiText-2 PPL ratio** vs FP8 baseline | **1.020** | 1.025 – 1.040 |
| **KL mean** vs FP8 logits (500-prompt cache) | **0.064** | typical floor 0.10 |
| **KL p95** | **0.194** | typical floor 0.30 |
| **Top-1 token agreement** | **0.922** | typical floor 0.90 |
| **Disk size** (TP = 4, all 4 shards) | **156.6 GiB** | (FP8 base ≈ 283 GiB → 45 % smaller) |
| **Bit-exact reproducibility** vs `apply_recipe(fp8_base, recipe)` | 0 / 18,455 parameters differ on every rank | n/a |

Per-subject MMLU breakdown:

| Subject | Score |
|---|---|
| high_school_macroeconomics | 92 % |
| high_school_us_history | 98 % |
| college_mathematics | 90 % |
| professional_law | 68 % |
| miscellaneous | 92 % |
| **mean** | **88.0 %** |

This release **beats the community Q4_K_M reference figures** on every metric for which one is listed (PPL ratio, KL mean, KL p95, top-1 agreement), while cutting on-disk size by about 45 % relative to the FP8 baseline.

---

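The headline arithmetic can be checked in a few lines. This is a minimal sketch that only re-derives the MMLU mean and the size reduction from the figures quoted in this card; the variable names are illustrative and not part of any release tooling.

```python
# Figures copied from the tables in this card.
mmlu_scores = {
    "high_school_macroeconomics": 92,
    "high_school_us_history": 98,
    "college_mathematics": 90,
    "professional_law": 68,
    "miscellaneous": 92,
}
mmlu_mean = sum(mmlu_scores.values()) / len(mmlu_scores)

INT4_GIB = 156.6  # all 4 shards of this release
FP8_GIB = 283.0   # approximate size of the FP8 base
size_reduction = 1 - INT4_GIB / FP8_GIB

print(f"MMLU mean: {mmlu_mean:.1f} %")               # MMLU mean: 88.0 %
print(f"Smaller than FP8 by: {size_reduction:.1%}")  # Smaller than FP8 by: 44.7%
```
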
## What's in the box

- 4 TP-sharded safetensors files (`model{0..3}-mp4.safetensors`), self-contained
- `per_linear_quant.json`: per-linear quantization metadata
- `quant_metadata.json`: release info, byte counts, compression ratios
- `recipe.yaml`: the exact quantization recipe used
- `config.json`, `tokenizer.json`, `tokenizer_config.json`
- This README

No separate FP8 base download is required. The shards include every tensor the loader needs.

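Before loading, it can help to confirm that a downloaded snapshot contains everything listed above. The sketch below assumes `model{0..3}-mp4.safetensors` expands to `model0-mp4.safetensors` through `model3-mp4.safetensors`; the helper name `missing_files` is hypothetical and not part of the release.

```python
from pathlib import Path

# File names taken from the list above; the shard names assume that
# `model{0..3}` means indices 0 through 3.
EXPECTED_FILES = [f"model{i}-mp4.safetensors" for i in range(4)] + [
    "per_linear_quant.json",
    "quant_metadata.json",
    "recipe.yaml",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_files(snapshot_dir):
    """Return the expected file names that are absent from snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files("<path-to-this-snapshot>")` returns an empty list when the snapshot is complete.
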
## Format

Mixed-method INT4 with per-linear dispatch. **8,552 linear layers** are quantized across 43 transformer blocks. In the table, `g32 asym` denotes asymmetric quantization with a group size of 32.

| Component | Storage |
|---|---|
| `routed_experts.{w1,w2,w3}` (L0–L42) | INT4 g32 asym (Q4_K_M-style) |
| `attn_compress.*` | INT4 g32 asym |
| `shared_expert.{w1,w2,w3}` | INT4 g32 asym |
| `ffn.gate` | BF16 (with INT4 fakequant noise baked in) |
| `attn.wo_a` | BF16 (with INT4 fakequant noise baked in) |
| `attn.{q,k,v,wo_b,wkv}` | FP8 native (unchanged) |
| Embeddings, norms, mtp head | FP8 / BF16 native (unchanged) |

The attention projections `attn.{q,k,v,wo_b,wkv}` are intentionally kept at native FP8. They account for a small fraction of total parameters but are quality-critical, so leaving them unquantized costs little in size and protects quality.

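To make the "INT4 g32 asym" entries concrete, here is a minimal pure-Python sketch of group-wise asymmetric 4-bit quantization with two codes packed per byte. It is a generic min/max scheme under stated assumptions, not the release's recipe: the Q4_K_M-style details (such as how scales and offsets are themselves stored) are not modeled, and every function name and the nibble order are illustrative.

```python
import math

GROUP_SIZE = 32  # the "g32" in the table
QMAX = 15        # unsigned 4-bit codes span 0..15


def quantize_group(values):
    """Asymmetric min/max quantization of one group to 4-bit codes."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / QMAX or 1.0  # constant group: any non-zero scale works
    codes = [min(QMAX, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map 4-bit codes back to approximate real values."""
    return [lo + c * scale for c in codes]


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (an arbitrary choice)."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    codes = []
    for byte in packed:
        codes.extend((byte & 0x0F, byte >> 4))
    return codes


weights = [math.sin(0.37 * i) for i in range(2 * GROUP_SIZE)]  # toy data
for start in range(0, len(weights), GROUP_SIZE):
    group = weights[start:start + GROUP_SIZE]
    codes, scale, lo = quantize_group(group)
    packed = pack_nibbles(codes)
    restored = dequantize_group(unpack_nibbles(packed), scale, lo)
    max_err = max(abs(a - b) for a, b in zip(group, restored))
    print(f"group @{start}: {len(packed)} bytes, max abs error {max_err:.4f}")
```

Each group of 32 weights packs into 16 bytes plus its own scale and minimum, and the reconstruction error within a group is bounded by half the group's scale.
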
## Loading

The checkpoint is sharded for TP = 4. Use the harness loader:

```python
from harness.lib.model_adapter import load_model_sharded, load_tokenizer
from harness.lib.int4_overlay_loader import apply_overlay

# Local path to the downloaded snapshot of this repository.
CKPT = "<path-to-this-snapshot>"

# Build the sharded model and tokenizer from the snapshot.
model = load_model_sharded(CKPT, "config-flash-base.json", max_seq_len=512, max_batch_size=1)
tokenizer = load_tokenizer(CKPT)

# Apply the INT4 overlay; native_int4=False selects FP8RT mode (default).
apply_overlay(model, CKPT, native_int4=False)
```

## Runtime modes

| Mode | Storage in VRAM | Decode tok/s (4 × H100) | When to use |
|---|---|---|---|
| **FP8RT (default)** | FP8 (re-encoded at load) | **5.22** | Default. Same speed as the FP8 baseline. INT4-quality weights with FP8-speed kernels. |
| `native_int4` | uint8 packed (~50 % VRAM saved on quantized portion) | 2.87 | Only if VRAM-bound. Marlin GEMM is currently 1.7 – 2.3 × slower than FP8 GEMM at V4's MoE shapes. |

Prefill latencies by prompt length:

| Prefill latency (ms) | 1 token | 256 tokens | 1024 tokens |
|---|---|---|---|
| FP8 baseline | 195 | 650 | 835 |
| FP8RT | 197 | 641 | 822 |
| `native_int4` | 330 | 1506 | 1928 |

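The slowdown factors quoted for `native_int4` follow directly from the latencies and throughputs in this section. The snippet below is a small sketch that recomputes them; the dictionary keys and the `prefill_slowdown` helper are illustrative names only.

```python
# Prefill latencies in ms for prompts of 1 / 256 / 1024 tokens (from this section).
prefill_ms = {
    "fp8_baseline": (195, 650, 835),
    "fp8rt": (197, 641, 822),
    "native_int4": (330, 1506, 1928),
}
# Decode throughput in tokens per second on 4 x H100 (from the table).
decode_tok_s = {"fp8rt": 5.22, "native_int4": 2.87}


def prefill_slowdown(mode):
    """Latency of `mode` divided by the FP8 baseline, per prompt length."""
    return [m / b for m, b in zip(prefill_ms[mode], prefill_ms["fp8_baseline"])]


int4_prefill = [round(r, 2) for r in prefill_slowdown("native_int4")]
fp8rt_prefill = [round(r, 2) for r in prefill_slowdown("fp8rt")]
decode_ratio = round(decode_tok_s["fp8rt"] / decode_tok_s["native_int4"], 2)

print("native_int4 prefill slowdown:", int4_prefill)  # [1.69, 2.32, 2.31]
print("FP8RT prefill slowdown:", fp8rt_prefill)       # [1.01, 0.99, 0.98]
print("decode slowdown of native_int4:", decode_ratio)  # 1.82
```

The prefill ratios span roughly 1.7 to 2.3, and the decode ratio is about 1.8, matching the figures stated in the table.
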
## Reproducibility & verification

This release is **bit-exact reproducible** from the FP8 base + recipe. Verification was done at three levels:

1. **State equality**: GPU-side per-tensor fingerprint diff between `apply_recipe(fp8_base, recipe.yaml)` and this checkpoint loaded via `apply_overlay`. Result: 0 / 18,455 parameters and 0 / 191 buffers differ, on every TP rank.
2. **Forward equality**: tier-0 KL evaluated on a 500-prompt calibration cache. Result: KL mean 0.064 / KL p95 0.194 / top-1 agreement 0.922.
3. **Roundtrip integrity**: this artifact was uploaded to HF, downloaded fresh to a clean machine, and re-evaluated. Result: KL mean 0.0649 / KL p95 0.1837 / top-1 agreement 0.8906 (matches the local run within metric noise). The MMLU score of 88.0 % is also from the HF-downloaded copy.

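For readers unfamiliar with the forward-equality metrics, the sketch below shows one common way to compute KL mean, KL p95, and top-1 agreement from paired logits. It is an illustration under assumptions, not the evaluation harness used here: the KL direction (reference to quantized), the nearest-rank percentile, and the toy logits are all choices made for this example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for one token position (assumed direction)."""
    lp, lq = log_softmax(ref_logits), log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Toy logits for three token positions: reference model vs quantized model.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0], [1.0, 1.1, 3.0]]
quant = [[1.9, 1.1, 0.1], [0.6, 2.4, 0.1], [3.2, 1.0, 3.0]]

kls = [kl_divergence(r, t) for r, t in zip(ref, quant)]
kl_mean = sum(kls) / len(kls)
kl_p95 = percentile(kls, 95)
agreement = sum(argmax(r) == argmax(t) for r, t in zip(ref, quant)) / len(ref)

print(f"kl_mean={kl_mean:.4f} kl_p95={kl_p95:.4f} agreement={agreement:.3f}")
```
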
Quantization is deterministic: same recipe + same FP8 base → byte-equal output.

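Byte equality between a regenerated checkpoint and a downloaded one can be checked by comparing file digests. The following is a minimal sketch using SHA-256 from the standard library; the `compare_shards` helper and the glob pattern derived from the shard naming are assumptions for illustration, not release tooling.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_shards(dir_a, dir_b, pattern="model*-mp4.safetensors"):
    """Return names of shards in dir_a that are missing or differ in dir_b."""
    mismatched = []
    for shard_a in sorted(Path(dir_a).glob(pattern)):
        shard_b = Path(dir_b) / shard_a.name
        if not shard_b.is_file() or sha256_file(shard_a) != sha256_file(shard_b):
            mismatched.append(shard_a.name)
    return mismatched
```

An empty return value from `compare_shards(regenerated_dir, downloaded_dir)` means every shard found in the first directory has a byte-identical counterpart in the second.
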
## Caveats

- **Format:** safetensors (PyTorch / transformers ecosystem), TP = 4 sharded, **not GGUF**. V4-Flash-Base does not yet have llama.cpp / GGUF support; once it lands upstream, this checkpoint can be repacked.
- **`native_int4` mode is slower** than FP8RT, because its INT4 GEMM is slower than FP8 GEMM at V4's expert shapes. The default FP8RT mode sidesteps this: you get INT4-quantized weights at FP8 speed. If you need the VRAM savings of packed storage at runtime, expect ~1.8 × the decode latency (2.87 vs 5.22 tok/s on 4 × H100).

## License

This quantization inherits the upstream model license. See the [base model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base) for terms.
