---
license: other
license_name: deepseek-v4-flash-base-license
base_model: deepseek-ai/DeepSeek-V4-Flash-Base
tags:
- quantization
- int4
- moe
- deepseek
---

# DeepSeek-V4-Flash-Base INT4

A real INT4 packed-storage quantization of [`deepseek-ai/DeepSeek-V4-Flash-Base`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base) — a 284 B-parameter Mixture-of-Experts model.

## Hero numbers

| Metric | This release | Community Q4_K_M norm |
|---|---|---|
| **MMLU** (5 subjects × 50 q, 5-shot) | **88.0 %** | — |
| **WikiText-2 PPL** | 3.76 | — |
| **WikiText-2 PPL ratio** vs FP8 baseline | **1.020** | 1.025 – 1.040 |
| **KL mean** vs FP8 logits (500-prompt cache) | **0.064** | typical floor 0.10 |
| **KL p95** | **0.194** | typical floor 0.30 |
| **Top-1 token agreement** | **0.922** | typical floor 0.90 |
| **Disk size** (TP = 4, all 4 shards) | **156.6 GiB** | (FP8 base ≈ 283 GiB → 45 % smaller) |
| **Bit-exact reproducibility** vs `apply_recipe(fp8_base, recipe)` | 0 / 18,455 params drift on every rank | — |

Per-subject MMLU breakdown:

| Subject | Score |
|---|---|
| high_school_macroeconomics | 92 % |
| high_school_us_history | 98 % |
| college_mathematics | 90 % |
| professional_law | 68 % |
| miscellaneous | 92 % |
| **mean** | **88.0 %** |

This release **comfortably beats community Q4_K_M norms** on every quality metric, while halving on-disk size relative to the FP8 baseline.

---

## What's in the box

- 4 TP-sharded safetensors files (`model{0..3}-mp4.safetensors`), self-contained
- `per_linear_quant.json` — per-linear quantization metadata
- `quant_metadata.json` — release info, byte counts, compression ratios
- `recipe.yaml` — the exact quantization recipe used
- `config.json`, `tokenizer.json`, `tokenizer_config.json`
- This README

No separate FP8 base download is required. The shards include every tensor the loader needs.

## Format

Mixed-method INT4 with per-linear dispatch. **8,552 linear layers** quantized across 43 transformer blocks.

| Component | Storage |
|---|---|
| `routed_experts.{w1,w2,w3}` (L0–L42) | INT4 g32 asym (Q4_K_M-style) |
| `attn_compress.*` | INT4 g32 asym |
| `shared_expert.{w1,w2,w3}` | INT4 g32 asym |
| `ffn.gate` | BF16 (with INT4 fakequant noise baked in) |
| `attn.wo_a` | BF16 (with INT4 fakequant noise baked in) |
| `attn.{q,k,v,wo_b,wkv}` | FP8 native (unchanged) |
| Embeddings, norms, mtp head | FP8 / BF16 native (unchanged) |

Attention QKVO is intentionally kept at native FP8 — it's a small fraction of total parameters but quality-critical, so the bit-savings cost of leaving it native is small while the quality gain is large.

## Loading

TP = 4 sharded. Use the harness loader:

```python
from harness.lib.model_adapter import load_model_sharded, load_tokenizer
from harness.lib.int4_overlay_loader import apply_overlay

CKPT = "<path-to-this-snapshot>"
model = load_model_sharded(CKPT, "config-flash-base.json", max_seq_len=512, max_batch_size=1)
tokenizer = load_tokenizer(CKPT)
apply_overlay(model, CKPT, native_int4=False)   # FP8RT mode (default)
```

## Runtime modes

| Mode | Storage in VRAM | Decode tok/s (4 × H100) | When to use |
|---|---|---|---|
| **FP8RT (default)** | FP8 (re-encoded at load) | **5.22** | Default. Same speed as the FP8 baseline. INT4-quality weights with FP8-speed kernels. |
| `native_int4` | uint8 packed (~50 % VRAM saved on quantized portion) | 2.87 | Only if VRAM-bound. Marlin GEMM is currently 1.7 – 2.3 × slower than FP8 GEMM at V4's MoE shapes. |

Prefill latencies (1 / 256 / 1024 tokens, ms): FP8 baseline 195 / 650 / 835, FP8RT 197 / 641 / 822, native_int4 330 / 1506 / 1928.

## Reproducibility & verification

This release is **bit-exact reproducible** from the FP8 base + recipe. Verification was done at three levels:

1. **State equality**: GPU-side per-tensor fingerprint diff between `apply_recipe(fp8_base, recipe.yaml)` and this checkpoint loaded via `apply_overlay`. Result: 0 / 18,455 parameters and 0 / 191 buffers drift, on every TP rank.
2. **Forward equality**: tier-0 KL evaluated on a 500-prompt calibration cache — kl 0.064 / kl_p95 0.194 / agreement 0.922.
3. **Roundtrip integrity**: this artifact was uploaded to HF, downloaded fresh to a clean machine, and re-evaluated → kl 0.0649 / kl_p95 0.1837 / agreement 0.8906 (matches local within metric noise). MMLU 88.0 % is also from the HF-downloaded copy.

Quantization is deterministic: same recipe + same FP8 base → byte-equal output.

## Caveats

- **Format:** safetensors (PyTorch / transformers ecosystem), TP = 4 sharded — **not GGUF**. V4-Flash-Base does not yet have llama.cpp / GGUF support; once it lands upstream, this checkpoint can be repacked.
- **`native_int4` mode is slower** than FP8 GEMM at V4's expert shapes. The default FP8RT mode sidesteps this — you get INT4 quality with FP8 speed. If you need the VRAM savings of packed storage at runtime, expect ~1.8 × decode latency.

## License

This quantization inherits the upstream model license. See the [base model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base) for terms.