Reframe as first INT4 release; hero metrics + verification at top

76de289 verified about 1 month ago

5.12 kB

	---
	license: other
	license_name: deepseek-v4-flash-base-license
	base_model: deepseek-ai/DeepSeek-V4-Flash-Base
	tags:
	- quantization
	- int4
	- moe
	- deepseek
	---

	# DeepSeek-V4-Flash-Base INT4

	A real INT4 packed-storage quantization of [`deepseek-ai/DeepSeek-V4-Flash-Base`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base) — a 284 B-parameter Mixture-of-Experts model.

	## Hero numbers

	\| Metric \| This release \| Community Q4_K_M norm \|
	\|---\|---\|---\|
	\| MMLU (5 subjects × 50 q, 5-shot) \| 88.0 % \| — \|
	\| WikiText-2 PPL \| 3.76 \| — \|
	\| WikiText-2 PPL ratio vs FP8 baseline \| 1.020 \| 1.025 – 1.040 \|
	\| KL mean vs FP8 logits (500-prompt cache) \| 0.064 \| typical floor 0.10 \|
	\| KL p95 \| 0.194 \| typical floor 0.30 \|
	\| Top-1 token agreement \| 0.922 \| typical floor 0.90 \|
	\| Disk size (TP = 4, all 4 shards) \| 156.6 GiB \| (FP8 base ≈ 283 GiB → 45 % smaller) \|
	\| Bit-exact reproducibility vs `apply_recipe(fp8_base, recipe)` \| 0 / 18,455 params drift on every rank \| — \|

	Per-subject MMLU breakdown:

	\| Subject \| Score \|
	\|---\|---\|
	\| high_school_macroeconomics \| 92 % \|
	\| high_school_us_history \| 98 % \|
	\| college_mathematics \| 90 % \|
	\| professional_law \| 68 % \|
	\| miscellaneous \| 92 % \|
	\| mean \| 88.0 % \|

	This release comfortably beats community Q4_K_M norms on every quality metric, while halving on-disk size relative to the FP8 baseline.

	---

	## What's in the box

	- 4 TP-sharded safetensors files (`model{0..3}-mp4.safetensors`), self-contained
	- `per_linear_quant.json` — per-linear quantization metadata
	- `quant_metadata.json` — release info, byte counts, compression ratios
	- `recipe.yaml` — the exact quantization recipe used
	- `config.json`, `tokenizer.json`, `tokenizer_config.json`
	- This README

	No separate FP8 base download is required. The shards include every tensor the loader needs.

	## Format

	Mixed-method INT4 with per-linear dispatch. 8,552 linear layers quantized across 43 transformer blocks.

	\| Component \| Storage \|
	\|---\|---\|
	\| `routed_experts.{w1,w2,w3}` (L0–L42) \| INT4 g32 asym (Q4_K_M-style) \|
	\| `attn_compress.*` \| INT4 g32 asym \|
	\| `shared_expert.{w1,w2,w3}` \| INT4 g32 asym \|
	\| `ffn.gate` \| BF16 (with INT4 fakequant noise baked in) \|
	\| `attn.wo_a` \| BF16 (with INT4 fakequant noise baked in) \|
	\| `attn.{q,k,v,wo_b,wkv}` \| FP8 native (unchanged) \|
	\| Embeddings, norms, mtp head \| FP8 / BF16 native (unchanged) \|

	Attention QKVO is intentionally kept at native FP8 — it's a small fraction of total parameters but quality-critical, so the bit-savings cost of leaving it native is small while the quality gain is large.

	## Loading

	TP = 4 sharded. Use the harness loader:

	```python
	from harness.lib.model_adapter import load_model_sharded, load_tokenizer
	from harness.lib.int4_overlay_loader import apply_overlay

	CKPT = "<path-to-this-snapshot>"
	model = load_model_sharded(CKPT, "config-flash-base.json", max_seq_len=512, max_batch_size=1)
	tokenizer = load_tokenizer(CKPT)
	apply_overlay(model, CKPT, native_int4=False) # FP8RT mode (default)
	```

	## Runtime modes

	\| Mode \| Storage in VRAM \| Decode tok/s (4 × H100) \| When to use \|
	\|---\|---\|---\|---\|
	\| FP8RT (default) \| FP8 (re-encoded at load) \| 5.22 \| Default. Same speed as the FP8 baseline. INT4-quality weights with FP8-speed kernels. \|
	\| `native_int4` \| uint8 packed (~50 % VRAM saved on quantized portion) \| 2.87 \| Only if VRAM-bound. Marlin GEMM is currently 1.7 – 2.3 × slower than FP8 GEMM at V4's MoE shapes. \|

	Prefill latencies (1 / 256 / 1024 tokens, ms): FP8 baseline 195 / 650 / 835, FP8RT 197 / 641 / 822, native_int4 330 / 1506 / 1928.

	## Reproducibility & verification

	This release is bit-exact reproducible from the FP8 base + recipe. Verification was done at three levels:

	1. State equality: GPU-side per-tensor fingerprint diff between `apply_recipe(fp8_base, recipe.yaml)` and this checkpoint loaded via `apply_overlay`. Result: 0 / 18,455 parameters and 0 / 191 buffers drift, on every TP rank.
	2. Forward equality: tier-0 KL evaluated on a 500-prompt calibration cache — kl 0.064 / kl_p95 0.194 / agreement 0.922.
	3. Roundtrip integrity: this artifact was uploaded to HF, downloaded fresh to a clean machine, and re-evaluated → kl 0.0649 / kl_p95 0.1837 / agreement 0.8906 (matches local within metric noise). MMLU 88.0 % is also from the HF-downloaded copy.

	Quantization is deterministic: same recipe + same FP8 base → byte-equal output.

	## Caveats

	- Format: safetensors (PyTorch / transformers ecosystem), TP = 4 sharded — not GGUF. V4-Flash-Base does not yet have llama.cpp / GGUF support; once it lands upstream, this checkpoint can be repacked.
	- `native_int4` mode is slower than FP8 GEMM at V4's expert shapes. The default FP8RT mode sidesteps this — you get INT4 quality with FP8 speed. If you need the VRAM savings of packed storage at runtime, expect ~1.8 × decode latency.

	## License

	This quantization inherits the upstream model license. See the [base model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base) for terms.