DeepSeek V4 Flash — GGUF for ds4 (imatrix, aligned)

These are quants for the DS4 inference engine, built with an importance matrix (imatrix) and with tensor offsets aligned for efficient mmap on Apple Silicon. They are a drop-in replacement for the files published in antirez/deepseek-v4-gguf, with two changes:

The IQ2XXS quant was recomputed using an imatrix calibrated on chat-v2 traffic, which restores quality on tool-calling and instruction-following prompts that the original blind quant degraded.
All tensor data offsets are page-aligned, which lets the runtime mmap the file directly without an extra copy.

The MTP file is reproduced here unchanged so a single repo holds everything download_model.sh expects.

Files

File	Size	Routed experts (`ffn_{gate,up,down}_exps`)	Everything else
`DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-aligned.gguf`	81 GiB	`IQ2_XXS` (gate, up) + `Q2_K` (down), imatrix-calibrated	`Q8_0` attn proj / shared experts / output, `F16` router + embed + indexer + compressor + HC, `F32` norms / sinks / bias
`DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf`	3.5 GiB	MTP / speculative-decoding support (optional, not standalone)
`DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-aligned.dat`	430 MiB	Raw imatrix used to produce the IQ2XXS quant above (for reproducibility)

Use the IQ2XXS quant on Macs with 128 GB of memory. Pair it with the MTP file if you want speculative decoding.

What the imatrix changes

IQ2_XXS is a 2.0625-bit-per-weight quant. At that budget the per-tensor scales matter a lot, and a blind quant (one computed without calibration data) tends to underweight the rows that carry tool-call tokens and rarely-routed experts. Calibrating against an imatrix gathered from the chat-v2 corpus shifts the scales toward those rows, which in practice recovers most of the regression seen on:

function/tool-call emission (well-formed JSON, correct argument names),
long-context instruction following,
code generation in the languages most represented in the calibration set.

The imatrix itself is shipped as DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-aligned.dat so the quant is fully reproducible.
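
To make the effect concrete, here is a toy sketch of importance-weighted scale selection for one block of weights. It is not the IQ2_XXS algorithm (which is codebook-based) and not the code used to build these files: the uniform quantizer, the function names, and the synthetic importance values are all assumptions chosen for illustration.

```python
import numpy as np

def weighted_error(w, importance, scale, levels=2):
    """Importance-weighted squared error of a toy symmetric quantizer."""
    q = np.clip(np.round(w / scale), -levels, levels)
    return float(np.sum(importance * (w - q * scale) ** 2))

def best_scale(w, importance, levels=2, n_candidates=200):
    """Grid-search the scale that minimizes the weighted error for one block."""
    max_abs = float(np.max(np.abs(w)))
    candidates = np.linspace(max_abs / n_candidates, max_abs, n_candidates)
    errors = [weighted_error(w, importance, s, levels) for s in candidates]
    return float(candidates[int(np.argmin(errors))])

rng = np.random.default_rng(0)
w = rng.normal(size=256)            # synthetic weights, one block
importance = np.ones(256)
importance[:16] = 50.0              # pretend calibration flagged these entries

blind_scale = best_scale(w, np.ones_like(w))   # every weight counts equally
imatrix_scale = best_scale(w, importance)      # calibration-weighted

err_blind = weighted_error(w, importance, blind_scale)
err_imatrix = weighted_error(w, importance, imatrix_scale)
print(f"weighted error, blind scale:   {err_blind:.3f}")
print(f"weighted error, imatrix scale: {err_imatrix:.3f}")
```

Both scales come from the same candidate grid, so the calibration-weighted choice can never score worse on the weighted error than the blind one. How much it helps depends on how unevenly the importance is distributed.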

Alignment

GGUF files store tensor data after a metadata header. By default the header length is whatever it happens to be, so tensor offsets are not aligned to a page boundary. On Apple Silicon, mmap-ing a misaligned tensor forces an extra copy into an aligned buffer before Metal can use it. The files here are padded so every tensor sits on a 4096-byte boundary, which lets DS4 hand the mapped pointers straight to Metal.

The alignment is transparent to GGUF readers — any loader that respects the general.alignment field will work unchanged.
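
The padding arithmetic can be sketched as follows. This is a simplified illustration using absolute file offsets and made-up sizes, not the layout code of any particular GGUF writer. The helper names are hypothetical.

```python
def align_up(offset: int, alignment: int = 4096) -> int:
    """Round offset up to the next multiple of alignment (must be a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)

def padded_offsets(header_len, tensor_sizes, alignment=4096):
    """Start offset of each tensor when every tensor begins on an aligned boundary."""
    offsets = []
    cursor = align_up(header_len, alignment)
    for size in tensor_sizes:
        offsets.append(cursor)
        cursor = align_up(cursor + size, alignment)
    return offsets

# Illustrative numbers only: a 5000-byte header followed by three tensors.
offsets = padded_offsets(header_len=5000, tensor_sizes=[100, 8192, 1])
print(offsets)  # [8192, 12288, 20480]
```

The cost is at most one alignment unit of padding per tensor, which is negligible next to multi-gigabyte expert tensors.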

Quantization recipe

The filename is the spec. In detail:

Tensor class	Quant	Notes
`blk.*.ffn_gate_exps`, `blk.*.ffn_up_exps`	`IQ2_XXS`	routed-expert up/gate, imatrix-calibrated
`blk.*.ffn_down_exps`	`Q2_K`	routed-expert down (K-quant for quality)
`blk.*.ffn_{gate,up,down}_shexp`	`Q8_0`	shared experts
`blk.*.attn_q_a`, `attn_q_b`, `attn_kv`, `attn_output_a`, `attn_output_b`	`Q8_0`	all attention projections (MLA + low-rank output)
`output.weight`	`Q8_0`	output head
`token_embd.weight`	`F16`	input embedding
`blk.*.ffn_gate_inp` (router)	`F16`	learned router
`blk.*.exp_probs_b` (router bias), `blk.*.attn_sinks`, all `*_norm.weight`	`F32`
`blk.*.ffn_gate_tid2eid`	`I32`	hash-routing tables (first 3 layers only)
`blk.*.attn_compressor_*`, `blk.*.indexer_*`, `blk.*.hc_*`, `blk.*.output_hc_*`	`F16` / `F32`	DSv4-specific auxiliary blocks
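
For readers who want to check a file against this recipe, part of the table can be restated as glob patterns. This is a sketch only: it covers a subset of the rows, the trailing `*` on some patterns and the example tensor names (including the `.weight` suffix) are assumptions, and `quant_for` is a hypothetical helper, not part of DS4.

```python
from fnmatch import fnmatchcase

# First matching pattern wins. Norms come first so that a norm tensor is
# never captured by a broader attention or expert pattern.
RECIPE = [
    ("*_norm.weight", "F32"),
    ("blk.*.ffn_gate_exps*", "IQ2_XXS"),
    ("blk.*.ffn_up_exps*", "IQ2_XXS"),
    ("blk.*.ffn_down_exps*", "Q2_K"),
    ("blk.*.ffn_*_shexp*", "Q8_0"),
    ("blk.*.ffn_gate_inp*", "F16"),
    ("output.weight", "Q8_0"),
    ("token_embd.weight", "F16"),
]

def quant_for(tensor_name):
    """Return the expected quant type for a tensor name, or None if unlisted."""
    for pattern, quant in RECIPE:
        if fnmatchcase(tensor_name, pattern):
            return quant
    return None

# Hypothetical tensor names, for illustration.
for name in ("blk.3.ffn_gate_exps.weight", "blk.3.ffn_down_exps.weight",
             "blk.3.attn_norm.weight", "output.weight"):
    print(name, "->", quant_for(name))
```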

The motivation behind the asymmetry: the routed experts make up the majority of the parameter count, but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of the router, projections, or shared experts. Keeping the decision-making components at Q8_0 or higher precision preserves model behavior; crushing the experts buys the size.
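
A back-of-the-envelope calculation shows where the size goes. The bits-per-weight figures follow from the block layouts of each format. The equal sizing of the gate, up, and down expert tensors and the 95% expert share used in the example call are assumptions for illustration, not measurements of this model.

```python
# Bits per weight = bytes per block * 8 / weights per block.
BPW = {
    "IQ2_XXS": 66 * 8 / 256,  # 2.0625
    "Q2_K": 84 * 8 / 256,     # 2.625
    "Q8_0": 34 * 8 / 32,      # 8.5
}

# Assumption: gate, up, and down expert tensors hold the same number of weights.
routed_bpw = (2 * BPW["IQ2_XXS"] + BPW["Q2_K"]) / 3

def blended_bpw(expert_fraction, other_bpw=BPW["Q8_0"]):
    """Average bits per weight when `expert_fraction` of weights are routed experts."""
    return expert_fraction * routed_bpw + (1.0 - expert_fraction) * other_bpw

print(f"routed experts: {routed_bpw:.4f} bpw")
# Hypothetical share of parameters in routed experts, chosen for illustration.
print(f"blended at 95% experts: {blended_bpw(0.95):.4f} bpw")
```

Under these assumptions the routed experts average 2.25 bits per weight, so the overall file size is dominated by the expert quant even though everything else sits at roughly four times that precision.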

Usage

git clone https://github.com/antirez/ds4
cd ds4
# Fetch the imatrix-aligned q2 + MTP from this repo:
hf download jedisct1/DeepSeek-V4-Flash-imatrix-aligned \
    DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-aligned.gguf \
    DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf \
    --local-dir .
ln -sf DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-aligned.gguf ds4flash.gguf
make

./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

License

MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model's release terms.
