DeepSeek V4 Flash: GGUF for ds4 (imatrix, aligned)

These are quants for the DS4 inference engine, built with an importance matrix (imatrix) and with tensor offsets aligned for efficient mmap on Apple Silicon. They are a drop-in replacement for the files published in antirez/deepseek-v4-gguf, with two changes:

  • The IQ2XXS quant was recomputed using an imatrix calibrated on chat-v2 traffic, which restores quality on tool-calling and instruction-following prompts that the original blind quant degraded.
  • All tensor data offsets are page-aligned, which lets the runtime mmap the file directly without an extra copy.

The MTP file is reproduced here unchanged so a single repo holds everything download_model.sh expects.

Files

| File | Size | Routed experts (ffn_{gate,up,down}_exps) | Everything else |
| --- | --- | --- | --- |
| DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-aligned.gguf | 81 GiB | IQ2_XXS (gate, up) + Q2_K (down), imatrix-calibrated | Q8_0 attn proj / shared experts / output, F16 router + embed + indexer + compressor + HC, F32 norms / sinks / bias |
| DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf | 3.5 GiB | MTP / speculative-decoding support (optional, not standalone) | |
| DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-aligned.dat | 430 MiB | Raw imatrix used to produce the IQ2XXS quant above (for reproducibility) | |

Use the IQ2XXS quant on Macs with 128 GB of memory. Pair it with the MTP file if you want speculative decoding; the MTP file is optional.

What the imatrix changes

IQ2_XXS is a 2.0625-bit-per-weight quant. At that budget the per-tensor scales matter a lot, and a blind quant tends to underweight the rows that carry tool-call tokens and rarely-routed experts. Calibrating against an imatrix gathered from the chat-v2 corpus shifts the scales toward those rows, which in practice recovers most of the regression seen on:

  • function/tool-call emission (well-formed JSON, correct argument names),
  • long-context instruction following,
  • code generation in the languages most represented in the calibration set.

The imatrix itself is shipped as DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-aligned.dat so the quant is fully reproducible.
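To make the effect of calibration concrete, here is a minimal Python sketch of importance-weighted scale selection. It is a toy under loud assumptions: the 4-level grid, the candidate scales, the weights, and the importance values are all hypothetical, and the real IQ2_XXS quantizer uses a codebook rather than this uniform grid. The only point is that weighting the squared error by importance can change which scale wins.

```python
# Toy illustration only: a symmetric 2-bit grid, NOT the IQ2_XXS codebook.
LEVELS = (-1.5, -0.5, 0.5, 1.5)

def quantize(weights, scale):
    """Round each weight to the nearest grid level, then dequantize."""
    return [scale * min(LEVELS, key=lambda q: abs(w - scale * q)) for w in weights]

def weighted_error(weights, dequantized, importance):
    """Importance-weighted squared error: sum_i imp_i * (w_i - wq_i)^2."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))

def best_scale(weights, importance, candidates):
    """Pick the candidate scale with the lowest weighted error."""
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize(weights, s), importance),
    )

# Hypothetical block: one large weight and three small ones.
weights = [0.9, -0.1, 0.05, 0.1]
candidates = [0.1 * k for k in range(1, 11)]  # 0.1 .. 1.0

# "Blind" quant: every weight counts equally.
blind = best_scale(weights, [1.0] * 4, candidates)
# "Calibrated" quant: hypothetical importances say the large weight rarely matters.
calibrated = best_scale(weights, [0.01, 1.0, 1.0, 1.0], candidates)

print(f"blind scale={blind:.1f}  calibrated scale={calibrated:.1f}")
```

In this toy, down-weighting the large first weight lets the search settle on a smaller scale that fits the three small weights more closely. The blind search instead compromises on a larger scale to limit the error on the outlier.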

Alignment

GGUF files store tensor data after a metadata header. By default the header length is whatever it happens to be, so tensor offsets are not aligned to a page boundary. On Apple Silicon, mmap-ing a misaligned tensor forces an extra copy into an aligned buffer before Metal can use it. The files here are padded so every tensor sits on a 4096-byte boundary, which lets DS4 hand the mapped pointers straight to Metal.

The alignment is transparent to GGUF readers: any loader that respects the general.alignment field will work unchanged.
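The padding rule can be sketched in a few lines of Python. This is an illustration, not the tool used to build these files: the header length and tensor sizes are hypothetical, and it works with absolute file offsets for simplicity, whereas GGUF records tensor offsets relative to the start of the tensor-data section.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)

def layout(header_len, tensor_sizes, alignment=4096):
    """Return (offset, padding_before) for each tensor, placed in order."""
    placed, cursor = [], header_len
    for size in tensor_sizes:
        start = align_up(cursor, alignment)
        placed.append((start, start - cursor))
        cursor = start + size
    return placed

# Hypothetical header length and tensor sizes, in bytes.
for start, pad in layout(header_len=5_000, tensor_sizes=[10_000, 4_096, 123]):
    print(f"offset={start:>6}  padding={pad:>5}  aligned={start % 4096 == 0}")
```

The cost is at most alignment - 1 bytes of padding per tensor, which is negligible next to multi-megabyte expert tensors.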

Quantization recipe

The filename is the spec. In detail:

| Tensor class | Quant | Notes |
| --- | --- | --- |
| blk.*.ffn_gate_exps, blk.*.ffn_up_exps | IQ2_XXS | routed-expert up/gate, imatrix-calibrated |
| blk.*.ffn_down_exps | Q2_K | routed-expert down (K-quant for quality) |
| blk.*.ffn_{gate,up,down}_shexp | Q8_0 | shared experts |
| blk.*.attn_q_a, attn_q_b, attn_kv, attn_output_a, attn_output_b | Q8_0 | all attention projections (MLA + low-rank output) |
| output.weight | Q8_0 | output head |
| token_embd.weight | F16 | input embedding |
| blk.*.ffn_gate_inp (router) | F16 | learned router |
| blk.*.exp_probs_b (router bias), blk.*.attn_sinks, all *_norm.weight | F32 | |
| blk.*.ffn_gate_tid2eid | I32 | hash-routing tables (first 3 layers only) |
| blk.*.attn_compressor_*, blk.*.indexer_*, blk.*.hc_*, blk.*.output_hc_* | F16 / F32 | DSv4-specific auxiliary blocks |

The motivation behind the asymmetry: the routed experts are the majority of the parameter count but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of router, projections, or shared experts. Keeping the decision-making components at Q8_0 preserves model behavior; crushing the experts buys the size.
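The size side of that trade-off is simple arithmetic on block layouts. The Python sketch below derives bits per weight from the ggml block sizes as I understand them (treat the byte breakdowns in the comments as an assumption to check against the ggml source). The 1-billion-parameter figure is a made-up round number for illustration, not a count taken from this model.

```python
# (bytes per block, weights per block); layouts assumed from ggml's block structs.
BLOCKS = {
    "IQ2_XXS": (2 + 64, 256),           # fp16 scale + 64 bytes of packed grid indices/signs
    "Q2_K":    (16 + 64 + 2 + 2, 256),  # sub-block scales/mins + 2-bit quants + fp16 d, dmin
    "Q8_0":    (2 + 32, 32),            # fp16 scale + 32 int8 values
    "F16":     (2, 1),                  # plain half precision
}

def bits_per_weight(quant: str) -> float:
    """Average storage cost of one weight, block overhead included."""
    nbytes, nweights = BLOCKS[quant]
    return nbytes * 8 / nweights

def size_gib(n_params: float, quant: str) -> float:
    """Storage in GiB for n_params weights stored in the given quant."""
    return n_params * bits_per_weight(quant) / 8 / 2**30

HYPOTHETICAL_PARAMS = 1e9  # illustrative only
for name in BLOCKS:
    print(f"{name:8s} {bits_per_weight(name):7.4f} bpw  "
          f"{size_gib(HYPOTHETICAL_PARAMS, name):6.3f} GiB per 1e9 params")
```

Under these layouts, Q8_0 costs roughly four times as many bits per weight as IQ2_XXS. That is why reserving Q8_0 for the comparatively small non-expert tensors is affordable, while the routed experts dominate the file size.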

Usage

git clone https://github.com/antirez/ds4
cd ds4
# Fetch the imatrix-aligned q2 + MTP from this repo:
hf download jedisct1/DeepSeek-V4-Flash-imatrix-aligned \
    DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-aligned.gguf \
    DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf \
    --local-dir .
ln -sf DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-aligned.gguf ds4flash.gguf
make

./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

License

MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model's release terms.
