DeepSeek V4 Flash - GGUF for ds4 (imatrix, aligned)
These are quants for the DS4 inference engine, built with an importance matrix (imatrix) and with tensor offsets aligned for efficient mmap on Apple Silicon. They are a drop-in replacement for the files published in antirez/deepseek-v4-gguf, with two changes:
- The IQ2XXS quant was recomputed using an imatrix calibrated on chat-v2 traffic, which restores quality on tool-calling and instruction-following prompts that the original blind quant degraded.
- All tensor data offsets are page-aligned, which lets the runtime mmap the file directly without an extra copy.
The MTP file is reproduced here unchanged so a single repo holds everything download_model.sh expects.
Files
| File | Size | Routed experts (`ffn_{gate,up,down}_exps`) | Everything else |
|---|---|---|---|
| DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-aligned.gguf | 81 GiB | IQ2_XXS (gate, up) + Q2_K (down), imatrix-calibrated | Q8_0 attn proj / shared experts / output, F16 router + embed + indexer + compressor + HC, F32 norms / sinks / bias |
| DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf | 3.5 GiB | MTP / speculative-decoding support (optional, not standalone) | |
| DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-aligned.dat | 430 MiB | Raw imatrix used to produce the IQ2XXS quant above (for reproducibility) | |
Use the IQ2XXS quant on 128 GB Mac machines. Pair it with the MTP file if you want speculative decoding; the MTP file is optional.
What the imatrix changes
IQ2_XXS is a 2.0625-bit-per-weight quant. At that budget the per-tensor scales matter a lot, and a blind quant tends to underweight the rows that carry tool-call tokens and rarely-routed experts. Calibrating against an imatrix gathered from the chat-v2 corpus shifts the scales toward those rows, which in practice recovers most of the regression seen on:
- function/tool-call emission (well-formed JSON, correct argument names),
- long-context instruction following,
- code generation in the languages most represented in the calibration set.
The imatrix itself is shipped as DeepSeek-V4-Flash-chat-v2-routed-moe-ds4-aligned.dat so the quant is fully reproducible.
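As a toy illustration of why calibration matters at such a low bit budget, the sketch below picks one scale for a handful of weights by minimizing an importance-weighted squared error over a three-level grid. This is not the IQ2_XXS algorithm, which uses codebooks and per-block scales; the function name `best_scale`, the grid, and the sample numbers are all assumptions made for illustration.

```python
def best_scale(weights, importance, levels=(-1, 0, 1), candidates=200):
    """Return the scale minimizing sum(importance * (w - scale * q)^2).

    Toy model only: each weight is rounded to the nearest grid level q,
    and the scale is found by brute-force search over `candidates` values.
    """
    max_abs = max(abs(w) for w in weights)
    best = (float("inf"), 0.0)  # (error, scale)
    for i in range(1, candidates + 1):
        scale = max_abs * i / candidates
        err = 0.0
        for w, imp in zip(weights, importance):
            q = min(levels, key=lambda level: abs(w - scale * level))
            err += imp * (w - scale * q) ** 2
        best = min(best, (err, scale))
    return best[1]


# Hypothetical data: one large outlier and four small weights.
weights = [0.9, 0.1, -0.12, 0.08, -0.1]

# "Blind": every weight counts equally, so the outlier dominates the error.
blind = best_scale(weights, [1.0] * len(weights))

# "Calibrated": pretend the calibration data says the small weights matter most.
calibrated = best_scale(weights, [0.01, 1.0, 1.0, 1.0, 1.0])

print(f"blind scale:      {blind:.4f}")
print(f"calibrated scale: {calibrated:.4f}")
```

With uniform importance the search settles on a scale that fits the outlier and rounds the small weights to zero. Once the small weights are marked as important, the chosen scale drops to roughly their magnitude, which is the kind of shift an imatrix is meant to produce.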
Alignment
GGUF files store tensor data after a metadata header. By default the header length is whatever it happens to be, so tensor offsets are not aligned to a page boundary. On Apple Silicon, mmap-ing a misaligned tensor forces an extra copy into an aligned buffer before Metal can use it. The files here are padded so every tensor sits on a 4096-byte boundary, which lets DS4 hand the mapped pointers straight to Metal.
The alignment is transparent to GGUF readers: any loader that respects the general.alignment field will work unchanged.
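The padding arithmetic is simple enough to sketch. The helpers below are hypothetical names, not part of DS4 or any GGUF tooling, and use the 4096-byte boundary mentioned above; they round an offset up to the next boundary and check that a list of offsets is aligned.

```python
def align_up(offset: int, alignment: int = 4096) -> int:
    """Round `offset` up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment


def padding_needed(offset: int, alignment: int = 4096) -> int:
    """Number of pad bytes to insert so the next tensor starts aligned."""
    return align_up(offset, alignment) - offset


def all_aligned(offsets, alignment: int = 4096) -> bool:
    """True if every offset sits on an `alignment`-byte boundary."""
    return all(off % alignment == 0 for off in offsets)


# Hypothetical example: the metadata header ends at an arbitrary byte.
header_end = 5_000_123
data_start = align_up(header_end)
print(data_start, padding_needed(header_end))
```

A writer would apply `align_up` before emitting each tensor; a reader can use a check like `all_aligned` to decide whether the mapped pointers can be used without a copy.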
Quantization recipe
The filename is the spec. In detail:
| Tensor class | Quant | Notes |
|---|---|---|
| `blk.*.ffn_gate_exps`, `blk.*.ffn_up_exps` | IQ2_XXS | routed-expert up/gate, imatrix-calibrated |
| `blk.*.ffn_down_exps` | Q2_K | routed-expert down (K-quant for quality) |
| `blk.*.ffn_{gate,up,down}_shexp` | Q8_0 | shared experts |
| `blk.*.attn_q_a`, `attn_q_b`, `attn_kv`, `attn_output_a`, `attn_output_b` | Q8_0 | all attention projections (MLA + low-rank output) |
| `output.weight` | Q8_0 | output head |
| `token_embd.weight` | F16 | input embedding |
| `blk.*.ffn_gate_inp` (router) | F16 | learned router |
| `blk.*.exp_probs_b` (router bias), `blk.*.attn_sinks`, all `*_norm.weight` | F32 | |
| `blk.*.ffn_gate_tid2eid` | I32 | hash-routing tables (first 3 layers only) |
| `blk.*.attn_compressor_*`, `blk.*.indexer_*`, `blk.*.hc_*`, `blk.*.output_hc_*` | F16 / F32 | DSv4-specific auxiliary blocks |
The motivation behind the asymmetry: the routed experts account for the majority of the parameter count, but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of the router, projections, or shared experts. Keeping the decision-making components at Q8_0 or higher preserves model behavior; crushing the experts buys the size.
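To put rough numbers on that trade-off, the sketch below computes bits per weight from block sizes. The (bytes, weights) pairs are assumptions recalled from ggml's block layouts rather than values stated in this repo; only the 2.0625 figure for IQ2_XXS appears in the text above. The 1,048,576-weight tensor is a hypothetical example, not a real tensor shape from the model.

```python
# (bytes per block, weights per block); assumed from ggml's block structs.
BLOCKS = {
    "IQ2_XXS": (66, 256),
    "Q2_K": (84, 256),
    "Q8_0": (34, 32),
    "F16": (2, 1),
}


def bits_per_weight(quant: str) -> float:
    """Average storage cost of one weight, in bits."""
    nbytes, nweights = BLOCKS[quant]
    return nbytes * 8 / nweights


def tensor_bytes(n_weights: int, quant: str) -> int:
    """Storage for a tensor whose size is a whole number of blocks."""
    nbytes, per_block = BLOCKS[quant]
    if n_weights % per_block != 0:
        raise ValueError("tensor size must be a multiple of the block size")
    return n_weights // per_block * nbytes


for name in BLOCKS:
    print(f"{name:8s} {bits_per_weight(name):7.4f} bpw")

# Hypothetical 1024 x 1024 tensor stored two ways.
n = 1024 * 1024
print(tensor_bytes(n, "IQ2_XXS"), tensor_bytes(n, "Q8_0"))
```

Under these assumptions a tensor stored as Q8_0 takes a little over four times the space of the same tensor in IQ2_XXS, which is why the size savings come almost entirely from the routed experts.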
Usage
git clone https://github.com/antirez/ds4
cd ds4
# Fetch the imatrix-aligned q2 + MTP from this repo:
hf download jedisct1/DeepSeek-V4-Flash-imatrix-aligned \
DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-aligned.gguf \
DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf \
--local-dir .
ln -sf DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix-aligned.gguf ds4flash.gguf
make
./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
License
MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model's release terms.
Base model: deepseek-ai/DeepSeek-V4-Flash