DSv4-Flash-MXFP4-native-flash

DeepSeek V4 Flash packaged for SSD-streamed inference on Apple Silicon with ds4-ssd. The routed experts are the native MXFP4 weights, bit-exact with the original DeepSeek release (no requantization). They are stored in a layer-major expert sidecar that ds4 pages from SSD through a Metal slot-bank cache, so the 156 GB model runs on machines that cannot hold it resident (validated on M3 Ultra 96 GB and M5 Max 128 GB).

Contents

| path | what it is | size |
|---|---|---|
| manifest.json | sidecar manifest (layout layer_major_expert, 43 layers, 256 experts) | |
| layer_000.bin … layer_042.bin | routed-expert records, native MXFP4 (e8m0 scale + 32×e2m1, ggml block_mxfp4 nibble order), bit-exact with the HF safetensors | 43 × 3.42 GB |
| dense/model-dense.gguf | dense/shared tensors (attention, shared expert, embeddings, head) as Q8_0/F16 GGUF | 8.8 GB |
| dense/flashmoe-package.json | package descriptor | |
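
As a quick sanity check, the sizes in the table add up to the roughly 156 GB quoted for the whole package. The sketch below only redoes that arithmetic from the published, rounded figures. The per-expert number assumes all 256 records in a layer file are the same size, which the table does not state.

```python
# Sizes as listed in the contents table (decimal GB, rounded as published).
NUM_LAYERS = 43
LAYER_FILE_GB = 3.42
DENSE_GGUF_GB = 8.8
EXPERTS_PER_LAYER = 256

expert_sidecar_gb = NUM_LAYERS * LAYER_FILE_GB
package_gb = expert_sidecar_gb + DENSE_GGUF_GB

# Average size of one routed-expert record, assuming equal-sized records.
expert_record_mb = LAYER_FILE_GB * 1000 / EXPERTS_PER_LAYER

print(f"expert sidecar: {expert_sidecar_gb:.1f} GB")
print(f"whole package:  {package_gb:.1f} GB")
print(f"per expert:     {expert_record_mb:.1f} MB")
```

The two small JSON files are ignored here because their sizes are not listed.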

Fidelity vs the original HF model: expert bytes are bit-exact; the only loss in the chain is the dense FP8→Q8_0 re-encode, at 0.55% relative RMS error (relRMS), which is ~4–5× below the FP8 grid's own step. Embeddings are exact. Graded QA spot-checks (GPQA/SuperGPQA via ds4-eval) score identically to the Q4_K reference package, and the OpenAI-API server smoke test passes.
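
To make the relRMS figure concrete, here is a minimal sketch of how such a number can be computed. It assumes relRMS means sqrt(sum of squared errors / sum of squared reference values), which this card does not define explicitly, and it uses a simplified Q8_0-style round trip (blocks of 32 values, one scale per block, int8 codes) on random data. It is not the converter from the repo and does not reproduce the 0.55% figure, which starts from the FP8 dense weights.

```python
import math
import random

QK8_0 = 32  # Q8_0 block length in ggml


def q8_0_roundtrip(values):
    """Quantize to Q8_0-style int8 blocks and dequantize again."""
    out = []
    for start in range(0, len(values), QK8_0):
        block = values[start:start + QK8_0]
        amax = max(abs(v) for v in block)
        d = amax / 127.0  # per-block scale (fp16 rounding of d omitted here)
        if d == 0.0:
            out.extend(0.0 for _ in block)
            continue
        out.extend(round(v / d) * d for v in block)
    return out


def rel_rms(reference, approx):
    """Relative RMS error of approx with respect to reference (assumed definition)."""
    num = sum((a - r) ** 2 for r, a in zip(reference, approx))
    den = sum(r ** 2 for r in reference)
    return math.sqrt(num / den)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(32 * 256)]  # synthetic data
error = rel_rms(weights, q8_0_roundtrip(weights))
print(f"relRMS = {100 * error:.2f}%")
```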

Quickstart (SSD streaming)

Requirements: Apple Silicon Mac with 96 GB+ unified memory, macOS, ~156 GB free on a fast SSD, Xcode command line tools, and the Hugging Face CLI.

# build the runtime
git clone https://github.com/Anemll/ds4-ssd.git
cd ds4-ssd
make

# download this package (~156 GB, resumable; rerun to resume)
./download_model.sh mxfp4

# run with SSD streaming
./ds4 -m models/DSv4-Flash-MXFP4-native-flash --ssd-cache auto -p "What is the Apple Neural Engine?"

Or point -m at any copy of the package directory:

./ds4 -m /path/to/DSv4-Flash-MXFP4-native-flash --ssd-cache 32GB --ctx 32768 -p "Hello"

ds4 auto-detects the sidecar from manifest.json + dense/model-dense.gguf; no extra flags are needed.

Sizing the expert cache

--ssd-cache sizes the resident routed-expert cache (auto, or an explicit budget like 32GB). Any size is safe. The bank is clamped at startup so prefill cannot overflow memory. On RAM-limited machines it also shrinks automatically after prefill, so that decode-miss reads stay served by the OS file cache at RAM speed; the wired bank would otherwise evict that cache and collapse decode throughput. Startup logs show what was resolved:

ds4: --ssd-cache 48GB resolved Flash-MoE slot bank: slots=89 gpu-bank=47.65 GiB
ds4: Flash-MoE shrinking decode slot bank after full prefill: layers=43 slots 89->31
ds4: prefill: ... generation: ...

Recommended starting points:

| machine | setting |
|---|---|
| 96 GB (M3 Ultra) | --ssd-cache auto (or 16GB..32GB) |
| 128 GB (M5 Max) | --ssd-cache auto (or 32GB..48GB) |
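
The slot counts in the startup log can be approximated from the file sizes in the contents table. The sketch below is an inference from those published numbers, not ds4's actual sizing logic: it assumes one slot holds one expert record in each of the 43 layers, that all records are the same size, and that the 48GB budget is interpreted in GiB. Under those assumptions it arrives at the same 89 slots as the log and a bank size close to the logged 47.65 GiB.

```python
GIB = 1 << 30
NUM_LAYERS = 43
EXPERTS_PER_LAYER = 256
LAYER_FILE_BYTES = 3.42e9  # rounded size from the contents table

# Assumption: one slot holds one expert record in every layer.
BYTES_PER_SLOT = NUM_LAYERS * LAYER_FILE_BYTES / EXPERTS_PER_LAYER


def estimate_slots(budget_gib):
    """Largest whole number of slots that fits in the budget."""
    return int(budget_gib * GIB // BYTES_PER_SLOT)


def bank_gib(slots):
    """Approximate resident bank size for a given slot count."""
    return slots * BYTES_PER_SLOT / GIB


slots = estimate_slots(48)
print(slots, round(bank_gib(slots), 2))
print(round(bank_gib(31), 2))  # size after the logged 89->31 shrink
```

By the same estimate, the shrunken 31-slot decode bank is about 16.6 GiB, which leaves the rest of memory to the OS file cache.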

Server and agent

An OpenAI-compatible local server and a coding-agent frontend ship in the same repo:

./ds4-server -m /path/to/DSv4-Flash-MXFP4-native-flash --ssd-cache auto --ctx 100000
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'
./ds4-agent -m /path/to/DSv4-Flash-MXFP4-native-flash --ssd-cache auto
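
The curl call above can also be made from Python with only the standard library. This is a minimal sketch: the helper names build_chat_request and chat are hypothetical, and the response parsing assumes the server returns the usual OpenAI-style choices[0].message.content shape, which this card implies but does not spell out.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # address used in the curl example


def build_chat_request(prompt, base_url=BASE_URL):
    """Build the same POST request as the curl example."""
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]})
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url=BASE_URL, timeout=600):
    """Send one user message and return the reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    # Assumption: OpenAI-style response layout.
    return reply["choices"][0]["message"]["content"]
```

With ds4-server running, chat("Hello") should return the model's reply as a string. The generous timeout is there because decode is SSD-bound on smaller machines.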

Measured throughput

| machine | 16k prefill (ANE i8i8) | decode @16k ctx | short-ctx decode |
|---|---|---|---|
| M5 Max 128 GB | ~316 t/s | ~4.6 t/s | ~10 t/s |

M3 Ultra 96 GB: ~3–4 t/s decode.

On the M5 Max this package matches or slightly beats the Q4_K reference (315.9 vs 312.9 t/s prefill, 4.59 vs 4.47 t/s decode @16k) while keeping the experts bit-exact. M3 Ultra numbers are with the automatic decode-bank shrink; decode there is bounded by SSD/file-cache miss IO, not by compute.
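
For scale, the M5 Max comparison against the Q4_K reference works out to gains of about 1% on prefill and under 3% on decode. The helper name percent_gain below is just for illustration; the inputs are the four figures quoted above.

```python
def percent_gain(new, reference):
    """Relative improvement of new over reference, in percent."""
    return 100.0 * (new / reference - 1.0)


prefill_gain = percent_gain(315.9, 312.9)  # t/s, 16k prefill
decode_gain = percent_gain(4.59, 4.47)     # t/s, decode @16k ctx
print(f"prefill: +{prefill_gain:.1f}%  decode @16k: +{decode_gain:.1f}%")
```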

Notes

  • The dense GGUF here is the converted Q8_0/F16 file that stock ds4 loads directly. The converter that produced it from the native FP8 dense export ships in the repo as fp4_samples/convert_native_dense_to_ds4.py.
  • MXFP4 expert records use the ggml split-half nibble order (qs[j] low nibble = element j, high = element j+16); values are identical to the HF sequential-pair packing, bytes are not.
  • MXFP4 qualifies for the same ANE i8i8 (W8A8) prefill backend as Q4_K, plus dedicated mul_mm_id / mul_mv_id Metal dequant kernels for GPU paths.
  • Tuning knobs are documented in docs/STREAMING_KNOBS.md.
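
The nibble-order note above can be illustrated with a small decoder. This is a sketch under stated assumptions, not code from ds4: it assumes a 17-byte block with the e8m0 scale byte first and 16 packed bytes after it, a scale of 2^(e - 127), the standard e2m1 value set, and that "sequential-pair" means byte i holds element 2i in the low nibble and element 2i+1 in the high nibble. The special NaN scale code (0xFF) is not handled.

```python
# Magnitudes of the 3 low bits of an e2m1 code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def e2m1(code):
    """Decode one 4-bit e2m1 code to a float."""
    mag = E2M1_MAGNITUDES[code & 0x7]
    return -mag if code & 0x8 else mag


def decode_block_split_half(block):
    """Decode a 17-byte block: qs[j] low nibble = element j, high = element j+16."""
    assert len(block) == 17
    scale = 2.0 ** (block[0] - 127)  # e8m0: power-of-two scale
    qs = block[1:]
    out = [0.0] * 32
    for j in range(16):
        out[j] = e2m1(qs[j] & 0xF) * scale
        out[j + 16] = e2m1(qs[j] >> 4) * scale
    return out


def sequential_to_split_half(qs_seq):
    """Repack 16 bytes from sequential-pair order (assumed) to split-half order."""
    codes = []
    for b in qs_seq:
        codes.append(b & 0xF)  # element 2i
        codes.append(b >> 4)   # element 2i + 1
    return bytes(codes[j] | (codes[j + 16] << 4) for j in range(16))
```

Repacking only moves 4-bit codes between bytes, which is why the decoded values stay identical while the stored bytes differ.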

License

MIT, following the upstream DeepSeek-V4-Flash release. The ds4-ssd runtime carries its own license in the repo.
