DSv4-Flash-MXFP4-native-flash
DeepSeek V4 Flash packaged for SSD-streamed inference on Apple Silicon with ds4-ssd. The routed experts are the native MXFP4 weights, bit-exact with the original DeepSeek release (no requantization). They are stored in a layer-major expert sidecar that ds4 pages from SSD through a Metal slot-bank cache, so the 156 GB model runs on machines that cannot hold it resident. Validated on M3 Ultra 96 GB and M5 Max 128 GB.
Contents
| path | what it is | size |
|---|---|---|
| `manifest.json` | sidecar manifest (layout `layer_major_expert`, 43 layers, 256 experts) | - |
| `layer_000.bin` … `layer_042.bin` | routed-expert records, native MXFP4 (e8m0 scale + 32×e2m1, ggml `block_mxfp4` nibble order), bit-exact with the HF safetensors | 43 × 3.42 GB |
| `dense/model-dense.gguf` | dense/shared tensors (attention, shared expert, embeddings, head) as Q8_0/F16 GGUF | 8.8 GB |
| `dense/flashmoe-package.json` | package descriptor | - |
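As a quick sanity check, the file sizes in the table above add up to the ~156 GB quoted for the whole package. The following is a minimal sketch of that arithmetic using only the numbers in the table. The variable names are illustrative, the two JSON files are treated as negligible, and the per-expert figure assumes that all 256 experts in a layer have equal-sized records.

```python
# Sizes copied from the Contents table; the JSON files are assumed negligible.
NUM_LAYERS = 43            # layer_000.bin through layer_042.bin
LAYER_FILE_GB = 3.42       # one routed-expert layer file
DENSE_GGUF_GB = 8.8        # dense/model-dense.gguf
EXPERTS_PER_LAYER = 256    # experts listed in the manifest description

experts_gb = NUM_LAYERS * LAYER_FILE_GB
total_gb = experts_gb + DENSE_GGUF_GB
# Assumption: equal-sized expert records within a layer file (decimal units).
per_expert_mb = LAYER_FILE_GB * 1000 / EXPERTS_PER_LAYER

print(f"routed experts: {experts_gb:.2f} GB")
print(f"whole package:  {total_gb:.2f} GB")
print(f"one expert record: about {per_expert_mb:.1f} MB")
```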
Fidelity vs the original HF model: expert bytes are bit-exact; the only loss in the chain is the dense FP8→Q8_0 re-encode, at 0.55% relative RMS error (relRMS), roughly 4–5× below the FP8 grid's own step. Embeddings are exact. Graded QA spot-checks (GPQA/SuperGPQA via ds4-eval) score identically to the Q4_K reference package, and the OpenAI-API server smoke test passes.
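To make the relRMS figure concrete, here is a minimal Python sketch of a Q8_0-style round trip followed by a relative RMS error measurement. It assumes relRMS means RMS(error) / RMS(reference), and it models Q8_0 as blocks of 32 values sharing one scale of max|x| / 127. It skips the FP16 rounding of the scale and does not model the FP8 source encoding, so it is not expected to reproduce the 0.55% figure. The function names are illustrative and are not part of ds4-ssd.

```python
import math
import random

BLOCK = 32  # Q8_0 groups 32 values under one shared scale


def q8_0_roundtrip(values):
    """Quantize to int8 per 32-value block, then dequantize again."""
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        # Simplification: the scale stays a Python float (no FP16 rounding).
        scale = amax / 127.0 if amax > 0 else 1.0
        for v in block:
            q = max(-127, min(127, round(v / scale)))
            out.append(q * scale)
    return out


def rel_rms(reference, approx):
    """RMS of the error divided by RMS of the reference (assumed definition)."""
    err = sum((a - r) ** 2 for r, a in zip(reference, approx))
    ref = sum(r ** 2 for r in reference)
    return math.sqrt(err / ref)


random.seed(0)
# Synthetic stand-in weights; real tensors will give a different number.
weights = [random.gauss(0.0, 0.02) for _ in range(32 * 256)]
error = rel_rms(weights, q8_0_roundtrip(weights))
print(f"relRMS after Q8_0 round trip: {error:.4%}")
```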
Quickstart (SSD streaming)
Requirements: Apple Silicon Mac with 96 GB+ unified memory, macOS, ~156 GB free on a fast SSD, Xcode command line tools, and the Hugging Face CLI.
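Before starting the download it can help to confirm that the target volume has room for the package. The sketch below is a hypothetical preflight helper, not something shipped with ds4-ssd. It only checks free space (in decimal gigabytes) and says nothing about SSD speed or unified memory.

```python
import shutil

REQUIRED_GB = 156  # approximate package size stated in the requirements


def has_room(free_bytes, required_gb=REQUIRED_GB):
    """Return True if free_bytes covers required_gb decimal gigabytes."""
    return free_bytes >= required_gb * 10**9


def check_target(path="."):
    """Report whether the volume holding `path` can take the download."""
    free = shutil.disk_usage(path).free
    print(f"{path}: {free / 10**9:.1f} GB free, need about {REQUIRED_GB} GB")
    return has_room(free)


# Check the current directory; pass the intended download directory instead.
check_target(".")
```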
# build the runtime
git clone https://github.com/Anemll/ds4-ssd.git
cd ds4-ssd
make
# download this package (~156 GB, resumable; rerun to resume)
./download_model.sh mxfp4
# run with SSD streaming
./ds4 -m models/DSv4-Flash-MXFP4-native-flash --ssd-cache auto -p "What is the Apple Neural Engine?"
Or point -m at any copy of the package directory:
./ds4 -m /path/to/DSv4-Flash-MXFP4-native-flash --ssd-cache 32GB --ctx 32768 -p "Hello"
ds4 auto-detects the sidecar from manifest.json + dense/model-dense.gguf;
no extra flags are needed.
Sizing the expert cache
--ssd-cache sizes the resident routed-expert cache (auto, or an explicit
budget like 32GB). Any size is safe: the bank is clamped at startup so
prefill cannot overflow memory, and on RAM-limited machines it automatically
shrinks after prefill so decode-miss reads stay served by the OS file cache at
RAM speed (the wired bank would otherwise evict that cache and collapse decode
throughput). Startup logs show what was resolved:
ds4: --ssd-cache 48GB resolved Flash-MoE slot bank: slots=89 gpu-bank=47.65 GiB
ds4: Flash-MoE shrinking decode slot bank after full prefill: layers=43 slots 89->31
ds4: prefill: ... generation: ...
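When comparing cache settings, it can be handy to pull the resolved values out of those startup lines. The following is a minimal sketch that assumes the log format matches the sample lines shown here exactly; the format may differ between ds4 versions, and the function name and dictionary keys are illustrative.

```python
import re

# Patterns derived from the sample log lines only (assumed stable format).
RESOLVED = re.compile(r"slots=(\d+)\s+gpu-bank=([\d.]+)\s+GiB")
SHRINK = re.compile(r"slots\s+(\d+)->(\d+)")


def summarize(log_text):
    """Extract slot counts and bank size from ds4 startup output."""
    info = {}
    for line in log_text.splitlines():
        resolved = RESOLVED.search(line)
        if resolved:
            info["slots"] = int(resolved.group(1))
            info["bank_gib"] = float(resolved.group(2))
        shrink = SHRINK.search(line)
        if shrink:
            info["decode_slots"] = int(shrink.group(2))
    if "slots" in info:
        # Average bank memory per slot, derived from the two logged values.
        info["gib_per_slot"] = info["bank_gib"] / info["slots"]
    return info


sample = """\
ds4: --ssd-cache 48GB resolved Flash-MoE slot bank: slots=89 gpu-bank=47.65 GiB
ds4: Flash-MoE shrinking decode slot bank after full prefill: layers=43 slots 89->31
"""
info = summarize(sample)
print(info)
```

With the sample lines, the 48GB request resolves to 89 slots, and the bank shrinks to 31 slots for decode.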
Recommended starting points:
| machine | setting |
|---|---|
| 96 GB (M3 Ultra) | --ssd-cache auto (or 16GB..32GB) |
| 128 GB (M5 Max) | --ssd-cache auto (or 32GB..48GB) |
Server and agent
An OpenAI-compatible local server and a coding-agent frontend ship in the same repo:
./ds4-server -m /path/to/DSv4-Flash-MXFP4-native-flash --ssd-cache auto --ctx 100000
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello"}]}'
./ds4-agent -m /path/to/DSv4-Flash-MXFP4-native-flash --ssd-cache auto
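The same request as the curl call can be sent from Python with the standard library. This is a minimal sketch that mirrors the URL and payload of the curl example. It assumes the server returns the usual OpenAI-style response shape (`choices[0].message.content`), which this page does not spell out, and the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # address used in the curl example


def build_chat_request(prompt, base_url=BASE_URL):
    """Build the POST request that the curl example sends."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=600):
    """Send the prompt and return the reply text.

    Assumes an OpenAI-style body: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Hello")
print(request.get_method(), request.full_url)
# With ds4-server running locally: print(chat("Hello"))
```

The generous default timeout is a deliberate choice, since a long prompt can spend a while in prefill before the first token arrives.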
Measured throughput
| machine | 16k prefill (ANE i8i8) | decode @16k ctx | short-ctx decode |
|---|---|---|---|
| M5 Max 128 GB | ~316 t/s | ~4.6 t/s | ~10 t/s |
| M3 Ultra 96 GB | — | — | ~3–4 t/s |
On the M5 Max this package matches or slightly beats the Q4_K reference (315.9 vs 312.9 t/s prefill, 4.59 vs 4.47 t/s decode @16k) while keeping the experts bit-exact. M3 Ultra numbers are with the automatic decode-bank shrink; decode there is bounded by SSD/file-cache miss IO, not by compute.
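These rates translate into wall-clock time fairly directly. The sketch below estimates the latency of one request from the M5 Max figures in the table. It assumes "16k" means 16,384 tokens and that the rates stay constant over the request, which is a simplification since decode speed depends on context length and cache state. The function name is illustrative.

```python
PREFILL_TPS = 316.0    # ~16k prefill on M5 Max 128 GB, from the table
DECODE_TPS_16K = 4.6   # decode at 16k context on M5 Max 128 GB, from the table


def estimate_seconds(prompt_tokens, output_tokens,
                     prefill_tps=PREFILL_TPS, decode_tps=DECODE_TPS_16K):
    """Rough latency: prefill time plus decode time at constant rates."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


prompt_tokens = 16 * 1024  # assumption: "16k" = 16,384 tokens
total = estimate_seconds(prompt_tokens, 500)
print(f"prefill only: {estimate_seconds(prompt_tokens, 0):.1f} s")
print(f"prefill + 500 output tokens: {total:.1f} s")
```

Under these assumptions the 500 generated tokens take about twice as long as the 16k-token prefill, so decode dominates for long answers.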
Notes
- The dense GGUF here is the converted Q8_0/F16 file that stock ds4 loads directly. The converter that produced it from the native FP8 dense export ships in the repo as `fp4_samples/convert_native_dense_to_ds4.py`.
- MXFP4 expert records use the ggml split-half nibble order (`qs[j]` low nibble = element j, high = element j+16); values are identical to the HF sequential-pair packing, bytes are not.
- MXFP4 qualifies for the same ANE i8i8 (W8A8) prefill backend as Q4_K, plus dedicated `mul_mm_id`/`mul_mv_id` Metal dequant kernels for GPU paths.
- Tuning knobs are documented in docs/STREAMING_KNOBS.md.
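The nibble-order note can be illustrated with a small decoder. The following is a minimal Python sketch of one MXFP4 block (one E8M0 scale byte plus 16 packed bytes for 32 E2M1 elements) in the split-half order described above, along with a repack from sequential pairs. It assumes the HF sequential-pair packing puts element 2i in the low nibble of byte i, which this page does not state, and it ignores the NaN encoding of the scale byte. The function names are illustrative and are not taken from ds4-ssd or ggml.

```python
# E2M1 magnitudes indexed by the low three bits of a 4-bit code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_ELEMS = 32


def e2m1_to_float(code):
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    magnitude = E2M1_VALUES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def e8m0_to_float(exp_byte):
    """Decode an E8M0 scale: a power of two with bias 127 (NaN case ignored)."""
    return 2.0 ** (exp_byte - 127)


def decode_split_half(scale_byte, qs):
    """Decode 16 packed bytes in split-half order into 32 floats."""
    scale = e8m0_to_float(scale_byte)
    out = [0.0] * BLOCK_ELEMS
    for j, byte in enumerate(qs):
        out[j] = e2m1_to_float(byte & 0x0F) * scale     # low nibble: element j
        out[j + 16] = e2m1_to_float(byte >> 4) * scale  # high nibble: element j+16
    return out


def sequential_to_split_half(packed):
    """Repack sequential pairs (assumed: byte i = elements 2i low, 2i+1 high)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return bytes(codes[j] | (codes[j + 16] << 4) for j in range(16))


# Round-trip check: same values, different bytes.
codes = [i % 16 for i in range(BLOCK_ELEMS)]
packed_seq = bytes(codes[2 * i] | (codes[2 * i + 1] << 4) for i in range(16))
qs = sequential_to_split_half(packed_seq)
decoded = decode_split_half(127, qs)  # scale byte 127 means a scale of 1.0
expected = [e2m1_to_float(c) for c in codes]
print(decoded == expected, qs != packed_seq)
```

The check at the end shows the point of the note: the decoded values match the original element order, while the packed bytes differ between the two layouts.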
License
MIT, following the upstream DeepSeek-V4-Flash release. The ds4-ssd runtime carries its own license in the repo.
Model tree for anemll/DSv4-Flash-MXFP4-native-flash
Base model
deepseek-ai/DeepSeek-V4-Flash