Instructions to use anemll/DSv4-Flash-MXFP4-native-flash with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use anemll/DSv4-Flash-MXFP4-native-flash with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="anemll/DSv4-Flash-MXFP4-native-flash",
	filename="dense/model-dense.gguf",
)

# messages must be a list of role/content dicts, not a bare string
llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "Hello"}
	]
)

Local Apps

llama.cpp

How to use anemll/DSv4-Flash-MXFP4-native-flash with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf anemll/DSv4-Flash-MXFP4-native-flash
# Run inference directly in the terminal:
llama cli -hf anemll/DSv4-Flash-MXFP4-native-flash

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf anemll/DSv4-Flash-MXFP4-native-flash
# Run inference directly in the terminal:
llama cli -hf anemll/DSv4-Flash-MXFP4-native-flash

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf anemll/DSv4-Flash-MXFP4-native-flash
# Run inference directly in the terminal:
./llama-cli -hf anemll/DSv4-Flash-MXFP4-native-flash

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf anemll/DSv4-Flash-MXFP4-native-flash
# Run inference directly in the terminal:
./build/bin/llama-cli -hf anemll/DSv4-Flash-MXFP4-native-flash

Use Docker

docker model run hf.co/anemll/DSv4-Flash-MXFP4-native-flash

Ollama
How to use anemll/DSv4-Flash-MXFP4-native-flash with Ollama:
```
ollama run hf.co/anemll/DSv4-Flash-MXFP4-native-flash
```

Unsloth Studio

How to use anemll/DSv4-Flash-MXFP4-native-flash with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for anemll/DSv4-Flash-MXFP4-native-flash to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for anemll/DSv4-Flash-MXFP4-native-flash to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for anemll/DSv4-Flash-MXFP4-native-flash to start chatting

Pi

How to use anemll/DSv4-Flash-MXFP4-native-flash with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf anemll/DSv4-Flash-MXFP4-native-flash

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "anemll/DSv4-Flash-MXFP4-native-flash"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use anemll/DSv4-Flash-MXFP4-native-flash with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf anemll/DSv4-Flash-MXFP4-native-flash

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default anemll/DSv4-Flash-MXFP4-native-flash

Run Hermes

hermes

Docker Model Runner
How to use anemll/DSv4-Flash-MXFP4-native-flash with Docker Model Runner:
```
docker model run hf.co/anemll/DSv4-Flash-MXFP4-native-flash
```

Lemonade

How to use anemll/DSv4-Flash-MXFP4-native-flash with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull anemll/DSv4-Flash-MXFP4-native-flash

Run and chat with the model

lemonade run user.DSv4-Flash-MXFP4-native-flash-{{QUANT_TAG}}

List all available models

lemonade list

DSv4-Flash-MXFP4-native-flash

DeepSeek V4 Flash packaged for SSD-streamed inference on Apple Silicon with ds4-ssd. The routed experts are the native MXFP4 weights, bit-exact with the original DeepSeek release (no requantization). They are stored in a layer-major expert sidecar that ds4 pages from SSD through a Metal slot-bank cache, so the 156 GB model runs on machines that cannot hold it resident (validated on M3 Ultra 96 GB and M5 Max 128 GB).

| Path | Contents | Size |
|---|---|---|
| `manifest.json` | sidecar manifest (layout `layer_major_expert`, 43 layers, 256 experts) | n/a |
| `layer_000.bin … layer_042.bin` | routed-expert records, native MXFP4 (e8m0 scale + 32×e2m1, ggml `block_mxfp4` nibble order), bit-exact with the HF safetensors | 43 × 3.42 GB |
| `dense/model-dense.gguf` | dense/shared tensors (attention, shared expert, embeddings, head) as Q8_0/F16 GGUF | 8.8 GB |
| `dense/flashmoe-package.json` | package descriptor | n/a |

Fidelity versus the original HF model: expert bytes are bit-exact. The only loss in the chain is the dense FP8→Q8_0 re-encode, at 0.55% relRMS (~4–5× below the FP8 grid's own step). Embeddings are exact. Graded QA spot-checks (GPQA/SuperGPQA via ds4-eval) score identically to the Q4_K reference package, and the OpenAI-API server smoke test passes.
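The sizes in the package table can be cross-checked with a little arithmetic. The sketch below assumes the block layout named there (one e8m0 scale byte plus 32 e2m1 elements packed two per byte) and uses the listed file sizes as given; the constant and function names are illustrative only and are not part of ds4.

```python
# Figures copied from the package table (decimal GB).
NUM_LAYERS = 43
LAYER_FILE_GB = 3.42
DENSE_GGUF_GB = 8.8

# One MXFP4 block: 1 byte of e8m0 scale + 32 e2m1 elements, two per byte.
ELEMENTS_PER_BLOCK = 32
BLOCK_BYTES = 1 + ELEMENTS_PER_BLOCK // 2


def bits_per_weight() -> float:
    """Effective storage cost of one routed-expert weight, scale included."""
    return BLOCK_BYTES * 8 / ELEMENTS_PER_BLOCK


def package_size_gb() -> float:
    """Expert sidecar files plus the dense GGUF."""
    return NUM_LAYERS * LAYER_FILE_GB + DENSE_GGUF_GB


if __name__ == "__main__":
    print(f"{BLOCK_BYTES} bytes per block, {bits_per_weight():.2f} bits per weight")
    print(f"package size: {package_size_gb():.1f} GB")
```

This gives 17 bytes per block (4.25 bits per weight) and a total that rounds to the 156 GB quoted for the package.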

Quickstart (SSD streaming)

Requirements: Apple Silicon Mac with 96 GB+ unified memory, macOS, ~156 GB free on a fast SSD, Xcode command line tools, and the Hugging Face CLI.

# build the runtime
git clone https://github.com/Anemll/ds4-ssd.git
cd ds4-ssd
make

# download this package (~156 GB, resumable; rerun to resume)
./download_model.sh mxfp4

# run with SSD streaming
./ds4 -m models/DSv4-Flash-MXFP4-native-flash --ssd-cache auto -p "What is the Apple Neural Engine?"

Or point -m at any copy of the package directory:

./ds4 -m /path/to/DSv4-Flash-MXFP4-native-flash --ssd-cache 32GB --ctx 32768 -p "Hello"

ds4 auto-detects the sidecar from manifest.json + dense/model-dense.gguf; no extra flags are needed.
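Before a first run it can be useful to confirm that the download finished. The following is a minimal sketch that only checks for the files listed in the package table; it assumes the `layer_NNN.bin` files sit at the package root next to `manifest.json`, and the helper names are hypothetical rather than part of ds4.

```python
from pathlib import Path

NUM_LAYERS = 43  # layer_000.bin … layer_042.bin, per the package table


def expected_files(num_layers: int = NUM_LAYERS) -> list[str]:
    """Relative paths the package table lists for a complete download."""
    files = [
        "manifest.json",
        "dense/model-dense.gguf",
        "dense/flashmoe-package.json",
    ]
    files += [f"layer_{i:03d}.bin" for i in range(num_layers)]
    return files


def missing_files(package_dir: str) -> list[str]:
    """Return the expected files that are absent under package_dir."""
    root = Path(package_dir)
    return [name for name in expected_files() if not (root / name).is_file()]
```

Calling `missing_files("models/DSv4-Flash-MXFP4-native-flash")` returns an empty list when every expected file is present. It does not verify file sizes or contents.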

Sizing the expert cache

--ssd-cache sizes the resident routed-expert cache (auto, or an explicit budget like 32GB). Any size is safe: the bank is clamped at startup so prefill cannot overflow memory. On RAM-limited machines the bank also shrinks automatically after prefill, so decode-miss reads are still served from the OS file cache at RAM speed; a larger wired bank would evict that cache and collapse decode throughput. Startup logs show what was resolved:

ds4: --ssd-cache 48GB resolved Flash-MoE slot bank: slots=89 gpu-bank=47.65 GiB
ds4: Flash-MoE shrinking decode slot bank after full prefill: layers=43 slots 89->31
ds4: prefill: ... generation: ...
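When comparing cache settings across runs, the resolved values can be pulled out of these log lines with a small script. This is a sketch based only on the sample lines above; the exact log format may differ between ds4 builds, and the function name is hypothetical.

```python
import re

# Patterns written against the sample log lines shown above (an assumption
# about the format, not a documented interface).
RESOLVED_RE = re.compile(r"slots=(\d+) gpu-bank=([\d.]+) GiB")
SHRINK_RE = re.compile(r"layers=(\d+) slots (\d+)->(\d+)")

SAMPLE_LOG = """\
ds4: --ssd-cache 48GB resolved Flash-MoE slot bank: slots=89 gpu-bank=47.65 GiB
ds4: Flash-MoE shrinking decode slot bank after full prefill: layers=43 slots 89->31
"""


def parse_slot_bank(log_text: str) -> dict:
    """Extract the resolved slot bank and the post-prefill shrink, if logged."""
    info = {}
    resolved = RESOLVED_RE.search(log_text)
    if resolved:
        info["slots"] = int(resolved.group(1))
        info["gpu_bank_gib"] = float(resolved.group(2))
    shrink = SHRINK_RE.search(log_text)
    if shrink:
        info["layers"] = int(shrink.group(1))
        info["decode_slots"] = int(shrink.group(3))
    return info


if __name__ == "__main__":
    print(parse_slot_bank(SAMPLE_LOG))
```

A missing shrink line simply leaves `decode_slots` out of the result, which matches machines where the bank is not reduced after prefill.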

Recommended starting points:

| Machine | Setting |
|---|---|
| 96 GB (M3 Ultra) | `--ssd-cache auto` (or an explicit budget between `16GB` and `32GB`) |
| 128 GB (M5 Max) | `--ssd-cache auto` (or an explicit budget between `32GB` and `48GB`) |

Server and agent

An OpenAI-compatible local server and a coding-agent frontend ship in the same repo:

./ds4-server -m /path/to/DSv4-Flash-MXFP4-native-flash --ssd-cache auto --ctx 100000

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'

./ds4-agent -m /path/to/DSv4-Flash-MXFP4-native-flash --ssd-cache auto
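The same request as the curl example can be sent from Python with only the standard library. This is a minimal sketch: it assumes ds4-server is listening on 127.0.0.1:8080 as in the curl call and that the reply follows the usual OpenAI chat-completions shape; the function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # same address as the curl example


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for a single-turn chat completion."""
    body = json.dumps(
        {"messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """Send the prompt and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        reply = json.load(resp)
    # Assumed response shape: choices[0].message.content
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` prints the model's reply. Building the request separately from sending it keeps the payload easy to inspect without a live server.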

Measured throughput

| Machine | 16k prefill (ANE i8i8) | Decode @ 16k ctx | Short-ctx decode |
|---|---|---|---|
| M5 Max 128 GB | ~316 t/s | ~4.6 t/s | ~10 t/s |
| M3 Ultra 96 GB | n/a | n/a | ~3–4 t/s |

On the M5 Max this package matches or slightly beats the Q4_K reference (315.9 vs 312.9 t/s prefill, 4.59 vs 4.47 t/s decode @16k) while keeping the experts bit-exact. M3 Ultra numbers are with the automatic decode-bank shrink; decode there is bounded by SSD/file-cache miss IO, not by compute.
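To put the M5 Max figures in perspective, the relative gain over the Q4_K reference and the wall-clock cost of a prompt can be worked out directly. The sketch below uses only the numbers quoted in the text; treating "16k" as 16384 tokens is an assumption.

```python
# Figures quoted in the text for the M5 Max 128 GB (tokens per second).
MXFP4 = {"prefill_tps": 315.9, "decode_tps": 4.59}
Q4_K = {"prefill_tps": 312.9, "decode_tps": 4.47}

PROMPT_TOKENS = 16384  # assumption: "16k" taken as 16 * 1024 tokens


def relative_gain_pct(new: float, ref: float) -> float:
    """Percentage by which `new` exceeds `ref`."""
    return (new / ref - 1.0) * 100.0


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to process `tokens` at a steady rate."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    for key in ("prefill_tps", "decode_tps"):
        print(f"{key}: {relative_gain_pct(MXFP4[key], Q4_K[key]):+.2f}% vs Q4_K")
    print(f"16k prefill: {seconds_for(PROMPT_TOKENS, MXFP4['prefill_tps']):.1f} s")
```

Under these assumptions the gain is roughly 1% for prefill and under 3% for decode, and a 16k-token prefill takes a little under a minute.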

Notes

The dense GGUF here is the converted Q8_0/F16 file that stock ds4 loads directly. The converter that produced it from the native FP8 dense export ships in the repo as fp4_samples/convert_native_dense_to_ds4.py.
MXFP4 expert records use the ggml split-half nibble order (qs[j] low nibble = element j, high = element j+16); values are identical to the HF sequential-pair packing, bytes are not.
MXFP4 qualifies for the same ANE i8i8 (W8A8) prefill backend as Q4_K, plus dedicated mul_mm_id / mul_mv_id Metal dequant kernels for GPU paths.
Tuning knobs are documented in docs/STREAMING_KNOBS.md.
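The nibble-order note above can be illustrated with a small pure-Python sketch of one 32-element block. The split-half order follows the note. The sequential-pair order shown for the HF side (low nibble = even element, high nibble = the following odd element) is an assumption, as are all function names; e8m0 is decoded as 2^(e−127) and the NaN encoding (255) is not handled.

```python
# e2m1 magnitudes for the low 3 bits of a code; bit 3 is the sign.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    mag = E2M1_VALUES[code & 0x7]
    return -mag if code & 0x8 else mag


def unpack_split_half(qs: bytes) -> list[int]:
    """ggml order: qs[j] low nibble = element j, high nibble = element j+16."""
    codes = [0] * 32
    for j in range(16):
        codes[j] = qs[j] & 0x0F
        codes[j + 16] = qs[j] >> 4
    return codes


def pack_split_half(codes: list[int]) -> bytes:
    return bytes(codes[j] | (codes[j + 16] << 4) for j in range(16))


def unpack_sequential_pairs(packed: bytes) -> list[int]:
    """ASSUMED HF order: byte i low nibble = element 2i, high = element 2i+1."""
    codes = []
    for b in packed:
        codes += [b & 0x0F, b >> 4]
    return codes


def dequantize_block(scale_e8m0: int, qs: bytes) -> list[float]:
    """Dequantize one split-half block: value = e2m1 * 2**(scale - 127)."""
    scale = 2.0 ** (scale_e8m0 - 127)
    return [decode_e2m1(c) * scale for c in unpack_split_half(qs)]
```

Repacking with `pack_split_half(unpack_sequential_pairs(packed))` changes the bytes but leaves every 4-bit code, and therefore every dequantized value, unchanged, which is the sense in which the values are identical while the bytes are not.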

License

MIT, following the upstream DeepSeek-V4-Flash release. The ds4-ssd runtime carries its own license in the repo.

GGUF

Model size: 7B params

Architecture: deepseek4

Model tree for anemll/DSv4-Flash-MXFP4-native-flash

Base model

deepseek-ai/DeepSeek-V4-Flash
