Can I deploy it with SGLang on my 8×4090 Ubuntu server?

by marshal007 - opened 29 days ago

Discussion

marshal007

29 days ago

Can I deploy it with SGLang on my 8×4090 Ubuntu server?

xinhe

Intel org 29 days ago

Currently, only the Transformers usage described in the model card has been verified.
To use SGLang/vLLM, we need to make specific changes.

mtcl

28 days ago

> Currently, only the Transformers usage described in the model card has been verified.
> To use SGLang/vLLM, we need to make specific changes.

When can you make those changes, please? Do you have a timeline for it?

xinhe

Intel org 28 days ago

I noticed that the PR "feat: implement DeepSeek-V4 model" was merged into the vLLM repository 5 hours ago.
Hopefully, adding support for this checkpoint won't require too much additional effort. I think you could open an issue with vLLM to ask whether they plan to support the weight-only quantized (WOQ) version of DeepSeek-V4.

JC1DA

28 days ago

edited 28 days ago

Just tried the latest vLLM main branch and got this error on 4×A100:

(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] self.attn = DeepseekV4Attention(
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] File "vllm/model_executor/models/deepseek_v4.py", line 1006, in __init__
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] self.scale_fmt = config.quantization_config["scale_fmt"]
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] KeyError: 'scale_fmt'

jerryliujiawei

27 days ago

> Just tried the latest vLLM main branch and got this error on 4×A100:
>
> (EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] self.attn = DeepseekV4Attention(
> (EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^
> (EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] File "vllm/model_executor/models/deepseek_v4.py", line 1006, in __init__
> (EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] self.scale_fmt = config.quantization_config["scale_fmt"]
> (EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
> (EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] KeyError: 'scale_fmt'

I have the same issue.

donwulff

16 days ago

I made patches to get it running: https://github.com/Donwulff/vllm/commit/5c7bdd6c07ab5a87f1d121ecb801d8c1e16bbff2
It works on an H200 but requires about 148 GB plus a massive KV cache. YMMV on performance, depending on available tensor cores and so on; this is just "get it working", not optimized kernels.
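For the 8×4090 setup in the original question, a back-of-the-envelope budget may help. This is only a sketch: it assumes the roughly 148 GB figure above is the footprint before KV cache, that it shards evenly across GPUs, and that each RTX 4090 offers its nominal 24 GB. The function name is illustrative, and nothing here says whether the patched kernels actually run on that hardware.

```python
# Rough per-GPU memory budget; every figure is an approximation.
GPU_COUNT = 8          # GPUs in the original question
GPU_MEMORY_GB = 24     # nominal RTX 4090 VRAM
FOOTPRINT_GB = 148     # footprint reported above, before KV cache


def per_gpu_headroom_gb(gpu_count, gpu_memory_gb, footprint_gb):
    """Memory left on each GPU, assuming the footprint shards evenly."""
    return gpu_memory_gb - footprint_gb / gpu_count


headroom = per_gpu_headroom_gb(GPU_COUNT, GPU_MEMORY_GB, FOOTPRINT_GB)
print(f"{headroom:.1f} GB per GPU left for KV cache, activations, and overhead")
```

Under these assumptions the headroom is small, so the KV cache is the part to watch.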

1. KeyError: 'scale_fmt' at deepseek_v4.py:1006. Stub it: config.quantization_config.get("scale_fmt", "ue8m0") (or whatever matches your model card config).
2. KeyError: 'layers.N.ffn.gate.qweight'. GateLinear is constructed with quant_config=None and reads self.weight directly in forward, so even the right quant_config won't help. Fix: dequant W4A16→BF16 at load and stash into gate.weight. ~3.5 MB total.
3. KeyError: 'layers.N.attn.compressor.fused_wkv_wgate.qweight' (and again on attn.indexer.compressor.fused_wkv_wgate). DeepseekCompressor.fused_wkv_wgate is hardcoded unquantized; forward reads .weight.T directly. Same dequant-at-load pattern; one match on endswith("compressor.fused_wkv_wgate") covers both attn.compressor and indexer.compressor.
4. KeyError: 'layers.N.attn.indexer.weights_proj.qweight'. The ReplicatedLinear constructor passes quant_config=None. Forward is a normal layer(x) call, so just passing quant_config=quant_config is enough; no dequant-at-load needed.
5. AttributeError: 'ColumnParallelLinear' object has no attribute 'weight' on attn.wo_a at profile_run (i.e. after a clean load). This is the architectural one. The V4 attention forward at deepseek_v4_attention.py:336 reads wo_a.weight + wo_a.weight_scale_inv and feeds them to a custom FP8 einsum kernel (deepseek_v4_fp8_einsum). The AutoRound checkpoint quantized wo_a as W4A16 GPTQ, so there is no FP8 weight to read; the kernel is format-incompatible. Workaround: dequant W4A16→BF16 at load, attach as a dense wo_a.weight, and in forward guard the FP8 path with hasattr(self.wo_a, "weight_scale_inv") so it falls back to the existing reference BF16 inverse-RoPE+einsum path (rocm_inv_rope_einsum; misleadingly named, but it works on CUDA). Costs ~1–2 GB extra for the BF16 shadow weights and gives up the FP8 fast path on wo_a.
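To make "dequant at load" concrete, here is a minimal pure-Python sketch of the arithmetic. It assumes a GPTQ-style layout (eight 4-bit values per 32-bit word, lowest nibble first, a zero point of 8, one scale per group); the checkpoint's actual layout should be checked against its quantization_config. The helper names are hypothetical and are not vLLM or AutoRound APIs.

```python
def unpack_int4(word):
    """Split one 32-bit word into eight 4-bit values, lowest nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize_column(packed_words, scales, zero_point=8, group_size=8):
    """Dequantize one packed column: w = (q - zero_point) * scale_of_group."""
    values = []
    for word in packed_words:
        values.extend(unpack_int4(word))
    return [
        (q - zero_point) * scales[i // group_size]
        for i, q in enumerate(values)
    ]


# Toy input: one word holding the nibbles 0..7, one group with scale 0.5.
packed = sum(q << (4 * i) for i, q in enumerate(range(8)))
weights = dequantize_column([packed], scales=[0.5])
print(weights)
```

In a real patch the result would be cast to BF16 and attached as the layer's dense weight; the patches linked above are the reference for how that was actually done.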

Issues 1–4 come from vLLM hardcoding quant_config=None or reading .weight directly, layer by layer. They are fixable upstream by propagating quant_config and using __call__ consistently, or by adding a documented "hardcoded-unquantized" hook so quant configs can dequant-at-load systematically.
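One way such a hook could look is sketched below. It is an assumption for illustration only: the suffix tuple and both function names are hypothetical, built from the module names in the KeyErrors above, and nothing like this exists in vLLM as far as this thread shows.

```python
# Hypothetical registry of layers that stay unquantized in the model code
# and therefore need their checkpoint weights dequantized at load time.
DEQUANT_AT_LOAD_SUFFIXES = (
    "ffn.gate",
    "compressor.fused_wkv_wgate",
)


def needs_dequant_at_load(module_name):
    """True if the module reads a dense .weight and must be dequantized."""
    # str.endswith accepts a tuple, so one call covers every suffix.
    return module_name.endswith(DEQUANT_AT_LOAD_SUFFIXES)


def read_scale_fmt(quantization_config, default="ue8m0"):
    """Tolerate checkpoints whose quantization_config lacks scale_fmt."""
    return quantization_config.get("scale_fmt", default)


print(needs_dequant_at_load("layers.3.attn.indexer.compressor.fused_wkv_wgate"))
print(read_scale_fmt({"bits": 4}))
```

A single suffix match covers both attn.compressor and attn.indexer.compressor, while attn.indexer.weights_proj is left alone because passing quant_config through is enough there.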

Issue 5 is the real blocker. Proper W4A16 support for V4 needs either a W4A16 kernel for the wo_a einsum or a non-FP8 fallback in deepseek_v4_fp8_einsum's caller. Until that lands in vLLM (or SGLang), the model card's recommendation — use Transformers — is the only path that actually runs the checkpoint as intended. I have load working with the four patches above and the BF16 fallback, but haven't yet validated end-to-end inference quality.
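The guard described for issue 5 can be sketched with stand-in objects. The classes and the returned path labels are placeholders, not vLLM code; the point is only that dispatch keys off the presence of weight_scale_inv.

```python
class DenseWoA:
    """Stand-in for wo_a after dequant-at-load: only a dense weight."""

    def __init__(self, weight):
        self.weight = weight


class Fp8WoA(DenseWoA):
    """Stand-in for wo_a from an FP8 checkpoint: weight plus inverse scale."""

    def __init__(self, weight, weight_scale_inv):
        super().__init__(weight)
        self.weight_scale_inv = weight_scale_inv


def select_wo_a_path(wo_a):
    """Pick the FP8 kernel only when the FP8 scale tensor is present."""
    if hasattr(wo_a, "weight_scale_inv"):
        return "fp8_einsum"
    return "bf16_reference"


print(select_wo_a_path(DenseWoA(weight=[1.0])))
print(select_wo_a_path(Fp8WoA(weight=[1.0], weight_scale_inv=[0.5])))
```

With this shape, FP8 checkpoints keep their fast path and the dequantized W4A16 weights take the slower reference path.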

celsowm

3 days ago

Is there any PR on vLLM or SGLang for this model?
