Can I deploy it with SGLang on my 8×4090 Ubuntu server?
Currently, only the Transformers usage described in the model card has been verified.
To use SGLang or vLLM, we need to make specific changes.
When can you make those changes, please? Do you have a timeline for it?
I noticed that the PR "feat: implement DeepSeek-V4 model" was merged into the vLLM repository 5 hours ago.
Hopefully, adding support for this won't require too much additional effort. I think you could open an issue with vLLM to see if they have any plans to support the WOQ (weight-only quantized) version of DeepSeek-V4.
I just tried the latest vLLM main branch and got this error on 4×A100:
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] self.attn = DeepseekV4Attention(
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] File "vllm/model_executor/models/deepseek_v4.py", line 1006, in __init__
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] self.scale_fmt = config.quantization_config["scale_fmt"]
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] KeyError: 'scale_fmt'
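The traceback above comes from indexing quantization_config with a key that this checkpoint's config does not contain. A minimal sketch of a defensive read follows; the helper name read_scale_fmt and the sample config contents are hypothetical, and the fallback value "ue8m0" is an assumption that should be checked against the model's actual configuration.

```python
def read_scale_fmt(quantization_config, default="ue8m0"):
    """Return the configured scale format, or a fallback when the key is absent.

    The default value is an assumption, not something the checkpoint specifies.
    """
    return quantization_config.get("scale_fmt", default)


# Hypothetical config contents, for illustration only.
w4a16_config = {"bits": 4}
fp8_config = {"scale_fmt": "ue8m0", "bits": 8}

print(read_scale_fmt(w4a16_config))  # falls back instead of raising KeyError
print(read_scale_fmt(fp8_config))    # uses the value present in the config
```

This only silences the KeyError; it does not make the downstream kernels compatible with a W4A16 checkpoint.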
I have the same issue
Made patches to get it running at https://github.com/Donwulff/vllm/commit/5c7bdd6c07ab5a87f1d121ecb801d8c1e16bbff2
Works on H200, but requires about 148 GB plus a massive KV cache. Performance will vary depending on available tensor cores etc.; this is just a "get it working" patch, not optimized kernels.
1. KeyError: 'scale_fmt' at deepseek_v4.py:1006. Stub it: config.quantization_config.get("scale_fmt", "ue8m0") (or whatever matches your model card config).
2. KeyError: 'layers.N.ffn.gate.qweight'. GateLinear is constructed with quant_config=None and reads self.weight directly in forward, so even the right quant_config won't help. Fix: dequant W4A16→BF16 at load and stash into gate.weight. ~3.5 MB total.
3. KeyError: 'layers.N.attn.compressor.fused_wkv_wgate.qweight' (and again on attn.indexer.compressor.fused_wkv_wgate). DeepseekCompressor.fused_wkv_wgate is hardcoded unquantized; forward reads .weight.T directly. Same dequant-at-load pattern; one match on endswith("compressor.fused_wkv_wgate") covers both attn.compressor and indexer.compressor.
4. KeyError: 'layers.N.attn.indexer.weights_proj.qweight'. ReplicatedLinear constructor passes quant_config=None. Forward is a normal layer(x) call, so just passing quant_config=quant_config is enough; no dequant-at-load needed.
5. AttributeError: 'ColumnParallelLinear' object has no attribute 'weight' on attn.wo_a at profile_run (i.e. after a clean load). This is the architectural one. The V4 attention forward at deepseek_v4_attention.py:336 reads wo_a.weight + wo_a.weight_scale_inv and feeds them to a custom FP8 einsum kernel (deepseek_v4_fp8_einsum). The AutoRound checkpoint quantized wo_a as W4A16 GPTQ, so there is no FP8 weight to read; the kernel is format-incompatible. Workaround: dequant W4A16→BF16 at load, attach as a dense wo_a.weight, and in forward guard the FP8 path with hasattr(self.wo_a, "weight_scale_inv") so it falls back to the existing reference BF16 inverse-RoPE+einsum path (rocm_inv_rope_einsum; misleadingly named, but it works on CUDA). Costs ~1–2 GB extra for the BF16 shadow weights and gives up the FP8 fast path on wo_a.
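The dequant-at-load step mentioned in the list above can be sketched without any framework. This is an illustration only, under loudly stated assumptions: a GPTQ-style layout with eight 4-bit values packed into each 32-bit word along the input dimension (lowest nibble first), symmetric quantization with a fixed zero point of 8, and one scale per group per output column. The actual packing of this checkpoint may differ, and a real patch would operate on tensors and cast the result to BF16. The function names are hypothetical.

```python
def unpack_int4(word):
    """Split one 32-bit word into eight 4-bit values, lowest nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize_w4(qweight, scales, group_size, zero_point=8):
    """Dequantize packed 4-bit weights to floats (assumed layout, see above).

    qweight: [in_features // 8][out_features] packed integer words
    scales:  [in_features // group_size][out_features] floats
    returns: [in_features][out_features] floats
    """
    weight = []
    for packed_row in qweight:
        # One list of eight nibbles per output column.
        nibbles = [unpack_int4(word) for word in packed_row]
        for k in range(8):
            group = len(weight) // group_size  # scale group of this input row
            weight.append([
                (nib[k] - zero_point) * scales[group][col]
                for col, nib in enumerate(nibbles)
            ])
    return weight


# One packed word holding the values 0..7, one output column, scale 0.5.
dense = dequantize_w4([[0x76543210]], scales=[[0.5]], group_size=8)
print(dense)
```

The resulting dense matrix is what would be attached as the layer's .weight so that code paths reading .weight directly keep working.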
Issues 1–4 are vLLM hardcoding quant_config=None / direct .weight reads layer-by-layer. They are fixable upstream by propagating quant_config and using __call__ consistently, or by adding a documented "hardcoded-unquantized" hook so quant configs can dequant-at-load systematically.
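One way to picture such a "hardcoded-unquantized" hook is a name filter that decides, per checkpoint tensor, whether it should be dequantized at load. The sketch below is hypothetical: the function and constant names are not vLLM APIs, and the suffix list is taken only from the layer names quoted in this thread.

```python
# Module-name suffixes whose forward pass reads .weight directly
# (taken from the errors reported in this thread; not exhaustive).
DEQUANT_SUFFIXES = (
    "ffn.gate",
    "compressor.fused_wkv_wgate",  # matches attn.compressor and indexer.compressor
    "attn.wo_a",
)


def needs_dequant_at_load(param_name):
    """Return True if a '<module>.qweight' tensor should be expanded to a dense weight."""
    suffix = ".qweight"
    if not param_name.endswith(suffix):
        return False
    module_name = param_name[: -len(suffix)]
    return module_name.endswith(DEQUANT_SUFFIXES)


for name in (
    "layers.3.ffn.gate.qweight",
    "layers.3.attn.indexer.compressor.fused_wkv_wgate.qweight",
    "layers.3.attn.indexer.weights_proj.qweight",
):
    print(name, needs_dequant_at_load(name))
```

Note that attn.indexer.weights_proj is deliberately not in the list, since passing quant_config through is reported to be enough for that layer.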
Issue 5 is the real blocker. Proper W4A16 support for V4 needs either a W4A16 kernel for the wo_a einsum or a non-FP8 fallback in deepseek_v4_fp8_einsum's caller. Until that lands in vLLM (or SGLang), the model card's recommendation — use Transformers — is the only path that actually runs the checkpoint as intended. I have load working with the four patches above and the BF16 fallback, but haven't yet validated end-to-end inference quality.
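The hasattr guard described for wo_a amounts to choosing a code path based on which attributes the loaded layer exposes. The toy sketch below uses hypothetical stand-in classes and returns path labels instead of calling any kernel; it illustrates only the dispatch logic, not vLLM's actual attention code.

```python
class DenseProjection:
    """Stand-in for wo_a after dequant-at-load: only a dense weight is present."""

    def __init__(self, weight):
        self.weight = weight


class Fp8Projection(DenseProjection):
    """Stand-in for wo_a loaded from an FP8 checkpoint: weight plus inverse scale."""

    def __init__(self, weight, weight_scale_inv):
        super().__init__(weight)
        self.weight_scale_inv = weight_scale_inv


def select_wo_a_path(wo_a):
    """Pick the FP8 path only when the FP8 scale tensor actually exists."""
    if hasattr(wo_a, "weight_scale_inv"):
        return "fp8_einsum"
    return "bf16_reference"


print(select_wo_a_path(DenseProjection([[1.0]])))       # dequantized W4A16 layer
print(select_wo_a_path(Fp8Projection([[1.0]], [1.0])))  # native FP8 layer
```

A guard like this keeps FP8 checkpoints on their fast path while letting a dequantized W4A16 layer use the slower reference path.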
Is there any PR on vLLM or SGLang for this model?