Can I deploy it with SGLang on my 8*4090 Ubuntu server?

#1
by marshal007 - opened

Can I deploy it with SGLang on my 8*4090 Ubuntu server?

Intel org

Currently, only the Transformers usage described in the model card has been verified.
Running it with SGLang or vLLM would require specific changes on our side.

When can you make those changes, please? Do you have a timeline for it?

Intel org

I noticed that "feat: implement DeepSeek-V4 model" was merged into the vLLM repository 5 hours ago.
Hopefully, adding support for this won't require too much additional effort. I think you could open an issue with vLLM to see if they have any plans to support the WOQ version of DeepSeek-V4.

Just tried the latest vLLM main branch and got this error on 4xA100:

(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] self.attn = DeepseekV4Attention(
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] File "vllm/model_executor/models/deepseek_v4.py", line 1006, in __init__
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] self.scale_fmt = config.quantization_config["scale_fmt"]
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore pid=4291) ERROR 04-28 19:35:26 [core.py:1136] KeyError: 'scale_fmt'

I have the same issue

Made patches to get it running at https://github.com/Donwulff/vllm/commit/5c7bdd6c07ab5a87f1d121ecb801d8c1e16bbff2
Works on H200, but requires about 148GB plus a massive KV cache. YMMV on performance, depending on available tensor cores etc. This is just "get it working", not optimized kernels.

  1. KeyError: 'scale_fmt' at deepseek_v4.py:1006. Stub it: config.quantization_config.get("scale_fmt", "ue8m0") (or whatever matches your model card config).
  2. KeyError: 'layers.N.ffn.gate.qweight'. GateLinear is constructed with quant_config=None and reads self.weight directly in forward, so even the right quant_config won't help. Fix: dequant W4A16→BF16 at load and stash into gate.weight. ~3.5 MB total.
  3. KeyError: 'layers.N.attn.compressor.fused_wkv_wgate.qweight' (and again on attn.indexer.compressor.fused_wkv_wgate). DeepseekCompressor.fused_wkv_wgate is hardcoded unquantized; forward reads .weight.T directly. Same dequant-at-load pattern; one match on endswith("compressor.fused_wkv_wgate") covers both attn.compressor and indexer.compressor.
  4. KeyError: 'layers.N.attn.indexer.weights_proj.qweight'. ReplicatedLinear constructor passes quant_config=None. Forward is a normal layer(x) call, so just passing quant_config=quant_config is enough — no dequant-at-load needed.
  5. AttributeError: 'ColumnParallelLinear' object has no attribute 'weight' on attn.wo_a at profile_run (i.e. after a clean load). This is the architectural one. The V4 attention forward at deepseek_v4_attention.py:336 reads wo_a.weight + wo_a.weight_scale_inv and feeds them to a custom FP8 einsum kernel (deepseek_v4_fp8_einsum). The AutoRound checkpoint quantized wo_a as W4A16 GPTQ — there is no FP8 weight to read; the kernel is format-incompatible. Workaround: dequant W4A16→BF16 at load, attach as a dense wo_a.weight, and in forward guard the FP8 path with hasattr(self.wo_a, "weight_scale_inv") so it falls back to the existing reference BF16 inverse-RoPE+einsum path (rocm_inv_rope_einsum — misleadingly named, but it works on CUDA). Costs ~1–2 GB extra for the BF16 shadow weights and gives up the FP8 fast path on wo_a.
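
For anyone following along, the dequant-at-load step used in items 2, 3 and 5 can be sketched in plain Python. This is not the code from the patch. The packing layout (eight 4-bit values per 32-bit word, lowest nibble first), the fixed zero point of 8, the group size, and the names unpack_int4, pack_int4 and dequant_w4a16 are all assumptions for illustration; check the actual AutoRound export before relying on any of them.

```python
def unpack_int4(word):
    """Split one 32-bit word into eight 4-bit values, lowest nibble first.

    The nibble order is an assumption about the checkpoint layout.
    """
    word &= 0xFFFFFFFF  # treat signed int32 storage as raw unsigned bits
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def pack_int4(values):
    """Inverse of unpack_int4, only used here to build toy inputs."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xF) << (4 * i)
    return word


def dequant_w4a16(qweight, scales, zero_point=8, group_size=128):
    """Dequantize one column of packed 4-bit weights to floats.

    qweight: packed 32-bit words along the input dimension (assumed layout)
    scales:  one float per group of `group_size` input elements
    Returns 8 * len(qweight) floats: (q - zero_point) * scale_of_group.
    """
    out = []
    for word in qweight:
        for q in unpack_int4(word):
            group = len(out) // group_size  # group of the element being added
            out.append((q - zero_point) * scales[group])
    return out


if __name__ == "__main__":
    toy = [pack_int4([8, 9, 7, 0, 15, 8, 8, 8])]
    print(dequant_w4a16(toy, scales=[0.5], group_size=8))
```

In a real loader the result would be cast to BF16 and stored as the module's dense weight, which is where the extra memory mentioned above comes from.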

Issues 1–4 come from vLLM hardcoding quant_config=None or reading .weight directly, layer by layer. They are fixable upstream by propagating quant_config and calling layers through __call__ consistently, or by adding a documented "hardcoded-unquantized" hook so quant configs can dequant-at-load systematically.
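
A rough idea of what such a hook could look like, as a sketch only. Nothing here is an existing vLLM API: DEQUANT_AT_LOAD_SUFFIXES, needs_dequant_at_load and get_scale_fmt are hypothetical names, and the suffix list simply mirrors the modules named in items 2, 3 and 5. The "ue8m0" default is the stub value from item 1 and should match whatever the model card config specifies.

```python
# Modules reported in this thread as expecting a dense .weight.
# Illustrative list, not taken from vLLM.
DEQUANT_AT_LOAD_SUFFIXES = (
    "ffn.gate",
    "compressor.fused_wkv_wgate",  # covers attn.compressor and indexer.compressor
    "attn.wo_a",
)


def needs_dequant_at_load(param_name):
    """Return True if a packed tensor belongs to a module that reads .weight directly."""
    module, _, leaf = param_name.rpartition(".")
    return leaf == "qweight" and module.endswith(DEQUANT_AT_LOAD_SUFFIXES)


def get_scale_fmt(quantization_config, default="ue8m0"):
    """Read scale_fmt without raising KeyError when the checkpoint omits it."""
    return quantization_config.get("scale_fmt", default)


if __name__ == "__main__":
    for name in (
        "layers.0.ffn.gate.qweight",
        "layers.0.attn.indexer.compressor.fused_wkv_wgate.qweight",
        "layers.0.attn.indexer.weights_proj.qweight",
    ):
        print(name, needs_dequant_at_load(name))
```

weights_proj is deliberately left out of the list, since item 4 is handled by passing quant_config through rather than by dequantizing.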

Issue 5 is the real blocker. Proper W4A16 support for V4 needs either a W4A16 kernel for the wo_a einsum or a non-FP8 fallback in deepseek_v4_fp8_einsum's caller. Until that lands in vLLM (or SGLang), the model card's recommendation — use Transformers — is the only path that actually runs the checkpoint as intended. I have load working with the four patches above and the BF16 fallback, but haven't yet validated end-to-end inference quality.
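
The guard around the FP8 path is roughly the following shape. This is a simplified sketch, not the patched vLLM code: wo_a_forward and the two callables passed in are placeholders standing in for the FP8 einsum kernel and the BF16 reference path, and the layer object is a toy stand-in.

```python
from types import SimpleNamespace


def wo_a_forward(wo_a, x, fp8_einsum, bf16_einsum):
    """Pick the einsum path based on which tensors the loader attached.

    fp8_einsum and bf16_einsum are placeholder callables for this sketch.
    """
    if hasattr(wo_a, "weight_scale_inv"):
        # Native FP8 checkpoint: weight plus inverse scale are both present.
        return fp8_einsum(x, wo_a.weight, wo_a.weight_scale_inv)
    # Dequantized W4A16 checkpoint: only a dense weight exists.
    return bf16_einsum(x, wo_a.weight)


if __name__ == "__main__":
    fp8 = lambda x, w, s: ("fp8", x, w, s)
    bf16 = lambda x, w: ("bf16", x, w)

    fp8_layer = SimpleNamespace(weight="w_fp8", weight_scale_inv="s")
    dense_layer = SimpleNamespace(weight="w_bf16")

    print(wo_a_forward(fp8_layer, "x", fp8, bf16)[0])
    print(wo_a_forward(dense_layer, "x", fp8, bf16)[0])
```

Keying on the presence of weight_scale_inv keeps native FP8 checkpoints on the fast path while the dequantized weights take the fallback.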

Is there any PR on vLLM or SGLang for this model?
