Why Your NVFP4 Model Is Slower Than FP8 on the GB10 (NVIDIA Spark) — And How to Fix It
Hi, I wanted to share some findings from running your Qwen3-Coder-Next-NVFP4 model on the NVIDIA GB10 (NVIDIA
Spark) — an SM 12.1 Blackwell chip with 128 GB of unified memory but only ~221 GB/s memory bandwidth
(integrated GPU, not HBM like an H100/A100).
TL;DR: Your quantization is correct and well-done. The performance issue on GB10 is not a mistake in the
quantization itself — it's a consequence of which layers you put in the ignore list, which is a totally
reasonable choice for the data-center GPUs you targeted. But those same ignored layers become the single
largest bottleneck on GB10 due to its much lower memory bandwidth. The result is something counterintuitive:
your NVFP4 model runs at ~34 tok/sec while the official Qwen/Qwen3-Coder-Next-FP8 runs at ~43 tok/sec on this
hardware. NVFP4 should be the faster format, and it will be once the right layer is included.
The culprit: in_proj_qkvz is in your ignore list but not in the FP8 model's
Your NVFP4 ignore list excludes all linear_attn.* layers, which includes in_proj_qkvz:
ignore = [
"lm_head",
"re:.*mlp.gate$",
"re:.*mlp.shared_expert_gate$",
"re:.linear_attn.", # ← covers in_proj_qkvz, in_proj_ba, conv1d, out_proj
]
The official FP8 model (Qwen/Qwen3-Coder-Next-FP8) takes a more surgical approach — it excludes conv1d,
in_proj_ba, gates, and lm_head, but leaves in_proj_qkvz in the quantized set:
modules_to_not_convert = [
"lm_head",
"model.embed_tokens",
"re:.*linear_attn.conv1d",
"re:.*linear_attn.in_proj_ba",
"re:.*mlp.gate",
"re:.*mlp.shared_expert_gate",
# in_proj_qkvz is NOT listed — it gets quantized
]
That one difference — in_proj_qkvz quantized vs BF16 — is what explains the entire performance gap on GB10.
Why in_proj_qkvz hurts so much on GB10 specifically
On an H100 (3.35 TB/s HBM), 36 × BF16 in_proj_qkvz GEMMs at decode batch size 1 cost roughly ~0.9 ms total —
completely negligible. Nobody would notice.
On the GB10 (221 GB/s integrated), the same 36 GEMMs cost ~10.9 ms — because at M=1 these are 100%
memory-bandwidth-bound, and GB10 has ~15× less bandwidth than H100. This turns what is a rounding error on a
server GPU into the single largest component of the entire decode step.
Here's the full profiled breakdown (~29 ms per step at 34 tok/sec):
┌────────────────────────────────────────────────┬─────────┬───────────┐
│ Component │ Time │ % of step │
├────────────────────────────────────────────────┼─────────┼───────────┤
│ GDN in_proj_qkvz × 36 (BF16, from ignore list) │ 10.9 ms │ 37.6% │
├────────────────────────────────────────────────┼─────────┼───────────┤
│ MoE CUTLASS FP4 × 48 │ 7.1 ms │ 24.4% │
├────────────────────────────────────────────────┼─────────┼───────────┤
│ Dense FP4 GEMMs (QKV, O, shared expert) × 144 │ ~7.5 ms │ ~25.9% │
├────────────────────────────────────────────────┼─────────┼───────────┤
│ lm_head (BF16) │ 3.55 ms │ 12.2% │
├────────────────────────────────────────────────┼─────────┼───────────┤
│ GDN recurrent, attention, RMSNorm, routing │ ~2.4 ms │ ~8.3% │
└────────────────────────────────────────────────┴─────────┴───────────┘
The FP8 model quantizes in_proj_qkvz, so it doesn't pay this cost. That's the gap.
Why NVFP4 should beat FP8, but currently doesn't
NVFP4 gives 4× weight compression vs BF16; FP8 gives only 2×. At M=1 on a bandwidth-constrained GPU, that
should translate almost directly to throughput:
┌────────────────────────────────────────┬─────────────────────────┬───────────────────┐
│ Format │ in_proj_qkvz cost (×36) │ Projected tok/sec │
├────────────────────────────────────────┼─────────────────────────┼───────────────────┤
│ BF16 (current NVFP4 checkpoint) │ 10.9 ms │ ~34 tok/sec │
├────────────────────────────────────────┼─────────────────────────┼───────────────────┤
│ FP8 (official Qwen model) │ ~8.2 ms │ ~43 tok/sec │
├────────────────────────────────────────┼─────────────────────────┼───────────────────┤
│ NVFP4 (if in_proj_qkvz were quantized) │ ~3.0 ms │ ~52 tok/sec │
└────────────────────────────────────────┴─────────────────────────┴───────────────────┘
What a GB10-optimized re-quantization would look like
The fix is simply removing in_proj_qkvz from the ignore list and letting llmcompressor calibrate it like the
rest of the model. The measured weight SNR for in_proj_qkvz comes out to ~20.49 dB / cosine similarity ~0.9955
— identical to the calibrated FP4 layers already in your checkpoint. The weights follow the same distribution
(σ ≈ 0.02, max ≈ 0.4–0.6), so quantization error is no worse than what the rest of the model already runs at.
The precision caveat: in_proj_qkvz feeds the DeltaNet recurrent state update, so errors accumulate over
context. At ~0.9955 cosine similarity the directions are extremely well-preserved; risk is low for short
context (<4K tokens) and worth testing at very long context (>32K). This is presumably why you ignored it
originally, and the caution is reasonable — but the FP8 model makes the same trade and ships with it
quantized.
As a bonus, lm_head (BF16 in both your model and the FP8 model) could also be quantized to NVFP4 for an
additional 2.7 ms savings (3 tok/sec), since it's a pure projection with no recurrent risk.
Why the other ignored layers are fine to leave alone
conv1d, in_proj_ba, mlp.gate, and mlp.shared_expert_gate are all small matrices. The FP4 kernel has a fixed
dispatch overhead of ~78 µs/call regardless of matrix size — which exceeds the entire BF16 cost for these
layers. FP4 is actually slower for them everywhere, including GB10. The FP8 model correctly leaves them in
BF16 too.
One other thing worth flagging — scale_fmt
When loading in sglang on GB10 you'll see:
DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0.
This might cause accuracy degradation on Blackwell.
This is because the checkpoint's weight_scale tensors use float8_e4m3fn rather than ue8m0 (the unsigned
exponent-only FP8 format used by Blackwell's DeepGEMM kernel). Non-blocking and doesn't affect the
cutlass_moe_fp4 path that actually runs — but worth knowing about. Not fixable without re-quantizing.
Summary
Your quantization is accurate and well-calibrated — −1.63% MMLU-Pro for W4A4 FP4 is excellent. The GB10
performance gap vs FP8 comes down to one layer group: in_proj_qkvz is in your ignore list but not in the FP8
model's. On server GPUs that difference is invisible; on GB10's 221 GB/s bandwidth it costs 10.9 ms per step.
Removing in_proj_qkvz from the ignore list in a re-quantization should push throughput to ~52 tok/sec — well
past both the current 34 tok/sec and the FP8 model's 43 tok/sec.
Thanks for publishing this model — it was a solid starting point to work from.
Best regards,
Scott Glover
Thanks, I will have a lot at this.