Instructions to use GadflyII/Qwen3-Coder-Next-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use GadflyII/Qwen3-Coder-Next-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="GadflyII/Qwen3-Coder-Next-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GadflyII/Qwen3-Coder-Next-NVFP4")
model = AutoModelForCausalLM.from_pretrained("GadflyII/Qwen3-Coder-Next-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use GadflyII/Qwen3-Coder-Next-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "GadflyII/Qwen3-Coder-Next-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GadflyII/Qwen3-Coder-Next-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/GadflyII/Qwen3-Coder-Next-NVFP4

SGLang

How to use GadflyII/Qwen3-Coder-Next-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "GadflyII/Qwen3-Coder-Next-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GadflyII/Qwen3-Coder-Next-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "GadflyII/Qwen3-Coder-Next-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GadflyII/Qwen3-Coder-Next-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use GadflyII/Qwen3-Coder-Next-NVFP4 with Docker Model Runner:
```
docker model run hf.co/GadflyII/Qwen3-Coder-Next-NVFP4
```

Why Your NVFP4 Model Is Slower Than FP8 on the GB10 (NVIDIA Spark) — And How to Fix It

by scottgl - opened Feb 23

Discussion

scottgl

Feb 23

•

edited Feb 23

Hi, I wanted to share some findings from running your Qwen3-Coder-Next-NVFP4 model on the NVIDIA GB10 (NVIDIA
Spark) — an SM 12.1 Blackwell chip with 128 GB of unified memory but only ~221 GB/s memory bandwidth
(integrated GPU, not HBM like an H100/A100).

TL;DR: Your quantization is correct and well-done. The performance issue on GB10 is not a mistake in the
quantization itself — it's a consequence of which layers you put in the ignore list, which is a totally
reasonable choice for the data-center GPUs you targeted. But those same ignored layers become the single
largest bottleneck on GB10 due to its much lower memory bandwidth. The result is something counterintuitive:
your NVFP4 model runs at ~34 tok/sec while the official Qwen/Qwen3-Coder-Next-FP8 runs at ~43 tok/sec on this
hardware. NVFP4 should be the faster format, and it will be once the right layer is included.

The culprit: in_proj_qkvz is in your ignore list but not in the FP8 model's

Your NVFP4 ignore list excludes all linear_attn.* layers, which includes in_proj_qkvz:

ignore = [
"lm_head",
"re:.*mlp.gate$",
"re:.*mlp.shared_expert_gate$",
"re:.linear_attn.", # ← covers in_proj_qkvz, in_proj_ba, conv1d, out_proj
]

The official FP8 model (Qwen/Qwen3-Coder-Next-FP8) takes a more surgical approach — it excludes conv1d,
in_proj_ba, gates, and lm_head, but leaves in_proj_qkvz in the quantized set:

modules_to_not_convert = [
"lm_head",
"model.embed_tokens",
"re:.*linear_attn.conv1d",
"re:.*linear_attn.in_proj_ba",
"re:.*mlp.gate",
"re:.*mlp.shared_expert_gate",
# in_proj_qkvz is NOT listed — it gets quantized
]

That one difference — in_proj_qkvz quantized vs BF16 — is what explains the entire performance gap on GB10.

Why in_proj_qkvz hurts so much on GB10 specifically

On an H100 (3.35 TB/s HBM), 36 × BF16 in_proj_qkvz GEMMs at decode batch size 1 cost roughly ~0.9 ms total —
completely negligible. Nobody would notice.

On the GB10 (221 GB/s integrated), the same 36 GEMMs cost ~10.9 ms — because at M=1 these are 100%
memory-bandwidth-bound, and GB10 has ~15× less bandwidth than H100. This turns what is a rounding error on a
server GPU into the single largest component of the entire decode step.

Here's the full profiled breakdown (~29 ms per step at 34 tok/sec):

┌────────────────────────────────────────────────┬─────────┬───────────┐
│ Component │ Time │ % of step │
├────────────────────────────────────────────────┼─────────┼───────────┤
│ GDN in_proj_qkvz × 36 (BF16, from ignore list) │ 10.9 ms │ 37.6% │
├────────────────────────────────────────────────┼─────────┼───────────┤
│ MoE CUTLASS FP4 × 48 │ 7.1 ms │ 24.4% │
├────────────────────────────────────────────────┼─────────┼───────────┤
│ Dense FP4 GEMMs (QKV, O, shared expert) × 144 │ ~7.5 ms │ ~25.9% │
├────────────────────────────────────────────────┼─────────┼───────────┤
│ lm_head (BF16) │ 3.55 ms │ 12.2% │
├────────────────────────────────────────────────┼─────────┼───────────┤
│ GDN recurrent, attention, RMSNorm, routing │ ~2.4 ms │ ~8.3% │
└────────────────────────────────────────────────┴─────────┴───────────┘

The FP8 model quantizes in_proj_qkvz, so it doesn't pay this cost. That's the gap.

Why NVFP4 should beat FP8, but currently doesn't

NVFP4 gives 4× weight compression vs BF16; FP8 gives only 2×. At M=1 on a bandwidth-constrained GPU, that
should translate almost directly to throughput:

┌────────────────────────────────────────┬─────────────────────────┬───────────────────┐
│ Format │ in_proj_qkvz cost (×36) │ Projected tok/sec │
├────────────────────────────────────────┼─────────────────────────┼───────────────────┤
│ BF16 (current NVFP4 checkpoint) │ 10.9 ms │ ~34 tok/sec │
├────────────────────────────────────────┼─────────────────────────┼───────────────────┤
│ FP8 (official Qwen model) │ ~8.2 ms │ ~43 tok/sec │
├────────────────────────────────────────┼─────────────────────────┼───────────────────┤
│ NVFP4 (if in_proj_qkvz were quantized) │ ~3.0 ms │ ~52 tok/sec │
└────────────────────────────────────────┴─────────────────────────┴───────────────────┘

What a GB10-optimized re-quantization would look like

The fix is simply removing in_proj_qkvz from the ignore list and letting llmcompressor calibrate it like the
rest of the model. The measured weight SNR for in_proj_qkvz comes out to ~20.49 dB / cosine similarity ~0.9955
— identical to the calibrated FP4 layers already in your checkpoint. The weights follow the same distribution
(σ ≈ 0.02, max ≈ 0.4–0.6), so quantization error is no worse than what the rest of the model already runs at.

The precision caveat: in_proj_qkvz feeds the DeltaNet recurrent state update, so errors accumulate over
context. At ~0.9955 cosine similarity the directions are extremely well-preserved; risk is low for short
context (<4K tokens) and worth testing at very long context (>32K). This is presumably why you ignored it
originally, and the caution is reasonable — but the FP8 model makes the same trade and ships with it
quantized.

As a bonus, lm_head (BF16 in both your model and the FP8 model) could also be quantized to NVFP4 for an
additional ~~2.7 ms savings (~~3 tok/sec), since it's a pure projection with no recurrent risk.

Why the other ignored layers are fine to leave alone

conv1d, in_proj_ba, mlp.gate, and mlp.shared_expert_gate are all small matrices. The FP4 kernel has a fixed
dispatch overhead of ~78 µs/call regardless of matrix size — which exceeds the entire BF16 cost for these
layers. FP4 is actually slower for them everywhere, including GB10. The FP8 model correctly leaves them in
BF16 too.

One other thing worth flagging — scale_fmt

When loading in sglang on GB10 you'll see:

DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0.
This might cause accuracy degradation on Blackwell.

This is because the checkpoint's weight_scale tensors use float8_e4m3fn rather than ue8m0 (the unsigned
exponent-only FP8 format used by Blackwell's DeepGEMM kernel). Non-blocking and doesn't affect the
cutlass_moe_fp4 path that actually runs — but worth knowing about. Not fixable without re-quantizing.

Summary

Your quantization is accurate and well-calibrated — −1.63% MMLU-Pro for W4A4 FP4 is excellent. The GB10
performance gap vs FP8 comes down to one layer group: in_proj_qkvz is in your ignore list but not in the FP8
model's. On server GPUs that difference is invisible; on GB10's 221 GB/s bandwidth it costs 10.9 ms per step.
Removing in_proj_qkvz from the ignore list in a re-quantization should push throughput to ~52 tok/sec — well
past both the current 34 tok/sec and the FP8 model's 43 tok/sec.

Thanks for publishing this model — it was a solid starting point to work from.

Best regards,
Scott Glover

GadflyII

Owner Mar 4

Thanks, I will have a lot at this.

makedo

Mar 28

For what it's worth, this was a great writeup. Collegial, thoughtful and informative. 🤘

nagashik

Apr 4

scottgl very good comment. could you please send the command you start this model on you nvidia spark.

GadflyII

Owner Apr 15

So I did try this, and it corrupted the backbone pretty badly and accuracy tanked.

GadflyII changed discussion status to closed Apr 15

teason2024

Apr 16

to GadflyII point, I've tried gb10 optimized models for minimax-m2.7/same qwen3-coder-next and was not happy about quality, seems like quantization of these layers tanks it.

scottgl

Apr 20

•

edited Apr 20

So I did try this, and it corrupted the backbone pretty badly and accuracy tanked.

Unfortunately, in many of these models lm_head is the bottleneck, because of the limited memory bandwidth of the GB10 spark. What I've been doing, is post quantizing certain layers when the model loads:
https://github.com/scottgl9/sglang-spark-gb10-optimizations

Quantizing lm_head to FP8 seems to have minimal impact, so that would be my recommendation to get better inference speed on the Spark. I've also quantized other layers to FP8, and in certain cases to NVFP4, but I would be careful which layers you quantize to NVFP4. There are also quite a few patches you need for the GB10, because currently marlin is the best performing kernel with these patches.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment