
Qwen3-Coder-Next-NVFP4-GB10

NVFP4 quantization of Qwen/Qwen3-Coder-Next for NVIDIA DGX Spark (GB10).

Qwen3-Coder-Next is a 79.7B-parameter MoE coding model (512 experts, 10 active per token) with hybrid DeltaNet+attention architecture. This quantization uses a GB10-tuned ignore list that quantizes more aggressively than standard NVFP4 configurations.

Model Details

  • Base Model: Qwen/Qwen3-Coder-Next
  • Architecture: Qwen3NextForCausalLM (hybrid MoE: DeltaNet + attention)
  • Total Parameters: 79.7B
  • Active Parameters: ~3B per token (512 experts, 10 active)
  • Quantization: NVFP4 (4-bit floating point) via LLM Compressor
  • Format: compressed-tensors (safetensors), 10 shards
  • Size on Disk: 45.9 GB
  • Context Length: 262,144 tokens (262K)
  • License: Apache 2.0

Quantization Details

  • Method: Post-training quantization via LLM Compressor
  • Calibration Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
  • Calibration Samples: 64
  • Max Sequence Length: 2048 tokens
  • Environment: LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1

Ignore List (layers kept in BF16)

lm_head
model.embed_tokens
re:.*linear_attn.conv1d
re:.*linear_attn.in_proj_ba
re:.*mlp.gate$
re:.*mlp.shared_expert_gate$

Everything else, including in_proj_qkvz, is quantized to FP4. On GB10's 221 GB/s memory bandwidth, the bandwidth savings from quantizing these layers outweigh the FP4 kernel dispatch overhead.
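
The `re:` entries above are regexes (compressed-tensors treats the part after the `re:` prefix as a Python pattern matched against module names). A quick sketch of which modules they catch, using hypothetical module names in the Qwen3-Next layer-naming style:

```python
import re

# Exact module names kept in BF16.
IGNORED_EXACT = {"lm_head", "model.embed_tokens"}

# "re:" entries from the ignore list above.
IGNORED_PATTERNS = [
    r".*linear_attn.conv1d",
    r".*linear_attn.in_proj_ba",
    r".*mlp.gate$",
    r".*mlp.shared_expert_gate$",
]

def is_ignored(name: str) -> bool:
    """True if a module stays in BF16 under the ignore list above."""
    return name in IGNORED_EXACT or any(re.match(p, name) for p in IGNORED_PATTERNS)

# Hypothetical module names illustrating the split:
for name in [
    "model.layers.0.linear_attn.conv1d",        # BF16 (matched)
    "model.layers.0.linear_attn.in_proj_qkvz",  # FP4  (not matched)
    "model.layers.0.mlp.gate",                  # BF16 (router gate)
    "model.layers.0.mlp.experts.3.gate_proj",   # FP4  (expert weights)
]:
    print(f"{name}: {'BF16' if is_ignored(name) else 'FP4'}")
```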

Performance (Single NVIDIA DGX Spark, GB10, 128 GB)

Benchmarked with llama-benchy v0.3.3, 3 runs per config.

| PP   | TG  | Prefill (tok/s) | Decode (tok/s) | TTFT (ms) |
|------|-----|-----------------|----------------|-----------|
| 512  | 128 | 2,024           | 62.0           | 285       |
| 512  | 256 | 2,528           | 62.1           | 206       |
| 1024 | 128 | 3,261           | 60.6           | 319       |
| 1024 | 256 | 3,350           | 61.8           | 309       |
| 4096 | 128 | 3,987           | 61.1           | 1,031     |
| 4096 | 256 | 3,971           | 61.1           | 1,035     |
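
As a sanity check on the table, TTFT is approximately the prompt length divided by prefill throughput (ignoring scheduling overhead):

```python
# Estimate time-to-first-token from prompt length and prefill throughput.
def ttft_ms(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    return prompt_tokens / prefill_tok_per_s * 1000

print(round(ttft_ms(4096, 3987)))  # ~1027 ms, vs 1,031 ms measured
print(round(ttft_ms(1024, 3261)))  # ~314 ms, vs 319 ms measured
```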
| Metric                     | Value                       |
|----------------------------|-----------------------------|
| Model memory               | 42.7 GiB                    |
| KV cache                   | 61.7 GiB (1,346,432 tokens) |
| Concurrent sessions @ 262K | ~5                          |
| Concurrent sessions @ 65K  | ~20                         |
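
The concurrency figures are straightforward division, assuming each session uses its full context window:

```python
# KV-cache token budget reported by vLLM (fits in 61.7 GiB of FP8 KV cache),
# divided by the per-session context length.
kv_budget_tokens = 1_346_432

print(kv_budget_tokens // 262_144)  # 5  full-context (262K) sessions
print(kv_budget_tokens // 65_536)   # 20 sessions at 65K context
```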

The hybrid DeltaNet+attention architecture keeps decode speed roughly constant regardless of context length: DeltaNet layers maintain a fixed-size state rather than a growing KV cache.

Running on a Single DGX Spark

Docker image: avarok/dgx-vllm-nvfp4-kernel:v23 (vLLM 0.16.0-rc2, CUDA 13.0, SM 12.1)

Download the model:

huggingface-cli download saricles/Qwen3-Coder-Next-NVFP4-GB10 \
  --local-dir /opt/huggingface/models/Qwen3-Coder-Next-NVFP4-GB10

Launch:

docker run -d --name coder-next --gpus all --ipc=host --shm-size 32g \
  -v /opt/huggingface/models/Qwen3-Coder-Next-NVFP4-GB10:/models/Qwen3-Coder-Next-NVFP4-GB10 \
  -p 8000:8000 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e MODEL=/models/Qwen3-Coder-Next-NVFP4-GB10 \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=262144 \
  -e GPU_MEMORY_UTIL=0.90 \
  -e "VLLM_EXTRA_ARGS=--kv-cache-dtype fp8 --attention-backend flashinfer --enable-prefix-caching --enable-chunked-prefill --max-num-batched-tokens 8192 --max-num-seqs 64 --enable-auto-tool-choice --tool-call-parser qwen3_coder" \
  avarok/dgx-vllm-nvfp4-kernel:v23

Test it:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Coder-Next-NVFP4-GB10",
    "messages": [{"role": "user", "content": "Write a Python function to find the longest common subsequence"}],
    "temperature": 0.7,
    "max_tokens": 2048
  }'
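
The same request from Python, using only the standard library (the helper names here are illustrative, and the server from the docker command above is assumed to be listening on port 8000):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # vLLM server from above

def build_payload(prompt: str) -> dict:
    # Mirrors the curl example's request body.
    return {
        "model": "Qwen3-Coder-Next-NVFP4-GB10",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 2048,
    }

def ask(prompt: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Usage (requires the server to be up): `print(ask("Write a Python function to find the longest common subsequence"))`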

Notes

  • At 42.7 GiB of model weights and 0.90 GPU memory utilization, roughly 62 GiB is left for KV cache, enough for ~5 concurrent 262K sessions.
  • gpu_memory_utilization=0.93 works but leaves very little system headroom; 0.90 is safer.
  • Decode speed stays roughly constant across context lengths thanks to the hybrid DeltaNet architecture.
  • The Marlin backend is ~15% faster than the CUTLASS backend for this model's 512 experts.

Target Hardware

Quantized and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). Should work on other Blackwell GPUs with NVFP4 support.
