
gdubicki/Qwen3-Coder-Next-NVFP4-GB10 on DGX Spark (GB10)

Runs gdubicki/Qwen3-Coder-Next-NVFP4-GB10 (quantized by saricles) via vLLM with an OpenAI-compatible API endpoint. Tested on DGX Spark (GB10 Blackwell, SM12.1, 128 GB unified LPDDR5X).

Model overview

| Property | Value |
|---|---|
| Architecture | qwen3_next (hybrid DeltaNet linear attention + full attention + latent MoE) |
| Base model | Qwen/Qwen3-Coder-Next |
| Parameters | 80B total, 3B active per token (512 experts, 10 active + 1 shared) |
| Layers | 48 (repeating pattern of 3× DeltaNet linear → 1× full attention; 12 full-attention layers total) |
| Quantization | NVFP4 via llmcompressor + compressed-tensors; all 512 MoE experts calibrated |
| Kept in BF16 | lm_head, embed_tokens, linear_attn layers, mlp.gate, mlp.shared_expert_gate |
| KV cache | FP8 (only for the 12 full-attention layers) |
| Model size | ~45 GB (70% reduction from ~149 GB BF16) |
| Max context | 262,144 tokens (native; tested with FP8 KV cache by saricles) |
| Reasoning | Built-in chain-of-thought (`<think>` tags), ON by default |

Model features

| Feature | Support |
|---|---|
| Tool calling | ✅ Yes (`--tool-call-parser qwen3_coder`) |
| Reasoning / thinking mode | ✅ Yes (ON by default, toggleable via `enable_thinking`) |
| Languages | Multilingual (code-focused) |
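
Tool calling uses the standard OpenAI tools format. A quick smoke test against a running server; the get_weather function and its JSON schema below are made up purely for illustration:

# Tool-calling smoke test (hypothetical get_weather tool, for illustration only)
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the weather in Paris right now?"}],"tools":[{"type":"function","function":{"name":"get_weather","description":"Get the current weather for a city","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],"max_tokens":200}'

With the qwen3_coder parser enabled, the response should carry a tool_calls entry for get_weather rather than plain text.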

Performance

Measured on DGX Spark (GB10, SM12.1, 128 GB unified LPDDR5X):

| Metric | Value |
|---|---|
| Throughput | ~61 tok/s |
| Max context | 262,144 tokens |
| KV cache concurrency | 31.65× at 262K tokens (DeltaNet layers have no KV cache) |

Comparison across models on GB10:

| Model | Active params | tok/s |
|---|---|---|
| Gemma-4-31B-IT-NVFP4 (dense) | 31B | ~7 |
| Qwen3-32B-NVFP4 (dense) | 32.8B | ~11 |
| Nemotron-3-Super-120B-A12B-NVFP4 | 12B | ~16 |
| Qwen3-Coder-Next-NVFP4-GB10 | 3B | ~61 |
| Nemotron-3-Nano-30B-A3B-NVFP4 | 3B | ~61 |

Qwen3-Coder-Next is 80B total but only 3B active per token: it matches Nemotron-3-Nano's throughput while keeping the full 262K context.
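
Once the server is running (see Quick start below), the single-stream number can be roughly reproduced with a crude end-to-end measurement. It assumes jq and python3 are available and includes prompt processing time, so expect it to land somewhat below the steady-state ~61 tok/s:

# Crude tok/s check: one request, wall-clock time vs. completion_tokens reported by the server
start=$(date +%s.%N)
tokens=$(curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a quicksort in Python."}],"max_tokens":512}' \
  | jq '.usage.completion_tokens')
end=$(date +%s.%N)
python3 -c "print(f'{$tokens / ($end - $start):.1f} tok/s')"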

Quick start

# Required: the model is gated on Hugging Face (accept the license at gdubicki/Qwen3-Coder-Next-NVFP4-GB10 first):
export HF_TOKEN=hf_xxxx

bash start-qwen3-coder-next.sh

The script will (roughly sketched below):

  1. Stop and remove any existing qwen3-coder-next-vllm container
  2. Flush the system page cache (frees unified memory before vLLM starts)
  3. Start the container in detached mode
  4. Poll http://localhost:8000/health until the API is ready (~45 GB download on first run)
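
In rough shell terms (the exact commands live in start-qwen3-coder-next.sh; the page-cache flush shown here is the generic Linux way and may differ from the script):

# 1. Stop and remove any previous container
docker rm -f qwen3-coder-next-vllm 2>/dev/null || true
# 2. Flush the page cache to free unified memory before vLLM starts
sync && echo 1 | sudo tee /proc/sys/vm/drop_caches >/dev/null
# 3. Start the container in detached mode
#    (full command with all env vars and serve flags: docker-run.sh, or the sketch under "Key flags")
# 4. Poll the health endpoint until the API is ready
until curl -sf http://localhost:8000/health >/dev/null; do sleep 5; done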

Test the API

bash test-api.sh              # localhost
bash test-api.sh 192.168.x.x  # remote host
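
The endpoints can also be poked by hand (standard vLLM OpenAI-server routes; jq assumed):

curl -s http://localhost:8000/health                          # 200 once the engine is up
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'  # should print gdubicki/Qwen3-Coder-Next-NVFP4-GB10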

Reasoning (chain-of-thought)

Reasoning is ON by default. Toggle per request:

# Reasoning OFF
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":60,"chat_template_kwargs":{"enable_thinking":false}}'

# Reasoning ON (default)
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a binary search implementation in Python."}],"max_tokens":2000,"chat_template_kwargs":{"enable_thinking":true}}'

Cline configuration

  1. Open Cline settings (sidebar icon → gear icon)
  2. Fill in the fields:

| Field | Value |
|---|---|
| Provider | OpenAI Compatible |
| Base URL | `http://<spark-ip>:8000/v1` |
| Model ID | gdubicki/Qwen3-Coder-Next-NVFP4-GB10 |
| API Key | dummy (any non-empty string) |

Files

| File | Purpose |
|---|---|
| start-qwen3-coder-next.sh | Full launcher: stop, cache flush, docker run, health poll |
| docker-run.sh | Bare docker run command with comments, for reference |
| test-api.sh | curl smoke tests: health, model list, chat completion, reasoning, code generation |

How it works

The vllm/vllm-openai:cu130-nightly image includes native qwen3_next support.

Architecture: a hybrid of:

  • DeltaNet selective linear attention layers (36 of 48 layers): subquadratic in sequence length, no KV cache
  • Full attention layers (12 of 48): standard transformer attention with a KV cache
  • Latent MoE (80B total, 3B active per token, hence the same throughput profile as Nano)

Quantization is via compressed-tensors (llmcompressor) with LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1, so all 512 MoE experts are calibrated rather than just a sampled subset. vLLM auto-detects the quantization from quantization_config in config.json; no --quantization flag is needed.
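
To see what vLLM will pick up, the embedded quantization_config can be read straight from the Hub (the repo is gated, so the HF_TOKEN from Quick start is required; jq assumed):

curl -s -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/gdubicki/Qwen3-Coder-Next-NVFP4-GB10/resolve/main/config.json \
  | jq '.quantization_config'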

Key environment variables

| Variable | Reason |
|---|---|
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | SM12.1 (GB10) has no native CUTLASS FP4 kernel; Marlin is 15% faster for 512 experts |
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Forces the FP8 Marlin path on GB10 SM12.1 |
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | FlashInfer MoE FP4 path is not supported on GB10 SM12.1 |
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1 |

Key flags

| Flag | Reason |
|---|---|
| `--dtype auto` | BF16 for non-quantized layers (DeltaNet, router gates, lm_head) |
| `--kv-cache-dtype fp8` | FP8 KV cache; applies to the 12 full-attention layers only |
| `--gpu-memory-utilization 0.90` | 0.90 × 128 GB = 115 GB; covers ~43 GB weights + ~72 GB KV cache (0.93 is risky) |
| `--max-model-len 262144` | Full native context; tested by saricles with FP8 KV cache |
| `--max-num-seqs 64` | Max concurrent requests |
| `--max-num-batched-tokens 8192` | Prevents OOM on long contexts |
| `--attention-backend flashinfer` | Required for FP8 KV cache + chunked prefill on GB10 |
| `--enable-prefix-caching` | Reuses KV cache for repeated prompt prefixes (system prompts) |
| `--enable-chunked-prefill` | Reduces memory spikes during long-prompt processing |
| `--tool-call-parser qwen3_coder` | OpenAI-compatible tool calling |
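
Putting the environment variables and flags together, a hedged approximation of the command in docker-run.sh (the --ipc=host, port mapping, and HF cache mount are common vLLM-container conventions rather than confirmed details; treat docker-run.sh as authoritative):

docker run -d --name qwen3-coder-next-vllm \
  --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:cu130-nightly \
  --model gdubicki/Qwen3-Coder-Next-NVFP4-GB10 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --tool-call-parser qwen3_coder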

Requirements

  • Docker with nvidia-container-toolkit
  • Image: vllm/vllm-openai:cu130-nightly
  • HF token with access to gdubicki/Qwen3-Coder-Next-NVFP4-GB10 (gated; accept the license first)
  • Model weights cached locally (auto-downloaded on first run, 45 GB): `/.cache/huggingface/hub/models--saricles--Qwen3-Coder-Next-NVFP4-GB10/`