# gdubicki/Qwen3-Coder-Next-NVFP4-GB10 on DGX Spark (GB10)

Runs `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` (quantized by saricles) via vLLM with an OpenAI-compatible API endpoint.

Tested on DGX Spark (GB10 Blackwell, SM12.1, 128 GB unified LPDDR5X).
## Model overview

| Property | Value |
|---|---|
| Architecture | qwen3_next (hybrid DeltaNet linear attention + full attention + latent MoE) |
| Base model | Qwen/Qwen3-Coder-Next |
| Parameters | 80B total, 3B active per token (512 experts, 10 active + 1 shared) |
| Layers | 48 (repeating pattern of 3× DeltaNet linear → 1× full attention; 12 full-attention layers total) |
| Quantization | NVFP4 via llmcompressor + compressed-tensors; all 512 MoE experts calibrated |
| Kept in BF16 | lm_head, embed_tokens, linear_attn layers, mlp.gate, mlp.shared_expert_gate |
| KV cache | FP8 (only for the 12 full-attention layers) |
| Model size | ~45 GB (a ~70% reduction from ~149 GB in BF16) |
| Max context | 262,144 tokens (native; tested with FP8 KV cache by saricles) |
| Reasoning | Built-in chain-of-thought (`<think>` tags), ON by default |
## Model features

| Feature | Support |
|---|---|
| Tool calling | ✅ Yes (`--tool-call-parser qwen3_coder`; example below) |
| Reasoning / thinking mode | ✅ Yes (ON by default, toggleable via `enable_thinking`) |
| Languages | Multilingual (code-focused) |
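Tool calling uses the standard OpenAI `tools` schema. A minimal sketch, assuming the server is running on localhost as described below; the `get_weather` function here is made up purely for illustration:

```bash
# Hypothetical get_weather tool, for illustration only.
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gdubicki/Qwen3-Coder-Next-NVFP4-GB10",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```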
## Performance

Measured on DGX Spark (GB10, SM12.1, 128 GB unified LPDDR5X):

| Metric | Value |
|---|---|
| Throughput | ~61 tok/s |
| Max context | 262,144 tokens |
| KV cache concurrency | 31.65× at 262K tokens (DeltaNet layers have no KV cache) |
Comparison across models on GB10:
| Model | Active params | tok/s |
|---|---|---|
| Gemma-4-31B-IT-NVFP4 (dense) | 31B | ~7 |
| Qwen3-32B-NVFP4 (dense) | 32.8B | ~11 |
| Nemotron-3-Super-120B-A12B-NVFP4 | 12B | ~16 |
| Qwen3-Coder-Next-NVFP4-GB10 | 3B | ~61 |
| Nemotron-3-Nano-30B-A3B-NVFP4 | 3B | ~61 |
Qwen3-Coder-Next is 80B total but only 3B active, so it matches Nemotron-3-Nano's ~61 tok/s while serving the full 262K context.
## Quick start

```bash
# Required: the model is gated on Hugging Face
# (accept the license at gdubicki/Qwen3-Coder-Next-NVFP4-GB10 first).
export HF_TOKEN=hf_xxxx

bash start-qwen3-coder-next.sh
```
The script will (a simplified sketch follows the list):

- Stop and remove any existing `qwen3-coder-next-vllm` container
- Flush the system page cache (frees unified memory before vLLM starts)
- Start the container in detached mode
- Poll `http://localhost:8000/health` until the API is ready (~45 GB download on first run)
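For orientation, a simplified sketch of what such a launcher can look like; the actual `start-qwen3-coder-next.sh` in this repo is the authoritative version:

```bash
#!/usr/bin/env bash
# Simplified launcher sketch -- see start-qwen3-coder-next.sh for the real thing.
set -euo pipefail

NAME=qwen3-coder-next-vllm

# Stop and remove any existing container with this name.
docker rm -f "$NAME" 2>/dev/null || true

# Flush the page cache so vLLM starts with as much free unified memory as possible.
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null

# Start the container in detached mode (full command lives in docker-run.sh).
bash docker-run.sh

# Poll the health endpoint until the API is up (first run downloads ~45 GB).
until curl -sf http://localhost:8000/health >/dev/null; do
  sleep 10
done
echo "API ready at http://localhost:8000/v1"
```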
## Test the API

```bash
bash test-api.sh              # localhost
bash test-api.sh 192.168.x.x  # remote host
```
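If you prefer to poke the endpoints by hand, the script's basic checks boil down to something like this (a sketch; `test-api.sh` is authoritative, and `jq` is assumed to be installed):

```bash
curl -s http://localhost:8000/health           # returns 200 once the server is ready
curl -s http://localhost:8000/v1/models | jq . # should list the model ID
```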
## Reasoning (chain-of-thought)

Reasoning is ON by default. Toggle per request:

```bash
# Reasoning OFF
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":60,"chat_template_kwargs":{"enable_thinking":false}}'

# Reasoning ON (default)
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a binary search implementation in Python."}],"max_tokens":2000,"chat_template_kwargs":{"enable_thinking":true}}'
```
## Cline configuration

- Open Cline settings (sidebar icon → gear icon)
- Fill in the fields:

| Field | Value |
|---|---|
| Provider | OpenAI Compatible |
| Base URL | `http://<spark-ip>:8000/v1` |
| Model ID | `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` |
| API Key | `dummy` (any non-empty string) |
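Before pointing Cline at the Spark, it's worth confirming the endpoint is reachable from your workstation (substitute your Spark's IP; assumes `jq`):

```bash
curl -s http://<spark-ip>:8000/v1/models | jq -r '.data[].id'
# expected: gdubicki/Qwen3-Coder-Next-NVFP4-GB10
```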
## Files

| File | Purpose |
|---|---|
| `start-qwen3-coder-next.sh` | Full launcher: stop, cache flush, `docker run`, health poll |
| `docker-run.sh` | Bare `docker run` command with comments, for reference |
| `test-api.sh` | curl smoke tests: health, model list, chat completion, reasoning, code generation |
## How it works

The `vllm/vllm-openai:cu130-nightly` image includes native `qwen3_next` support.

The architecture is a hybrid of:

- DeltaNet selective linear attention layers (36 of 48): subquadratic in sequence length, no KV cache
- Full attention layers (12 of 48): standard transformer attention with KV cache
- Latent MoE (80B total, 3B active per token): the same throughput profile as Nemotron-3-Nano

Quantization is via compressed-tensors (llmcompressor) with `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`, so all 512 MoE experts are calibrated, not just sampled.

vLLM auto-detects quantization from `quantization_config` in `config.json`; no `--quantization` flag is needed.
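If you want to see exactly what vLLM will pick up, you can inspect that config once the weights are in the local cache (a sketch, assuming `jq` and the cache path listed under Requirements below):

```bash
# Print the quantization block vLLM auto-detects at startup.
jq '.quantization_config' \
  ~/.cache/huggingface/hub/models--saricles--Qwen3-Coder-Next-NVFP4-GB10/snapshots/*/config.json
```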
## Key environment variables

| Variable | Reason |
|---|---|
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | SM12.1 (GB10) has no native CUTLASS FP4 kernel; Marlin is 15% faster for 512 experts |
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Forces the FP8 Marlin path on GB10 SM12.1 |
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | The FlashInfer MoE FP4 path is not supported on GB10 SM12.1 |
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1 |
## Key flags

| Flag | Reason |
|---|---|
| `--dtype auto` | BF16 for non-quantized layers (DeltaNet, router gates, lm_head) |
| `--kv-cache-dtype fp8` | FP8 KV cache; applies to the 12 full-attention layers only |
| `--gpu-memory-utilization 0.90` | 0.90 × 128 GB = 115 GB; covers ~43 GB weights + ~72 GB KV cache (0.93 is risky) |
| `--max-model-len 262144` | Full native context; tested by saricles with FP8 KV cache |
| `--max-num-seqs 64` | Max concurrent requests |
| `--max-num-batched-tokens 8192` | Prevents OOM on long contexts |
| `--attention-backend flashinfer` | Required for FP8 KV cache + chunked prefill on GB10 |
| `--enable-prefix-caching` | Reuses KV cache for repeated prompt prefixes (system prompts) |
| `--enable-chunked-prefill` | Reduces memory spikes during long-prompt processing |
| `--tool-call-parser qwen3_coder` | OpenAI-compatible tool calling |
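Putting the two tables together, the launch command has roughly this shape. This is a sketch only; `docker-run.sh` in this repo is the authoritative version, and the Docker plumbing here (`--gpus all`, `--ipc=host`, the port and cache mounts) is assumed rather than taken from it:

```bash
# Sketch of the container launch; see docker-run.sh for the real command.
docker run -d --name qwen3-coder-next-vllm \
  --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  vllm/vllm-openai:cu130-nightly \
  --model gdubicki/Qwen3-Coder-Next-NVFP4-GB10 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --tool-call-parser qwen3_coder
```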
## Requirements

- Docker with `nvidia-container-toolkit`
- Image: `vllm/vllm-openai:cu130-nightly`
- HF token with access to `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` (gated; accept the license first)
- Model weights cached locally (auto-downloaded on first run, ~45 GB): `~/.cache/huggingface/hub/models--saricles--Qwen3-Coder-Next-NVFP4-GB10/`
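Optionally, you can warm the cache before the first container start by downloading the weights on the host. A sketch, assuming the `huggingface_hub` CLI is installed and `HF_TOKEN` is exported:

```bash
# Pre-download the gated weights into the local HF cache (~45 GB).
pip install -U huggingface_hub
huggingface-cli download gdubicki/Qwen3-Coder-Next-NVFP4-GB10
```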