# gdubicki/Qwen3-Coder-Next-NVFP4-GB10 on DGX Spark (GB10)

Runs `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` (quantized by saricles) via vLLM with an OpenAI-compatible API endpoint.

Tested on DGX Spark (GB10 Blackwell, SM12.1, 128 GB unified LPDDR5X).
## Model overview

| Property | Value |
|---|---|
| Architecture | qwen3_next (hybrid DeltaNet linear attention + full attention + latent MoE) |
| Base model | Qwen/Qwen3-Coder-Next |
| Parameters | 80B total, 3B active per token (512 experts, 10 active + 1 shared) |
| Layers | 48 (repeating pattern of 3× DeltaNet linear → 1× full attention; 12 full-attention layers total) |
| Quantization | NVFP4 via llmcompressor + compressed-tensors; all 512 MoE experts calibrated |
| Kept in BF16 | lm_head, embed_tokens, linear_attn layers, mlp.gate, mlp.shared_expert_gate |
| KV cache | FP8 (only for the 12 full-attention layers) |
| Model size | ~45 GB (a ~70% reduction from ~149 GB in BF16) |
| Max context | 262,144 tokens (native; tested with FP8 KV cache by saricles) |
| Reasoning | Built-in chain-of-thought (`<think>` tags), ON by default |
## Model features

| Feature | Support |
|---|---|
| Tool calling | ✅ Yes (`--tool-call-parser qwen3_coder`; example below) |
| Reasoning / thinking mode | ✅ Yes (ON by default, toggleable via `enable_thinking`) |
| Languages | Multilingual (code-focused) |
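Tool calling uses the standard OpenAI `tools` schema. A minimal sketch, assuming the server is running on localhost as described below; the `get_weather` function here is made up purely for illustration:

```bash
# Hypothetical get_weather tool, for illustration only.
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gdubicki/Qwen3-Coder-Next-NVFP4-GB10",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```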
## Performance

Measured on DGX Spark (GB10, SM12.1, 128 GB unified LPDDR5X):

| Metric | Value |
|---|---|
| Throughput | ~61 tok/s |
| Max context | 262,144 tokens |
| KV cache concurrency | 31.65× at 262K tokens (DeltaNet layers have no KV cache) |
Comparison across models on GB10:
| Model | Active params | tok/s |
|---|---|---|
| Gemma-4-31B-IT-NVFP4 (dense) | 31B | ~7 |
| Qwen3-32B-NVFP4 (dense) | 32.8B | ~11 |
| Nemotron-3-Super-120B-A12B-NVFP4 | 12B | ~16 |
| Qwen3-Coder-Next-NVFP4-GB10 | 3B | ~61 |
| Nemotron-3-Nano-30B-A3B-NVFP4 | 3B | ~61 |
Qwen3-Coder-Next is 80B total but only 3B active, so it matches Nemotron-3-Nano's ~61 tok/s while serving the full 262K context.
## Quick start

```bash
# Required: the model is gated on Hugging Face
# (accept the license at gdubicki/Qwen3-Coder-Next-NVFP4-GB10 first).
export HF_TOKEN=hf_xxxx

bash start-qwen3-coder-next.sh
```
The script will (a simplified sketch follows the list):

- Stop and remove any existing `qwen3-coder-next-vllm` container
- Flush the system page cache (frees unified memory before vLLM starts)
- Start the container in detached mode
- Poll `http://localhost:8000/health` until the API is ready (~45 GB download on first run)
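For orientation, a simplified sketch of what such a launcher can look like; the actual `start-qwen3-coder-next.sh` in this repo is the authoritative version:

```bash
#!/usr/bin/env bash
# Simplified launcher sketch -- see start-qwen3-coder-next.sh for the real thing.
set -euo pipefail

NAME=qwen3-coder-next-vllm

# Stop and remove any existing container with this name.
docker rm -f "$NAME" 2>/dev/null || true

# Flush the page cache so vLLM starts with as much free unified memory as possible.
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null

# Start the container in detached mode (full command lives in docker-run.sh).
bash docker-run.sh

# Poll the health endpoint until the API is up (first run downloads ~45 GB).
until curl -sf http://localhost:8000/health >/dev/null; do
  sleep 10
done
echo "API ready at http://localhost:8000/v1"
```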
## Test the API

```bash
bash test-api.sh              # localhost
bash test-api.sh 192.168.x.x  # remote host
```
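If you prefer to poke the endpoints by hand, the script's basic checks boil down to something like this (a sketch; `test-api.sh` is authoritative, and `jq` is assumed to be installed):

```bash
curl -s http://localhost:8000/health           # returns 200 once the server is ready
curl -s http://localhost:8000/v1/models | jq . # should list the model ID
```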
## Reasoning (chain-of-thought)

Reasoning is ON by default. Toggle per request:

```bash
# Reasoning OFF
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":60,"chat_template_kwargs":{"enable_thinking":false}}'

# Reasoning ON (default)
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a binary search implementation in Python."}],"max_tokens":2000,"chat_template_kwargs":{"enable_thinking":true}}'
```
## Cline configuration

- Open Cline settings (sidebar icon → gear icon)
- Fill in the fields:

| Field | Value |
|---|---|
| Provider | OpenAI Compatible |
| Base URL | `http://<spark-ip>:8000/v1` |
| Model ID | `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` |
| API Key | `dummy` (any non-empty string) |
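Before pointing Cline at the Spark, it's worth confirming the endpoint is reachable from your workstation (substitute your Spark's IP; assumes `jq`):

```bash
curl -s http://<spark-ip>:8000/v1/models | jq -r '.data[].id'
# expected: gdubicki/Qwen3-Coder-Next-NVFP4-GB10
```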
## Files

| File | Purpose |
|---|---|
| `start-qwen3-coder-next.sh` | Full launcher: stop, cache flush, `docker run`, health poll |
| `docker-run.sh` | Bare `docker run` command with comments, for reference |
| `test-api.sh` | curl smoke tests: health, model list, chat completion, reasoning, code generation |
## How it works

The `vllm/vllm-openai:cu130-nightly` image includes native `qwen3_next` support.

The architecture is a hybrid of:

- DeltaNet selective linear attention layers (36 of 48): subquadratic in sequence length, no KV cache
- Full attention layers (12 of 48): standard transformer attention with KV cache
- Latent MoE (80B total, 3B active per token): the same throughput profile as Nemotron-3-Nano

Quantization is via compressed-tensors (llmcompressor) with `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`, so all 512 MoE experts are calibrated, not just sampled.

vLLM auto-detects quantization from `quantization_config` in `config.json`; no `--quantization` flag is needed.
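If you want to see exactly what vLLM will pick up, you can inspect that config once the weights are in the local cache (a sketch, assuming `jq` and the cache path listed under Requirements below):

```bash
# Print the quantization block vLLM auto-detects at startup.
jq '.quantization_config' \
  ~/.cache/huggingface/hub/models--saricles--Qwen3-Coder-Next-NVFP4-GB10/snapshots/*/config.json
```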
## Key environment variables

| Variable | Reason |
|---|---|
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | SM12.1 (GB10) has no native CUTLASS FP4 kernel; Marlin is 15% faster for 512 experts |
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Forces the FP8 Marlin path on GB10 SM12.1 |
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | The FlashInfer MoE FP4 path is not supported on GB10 SM12.1 |
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1 |
## Key flags

| Flag | Reason |
|---|---|
| `--dtype auto` | BF16 for non-quantized layers (DeltaNet, router gates, lm_head) |
| `--kv-cache-dtype fp8` | FP8 KV cache; applies to the 12 full-attention layers only |
| `--gpu-memory-utilization 0.90` | 0.90 × 128 GB = 115 GB; covers ~43 GB weights + ~72 GB KV cache (0.93 is risky) |
| `--max-model-len 262144` | Full native context; tested by saricles with FP8 KV cache |
| `--max-num-seqs 64` | Max concurrent requests |
| `--max-num-batched-tokens 8192` | Prevents OOM on long contexts |
| `--attention-backend flashinfer` | Required for FP8 KV cache + chunked prefill on GB10 |
| `--enable-prefix-caching` | Reuses KV cache for repeated prompt prefixes (system prompts) |
| `--enable-chunked-prefill` | Reduces memory spikes during long-prompt processing |
| `--tool-call-parser qwen3_coder` | OpenAI-compatible tool calling |
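Putting the two tables together, the launch command has roughly this shape. This is a sketch only; `docker-run.sh` in this repo is the authoritative version, and the Docker plumbing here (`--gpus all`, `--ipc=host`, the port and cache mounts) is assumed rather than taken from it:

```bash
# Sketch of the container launch; see docker-run.sh for the real command.
docker run -d --name qwen3-coder-next-vllm \
  --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN="$HF_TOKEN" \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  vllm/vllm-openai:cu130-nightly \
  --model gdubicki/Qwen3-Coder-Next-NVFP4-GB10 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --tool-call-parser qwen3_coder
```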
## Requirements

- Docker with `nvidia-container-toolkit`
- Image: `vllm/vllm-openai:cu130-nightly`
- HF token with access to `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` (gated; accept the license first)
- Model weights cached locally (auto-downloaded on first run, ~45 GB): `~/.cache/huggingface/hub/models--saricles--Qwen3-Coder-Next-NVFP4-GB10/`
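Optionally, you can warm the cache before the first container start by downloading the weights on the host. A sketch, assuming the `huggingface_hub` CLI is installed and `HF_TOKEN` is exported:

```bash
# Pre-download the gated weights into the local HF cache (~45 GB).
pip install -U huggingface_hub
huggingface-cli download gdubicki/Qwen3-Coder-Next-NVFP4-GB10
```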