# gdubicki/Qwen3-Coder-Next-NVFP4-GB10 on DGX Spark (GB10)

Runs [`gdubicki/Qwen3-Coder-Next-NVFP4-GB10`](https://huggingface.co/gdubicki/Qwen3-Coder-Next-NVFP4-GB10) (quantized by [saricles](https://huggingface.co/saricles/Qwen3-Coder-Next-NVFP4-GB10)) via vLLM, exposing an OpenAI-compatible API endpoint.
Tested on DGX Spark (GB10 Blackwell, SM12.1, 128 GB unified LPDDR5X).

## Model overview

| Property | Value |
|----------|-------|
| Architecture | `qwen3_next` (Hybrid DeltaNet linear attention + full attention + latent MoE) |
| Base model | `Qwen/Qwen3-Coder-Next` |
| Parameters | 80B total, **3B active** per token (512 experts, 10 active + 1 shared) |
| Layers | 48 (repeating pattern: 3× DeltaNet linear → 1× full attention; 12 full-attention layers in total) |
| Quantization | NVFP4 via `llmcompressor` + `compressed-tensors`; all 512 MoE experts calibrated |
| Kept in BF16 | `lm_head`, `embed_tokens`, `linear_attn` layers, `mlp.gate`, `mlp.shared_expert_gate` |
| KV cache | FP8 (only for the 12 full-attention layers) |
| Model size | ~45 GB (70% reduction from ~149 GB BF16) |
| Max context | 262,144 tokens (native; tested with FP8 KV cache by saricles) |
| Reasoning | Built-in chain-of-thought (`<think>` tags), ON by default |

## Model features

| Feature | Support |
|---------|---------|
| Tool calling | ✅ Yes (`--tool-call-parser qwen3_coder`) |
| Reasoning / thinking mode | ✅ Yes (ON by default, toggleable via `enable_thinking`) |
| Languages | Multilingual (code-focused) |
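
As an illustration of the tool-calling support listed above, the following sketch builds an OpenAI-style request body with a `tools` array. The `get_weather` function and its schema are hypothetical placeholders, and the endpoint is assumed to be the default `http://localhost:8000`. Depending on the vLLM version, automatic tool choice may also need `--enable-auto-tool-choice` on the server; that is an assumption, not something this setup documents.

```bash
#!/usr/bin/env bash
# Sketch: OpenAI-style tool-calling request. get_weather is a hypothetical tool.
MODEL="gdubicki/Qwen3-Coder-Next-NVFP4-GB10"

# Print the JSON request body (unquoted heredoc so ${MODEL} is expanded).
build_tool_request() {
  cat <<EOF
{
  "model": "${MODEL}",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Hypothetical example tool: look up current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "max_tokens": 200
}
EOF
}

# Send the request; not called automatically because it needs a running server.
send_tool_request() {
  build_tool_request | curl -s -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" -d @-
}
```

Call `send_tool_request` once the server reports healthy. In the OpenAI response format, a tool call made by the model would appear under `choices[0].message.tool_calls`.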

## Performance

Measured on DGX Spark (GB10, SM12.1, 128 GB unified LPDDR5X):

| Metric | Value |
|--------|-------|
| Throughput | ~61 tok/s |
| Max context | 262,144 tokens |
| KV cache concurrency | **31.65×** at 262K tokens (DeltaNet layers have no KV cache) |

Comparison across models on GB10:

| Model | Active params | tok/s |
|-------|--------------|-------|
| Gemma-4-31B-IT-NVFP4 (dense) | 31B | ~7 |
| Qwen3-32B-NVFP4 (dense) | 32.8B | ~11 |
| Nemotron-3-Super-120B-A12B-NVFP4 | 12B | ~16 |
| **Qwen3-Coder-Next-NVFP4-GB10** | **3B** | **~61** |
| Nemotron-3-Nano-30B-A3B-NVFP4 | 3B | ~61 |

Qwen3-Coder-Next has 80B total parameters but only 3B active per token: the same throughput as Nemotron-3-Nano, with the full 262K context.

## Quick start

```bash
# Required: the model is gated on Hugging Face (accept the license at gdubicki/Qwen3-Coder-Next-NVFP4-GB10 first)
export HF_TOKEN=hf_xxxx

bash start-qwen3-coder-next.sh
```

The script will:
1. Stop and remove any existing `qwen3-coder-next-vllm` container
2. Flush the system page cache (frees unified memory before vLLM starts)
3. Start the container in detached mode
4. Poll `http://localhost:8000/health` until the API is ready (~45 GB download on first run)
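
The steps above could be sketched roughly as follows. This is not the contents of `start-qwen3-coder-next.sh`: the function names, the `drop_caches` mechanism, and the retry parameters are assumptions made for illustration, and step 3 (the `docker run` itself) is left out.

```bash
#!/usr/bin/env bash
# Sketch of the launcher flow; NOT the actual start-qwen3-coder-next.sh.
CONTAINER="qwen3-coder-next-vllm"
HEALTH_URL="http://localhost:8000/health"

# Step 1: stop and remove any existing container (ignore "no such container").
remove_old_container() {
  docker rm -f "$CONTAINER" >/dev/null 2>&1 || true
}

# Step 2: flush the page cache (assumed mechanism: Linux drop_caches; needs root).
flush_page_cache() {
  sync
  echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
}

# Step 4: poll a URL until it answers with a 2xx status, or give up.
# Args: URL, maximum attempts, seconds to sleep between attempts.
wait_for_health() {
  local url="$1" attempts="$2" delay="$3" i
  for ((i = 1; i <= attempts; i++)); do
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

A caller might chain these as `remove_old_container && flush_page_cache`, start the container, and then run `wait_for_health "$HEALTH_URL" 360 5`. The generous attempt count is a guess meant to cover the first-run download.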

## Test the API

```bash
bash test-api.sh                # localhost
bash test-api.sh 192.168.x.x    # remote host
```
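
`test-api.sh` takes an optional host argument. A minimal sketch of how such a script might resolve its base URL and run its first two checks is shown here; the helper names are hypothetical and this is not the actual script. The `/health` and `/v1/models` routes are the ones served by vLLM's OpenAI-compatible server.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the real test-api.sh.

# Print the API base URL for an optional host argument (default: localhost).
api_base() {
  local host="${1:-localhost}"
  printf 'http://%s:8000\n' "$host"
}

# Health check, then model list; returns non-zero on the first failure.
smoke_test() {
  local base
  base="$(api_base "$1")"
  curl -sf -o /dev/null "$base/health" || return 1
  curl -sf "$base/v1/models" || return 1
}
```

Running `smoke_test` with no argument targets `localhost`; passing an IP address targets a remote Spark.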

## Reasoning (chain-of-thought)

Reasoning is **ON by default**. Toggle it per request:

```bash
# Reasoning OFF
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":60,"chat_template_kwargs":{"enable_thinking":false}}'

# Reasoning ON (default)
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a binary search implementation in Python."}],"max_tokens":2000,"chat_template_kwargs":{"enable_thinking":true}}'
```

## Cline configuration

1. Open Cline settings (sidebar icon → gear icon)
2. Fill in the fields:

| Field | Value |
|-------|-------|
| Provider | OpenAI Compatible |
| Base URL | `http://<spark-ip>:8000/v1` |
| Model ID | `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` |
| API Key | `dummy` (any non-empty string) |

## Files

| File | Purpose |
|------|---------|
| `start-qwen3-coder-next.sh` | Full launcher: stop, cache flush, docker run, health poll |
| `docker-run.sh` | Bare `docker run` command with comments, for reference |
| `test-api.sh` | curl smoke tests: health, model list, chat completion, reasoning, code generation |

## How it works

The `vllm/vllm-openai:cu130-nightly` image includes native `qwen3_next` support.

**Architecture**: a hybrid of three components:
- **DeltaNet** selective linear attention layers (36 of 48 layers): subquadratic in sequence length, no KV cache
- **Full attention** layers (12 of 48): standard transformer attention with KV cache
- **Latent MoE** (80B total, 3B active per token; same throughput profile as Nemotron-3-Nano)
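
The layer counts above follow from the repeating pattern of three DeltaNet layers followed by one full-attention layer. A small sketch that checks the arithmetic, assuming the full-attention layer sits at every fourth position (1-based); the exact indices inside the real checkpoint are not confirmed here:

```bash
#!/usr/bin/env bash
# Sketch: classify layers under the assumed 3x linear -> 1x full repeating pattern.
NUM_LAYERS=48

# Print "full" for every 4th layer (1-based index), "linear" otherwise.
layer_kind() {
  if (( $1 % 4 == 0 )); then echo full; else echo linear; fi
}

linear=0
full=0
for ((i = 1; i <= NUM_LAYERS; i++)); do
  if [ "$(layer_kind "$i")" = full ]; then
    full=$((full + 1))
  else
    linear=$((linear + 1))
  fi
done
echo "linear=$linear full=$full"   # prints: linear=36 full=12
```

Only the 12 full-attention layers hold a KV cache, which is why the FP8 KV cache setting affects those layers alone.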

Quantization is via `compressed-tensors` (llmcompressor) with `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`, so all 512 MoE experts are calibrated, not just sampled.
vLLM auto-detects quantization from `quantization_config` in `config.json`; no `--quantization` flag is needed.

### Key environment variables

| Variable | Reason |
|----------|--------|
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | SM12.1 (GB10) has no native CUTLASS FP4 kernel; Marlin is 15% faster for 512 experts |
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Forces FP8 Marlin path on GB10 SM12.1 |
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | FlashInfer MoE FP4 path not supported on GB10 SM12.1 |
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1 |
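
One way these variables might be handed to the container is as `-e` options on `docker run`. The sketch below only builds and prints the argument list as a dry run. The array names are hypothetical, and the real launcher may pass the variables differently.

```bash
#!/usr/bin/env bash
# Sketch: turn the variables from the table into `docker run -e` arguments.
VLLM_ENV=(
  "VLLM_NVFP4_GEMM_BACKEND=marlin"
  "VLLM_TEST_FORCE_FP8_MARLIN=1"
  "VLLM_USE_FLASHINFER_MOE_FP4=0"
  "VLLM_MARLIN_USE_ATOMIC_ADD=1"
)

# Build an array holding one "-e KEY=VALUE" pair per variable.
ENV_ARGS=()
for kv in "${VLLM_ENV[@]}"; do
  ENV_ARGS+=(-e "$kv")
done

# Dry run: print the arguments instead of starting a container.
printf '%s ' "${ENV_ARGS[@]}"
echo
```

The array can then be expanded inside a real invocation, for example `docker run -d "${ENV_ARGS[@]}" -e HF_TOKEN ...`, where `-e HF_TOKEN` without a value forwards the variable from the host environment.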

### Key flags

| Flag | Reason |
|------|--------|
| `--dtype auto` | BF16 for non-quantized layers (DeltaNet, router gates, lm_head) |
| `--kv-cache-dtype fp8` | FP8 KV cache; applies to 12 full-attention layers only |
| `--gpu-memory-utilization 0.90` | 0.90 × 128 GB ≈ 115 GB; covers ~43 GB weights + ~72 GB KV cache (0.93 is risky) |
| `--max-model-len 262144` | Full native context; tested by saricles with FP8 KV cache |
| `--max-num-seqs 64` | Max concurrent requests |
| `--max-num-batched-tokens 8192` | Prevents OOM on long contexts |
| `--attention-backend flashinfer` | Required for FP8 KV cache + chunked prefill on GB10 |
| `--enable-prefix-caching` | Reuses KV cache for repeated prompt prefixes (system prompts) |
| `--enable-chunked-prefill` | Reduces memory spikes during long-prompt processing |
| `--tool-call-parser qwen3_coder` | OpenAI-compatible tool calling |

## Requirements

- Docker with `nvidia-container-toolkit`
- Image: `vllm/vllm-openai:cu130-nightly`
- HF token with access to `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` (gated: accept the license first)
- Model weights cached locally (auto-downloaded on first run, ~45 GB):
  `~/.cache/huggingface/hub/models--gdubicki--Qwen3-Coder-Next-NVFP4-GB10/`
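
The Hugging Face hub cache names a model directory by prefixing `models--` and replacing each `/` in the repo id with `--`. A minimal sketch for checking whether the weights are already cached, assuming the default cache root (no `HF_HOME` override); `hf_cache_dir` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: derive the hub cache directory for a repo id.
# Assumes the default cache root, ~/.cache/huggingface/hub.
hf_cache_dir() {
  local repo_id="$1"
  printf '%s/.cache/huggingface/hub/models--%s\n' "$HOME" "${repo_id//\//--}"
}

dir="$(hf_cache_dir "gdubicki/Qwen3-Coder-Next-NVFP4-GB10")"
if [ -d "$dir" ]; then
  echo "weights cached at $dir"
else
  echo "not cached yet; expect a ~45 GB download on first run"
fi
```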