# gdubicki/Qwen3-Coder-Next-NVFP4-GB10 on DGX Spark (GB10)

Runs [`gdubicki/Qwen3-Coder-Next-NVFP4-GB10`](https://huggingface.co/gdubicki/Qwen3-Coder-Next-NVFP4-GB10) (quantized by [saricles](https://huggingface.co/saricles/Qwen3-Coder-Next-NVFP4-GB10)) via vLLM with an OpenAI-compatible API endpoint.

Tested on DGX Spark (GB10 Blackwell, SM12.1, 128 GB unified LPDDR5X).

## Model overview

| Property | Value |
|----------|-------|
| Architecture | `qwen3_next` (Hybrid DeltaNet linear attention + full attention + latent MoE) |
| Base model | `Qwen/Qwen3-Coder-Next` |
| Parameters | 80B total, **3B active** per token (512 experts, 10 active + 1 shared) |
| Layers | 48 (pattern: 3× DeltaNet linear → 1× full attention, repeating → 12 full attention) |
| Quantization | NVFP4 via `llmcompressor` + `compressed-tensors`; all 512 MoE experts calibrated |
| Kept in BF16 | `lm_head`, `embed_tokens`, `linear_attn` layers, `mlp.gate`, `mlp.shared_expert_gate` |
| KV cache | FP8 (only for the 12 full-attention layers) |
| Model size | ~45 GB (70% reduction from ~149 GB BF16) |
| Max context | 262,144 tokens (native; tested with FP8 KV cache by saricles) |
| Reasoning | Built-in chain-of-thought (`<think>` tags), ON by default |

## Model features

| Feature | Support |
|---------|---------|
| Tool calling | ✅ Yes (`--tool-call-parser qwen3_coder`) |
| Reasoning / thinking mode | ✅ Yes (ON by default, toggleable via `enable_thinking`) |
| Languages | multilingual (code-focused) |

## Performance

Measured on DGX Spark (GB10, SM12.1, 128 GB unified LPDDR5X):

| Metric | Value |
|--------|-------|
| Throughput | ~61 tok/s |
| Max context | 262,144 tokens |
| KV cache concurrency | **31.65×** at 262K tokens (DeltaNet has no KV cache) |

Comparison across models on GB10:

| Model | Active params | tok/s |
|-------|--------------|-------|
| Gemma-4-31B-IT-NVFP4 (dense) | 31B | ~7 |
| Qwen3-32B-NVFP4 (dense) | 32.8B | ~11 |
| Nemotron-3-Super-120B-A12B-NVFP4 | 12B | ~16 |
| **Qwen3-Coder-Next-NVFP4-GB10** | **3B** | **~61** |
| Nemotron-3-Nano-30B-A3B-NVFP4 | 3B | ~61 |

Qwen3-Coder-Next is 80B total but 3B active: same throughput as Nemotron-3-Nano with full 262K context.

## Quick start

```bash
# Required: the model is gated on Hugging Face
# (accept the license at gdubicki/Qwen3-Coder-Next-NVFP4-GB10 first):
export HF_TOKEN=hf_xxxx

bash start-qwen3-coder-next.sh
```

The script will:

1. Stop and remove any existing `qwen3-coder-next-vllm` container
2. Flush the system page cache (frees unified memory before vLLM starts)
3. Start the container in detached mode
4. Poll `http://localhost:8000/health` until the API is ready (~45 GB download on first run)

## Test the API

```bash
bash test-api.sh              # localhost
bash test-api.sh 192.168.x.x  # remote host
```

## Reasoning (chain-of-thought)

Reasoning is **ON by default**. Toggle per request:

```bash
# Reasoning OFF
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":60,"chat_template_kwargs":{"enable_thinking":false}}'

# Reasoning ON (default)
curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a binary search implementation in Python."}],"max_tokens":2000,"chat_template_kwargs":{"enable_thinking":true}}'
```

## Cline configuration

1. Open Cline settings (sidebar icon → gear icon)
2. Fill in the fields:

| Field | Value |
|-------|-------|
| Provider | OpenAI Compatible |
| Base URL | `http://<host>:8000/v1` |
| Model ID | `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` |
| API Key | `dummy` (any non-empty string) |

## Files

| File | Purpose |
|------|---------|
| `start-qwen3-coder-next.sh` | Full launcher: stop, cache flush, docker run, health poll |
| `docker-run.sh` | Bare `docker run` command with comments, for reference |
| `test-api.sh` | curl smoke tests: health, model list, chat completion, reasoning, code generation |

## How it works

The `vllm/vllm-openai:cu130-nightly` image includes native `qwen3_next` support.

**Architecture**: Hybrid of:

- **DeltaNet** selective linear attention layers (36 of 48 layers): subquadratic in sequence length, no KV cache
- **Full attention** layers (12 of 48): standard transformer attention with KV cache
- **Latent MoE** (80B total, 3B active per token; same throughput profile as Nano)

Quantization is via `compressed-tensors` (llmcompressor) with `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`, so all 512 MoE experts are calibrated, not just sampled. vLLM auto-detects quantization from `quantization_config` in `config.json`; no `--quantization` flag needed.
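
The 36/12 layer split follows directly from the repeating pattern of three DeltaNet layers and one full-attention layer. Here is a minimal sketch that counts it, assuming layers are numbered from 1 and every fourth layer is the full-attention one. The function names `layer_type` and `count_layers` are illustrative and not part of this repo's scripts.

```bash
#!/usr/bin/env bash
# Sketch of the repeating layer pattern: 3x DeltaNet linear, then 1x full attention.
# Assumption: 1-indexed layers, with every 4th layer using full attention.

# Prints the attention type for a given layer index.
layer_type() {
  local idx=$1
  if [ $(( idx % 4 )) -eq 0 ]; then
    echo "full_attention"
  else
    echo "linear_attention"
  fi
}

# Counts both layer types for a model with the given number of layers.
count_layers() {
  local total=$1 linear=0 full=0 i=1
  while [ "$i" -le "$total" ]; do
    if [ "$(layer_type "$i")" = "full_attention" ]; then
      full=$(( full + 1 ))
    else
      linear=$(( linear + 1 ))
    fi
    i=$(( i + 1 ))
  done
  echo "linear=${linear} full=${full}"
}

count_layers 48   # prints: linear=36 full=12
```

Only the 12 full-attention layers hold a KV cache, which is why the FP8 KV cache setting applies to a quarter of the stack.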
### Key environment variables

| Variable | Reason |
|----------|--------|
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | SM12.1 (GB10) has no native CUTLASS FP4 kernel; Marlin is 15% faster for 512 experts |
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Forces FP8 Marlin path on GB10 SM12.1 |
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | FlashInfer MoE FP4 path not supported on GB10 SM12.1 |
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1 |

### Key flags

| Flag | Reason |
|------|--------|
| `--dtype auto` | BF16 for non-quantized layers (DeltaNet, router gates, lm_head) |
| `--kv-cache-dtype fp8` | FP8 KV cache; applies to 12 full-attention layers only |
| `--gpu-memory-utilization 0.90` | 0.90 × 128 GB = 115 GB; covers ~43 GB weights + ~72 GB KV cache (0.93 is risky) |
| `--max-model-len 262144` | Full native context; tested by saricles with FP8 KV cache |
| `--max-num-seqs 64` | Max concurrent requests |
| `--max-num-batched-tokens 8192` | Prevents OOM on long contexts |
| `--attention-backend flashinfer` | Required for FP8 KV cache + chunked prefill on GB10 |
| `--enable-prefix-caching` | Reuses KV cache for repeated prompt prefixes (system prompts) |
| `--enable-chunked-prefill` | Reduces memory spikes during long-prompt processing |
| `--tool-call-parser qwen3_coder` | OpenAI-compatible tool calling |

## Requirements

- Docker with `nvidia-container-toolkit`
- Image: `vllm/vllm-openai:cu130-nightly`
- HF token with access to `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` (gated; accept the license first)
- Model weights cached locally (auto-downloaded on first run, ~45 GB): `~/.cache/huggingface/hub/models--gdubicki--Qwen3-Coder-Next-NVFP4-GB10/`
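
Two of the requirements above can be checked before launching. The following is a minimal preflight sketch, not part of the repo's scripts: the function names `have_cmd`, `have_hf_token`, and `preflight` are hypothetical. It only verifies that `docker` is on `PATH` and that `HF_TOKEN` is non-empty; it does not confirm that `nvidia-container-toolkit` is installed or that the license has been accepted.

```bash
#!/usr/bin/env bash
# Preflight sketch for the requirements listed above.
# All function names are hypothetical and not part of this repo's scripts.

# Succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Succeeds if HF_TOKEN is set and non-empty.
have_hf_token() {
  [ -n "${HF_TOKEN:-}" ]
}

# Prints one line per checked requirement; returns non-zero if any is missing.
preflight() {
  local rc=0
  if have_cmd docker; then
    echo "ok: docker found"
  else
    echo "missing: docker"
    rc=1
  fi
  if have_hf_token; then
    echo "ok: HF_TOKEN is set"
  else
    echo "missing: HF_TOKEN (the model is gated)"
    rc=1
  fi
  return "$rc"
}

preflight || echo "Fix the items above before running start-qwen3-coder-next.sh"
```

Running this first gives a clearer message than a failed download partway through container startup.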