# gdubicki/Qwen3-Coder-Next-NVFP4-GB10 on DGX Spark (GB10)
Runs [`gdubicki/Qwen3-Coder-Next-NVFP4-GB10`](https://huggingface.co/gdubicki/Qwen3-Coder-Next-NVFP4-GB10) (quantized by [saricles](https://huggingface.co/saricles/Qwen3-Coder-Next-NVFP4-GB10)) via vLLM with an OpenAI-compatible API endpoint.
Tested on DGX Spark (GB10 Blackwell, SM12.1, 128 GB unified LPDDR5X).
## Model overview
| Property | Value |
|----------|-------|
| Architecture | `qwen3_next` (Hybrid DeltaNet linear attention + full attention + latent MoE) |
| Base model | `Qwen/Qwen3-Coder-Next` |
| Parameters | 80B total, **3B active** per token (512 experts, 10 active + 1 shared) |
| Layers | 48 (repeating pattern: 3× DeltaNet linear → 1× full attention, giving 36 DeltaNet and 12 full-attention layers) |
| Quantization | NVFP4 via `llmcompressor` + `compressed-tensors`; all 512 MoE experts calibrated |
| Kept in BF16 | `lm_head`, `embed_tokens`, `linear_attn` layers, `mlp.gate`, `mlp.shared_expert_gate` |
| KV cache | FP8 (only for the 12 full-attention layers) |
| Model size | ~45 GB (70% reduction from ~149 GB BF16) |
| Max context | 262,144 tokens (native; tested with FP8 KV cache by saricles) |
| Reasoning | Built-in chain-of-thought (`<think>` tags), ON by default |
## Model features
| Feature | Support |
|---------|---------|
| Tool calling | ✅ Yes (`--tool-call-parser qwen3_coder`) |
| Reasoning / thinking mode | ✅ Yes (ON by default, toggleable via `enable_thinking`) |
| Languages | multilingual (code-focused) |
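
A tool-calling request can be sketched as follows, for use once the server from the Quick start section is running. The `get_weather` tool and the `send_tool_request` helper are hypothetical names chosen for illustration. `"tool_choice": "auto"` assumes the server was started with vLLM's `--enable-auto-tool-choice`, which this README does not list.

```bash
# Request body following the OpenAI chat-completions "tools" schema.
# get_weather is an illustrative tool, not something the model ships with.
payload=$(cat <<'EOF'
{
  "model": "gdubicki/Qwen3-Coder-Next-NVFP4-GB10",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Return the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto",
  "max_tokens": 200
}
EOF
)

# Hypothetical helper: POST the payload to a server (default: localhost:8000).
send_tool_request() {
  curl -s -X POST "${1:-http://localhost:8000}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$payload"
}
```

Call `send_tool_request` (or `send_tool_request http://<spark-ip>:8000`). If the model decides to use the tool, the call should appear under `choices[0].message.tool_calls` in the response.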
## Performance
Measured on DGX Spark (GB10, SM12.1, 128 GB unified LPDDR5X):
| Metric | Value |
|--------|-------|
| Throughput | ~61 tok/s |
| Max context | 262,144 tokens |
| KV cache concurrency | **31.65×** at 262K tokens (DeltaNet has no KV cache) |
Comparison across models on GB10:
| Model | Active params | tok/s |
|-------|--------------|-------|
| Gemma-4-31B-IT-NVFP4 (dense) | 31B | ~7 |
| Qwen3-32B-NVFP4 (dense) | 32.8B | ~11 |
| Nemotron-3-Super-120B-A12B-NVFP4 | 12B | ~16 |
| **Qwen3-Coder-Next-NVFP4-GB10** | **3B** | **~61** |
| Nemotron-3-Nano-30B-A3B-NVFP4 | 3B | ~61 |
Qwen3-Coder-Next has 80B total parameters but only 3B active per token, so it matches Nemotron-3-Nano's throughput while offering the full 262K context.
## Quick start
```bash
# Required: the model is gated on Hugging Face (accept the license at gdubicki/Qwen3-Coder-Next-NVFP4-GB10 first)
export HF_TOKEN=hf_xxxx
bash start-qwen3-coder-next.sh
```
The script will:
1. Stop and remove any existing `qwen3-coder-next-vllm` container
2. Flush the system page cache (frees unified memory before vLLM starts)
3. Start the container in detached mode
4. Poll `http://localhost:8000/health` until the API is ready (~45 GB download on first run)
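
The polling in step 4 can be sketched as a small retry loop. This is a minimal illustration, not the launcher's actual code: `wait_until_ready` is a hypothetical helper, and the attempt count and delay are assumed values.

```bash
# Hypothetical helper: retry a probe command until it succeeds.
# Usage: wait_until_ready <max_attempts> <delay_seconds> <probe command...>
wait_until_ready() {
  local max_attempts=$1
  local delay=$2
  local attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    # The probe exits 0 once the server answers.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  # Every attempt failed.
  return 1
}
```

For example, `wait_until_ready 120 5 curl -sf http://localhost:8000/health` would probe every 5 seconds for up to 10 minutes, which leaves room for the first-run download.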
## Test the API
```bash
bash test-api.sh # localhost
bash test-api.sh 192.168.x.x # remote host
```
## Reasoning (chain-of-thought)
Reasoning is **ON by default**. Toggle per request:
```bash
# Reasoning OFF
curl -s -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":60,"chat_template_kwargs":{"enable_thinking":false}}'
# Reasoning ON (default)
curl -s -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a binary search implementation in Python."}],"max_tokens":2000,"chat_template_kwargs":{"enable_thinking":true}}'
```
## Cline configuration
1. Open Cline settings (sidebar icon → gear icon)
2. Fill in the fields:
| Field | Value |
|-------|-------|
| Provider | OpenAI Compatible |
| Base URL | `http://<spark-ip>:8000/v1` |
| Model ID | `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` |
| API Key | `dummy` (any non-empty string) |
## Files
| File | Purpose |
|------|---------|
| `start-qwen3-coder-next.sh` | Full launcher: stop, cache flush, docker run, health poll |
| `docker-run.sh` | Bare `docker run` command with comments, for reference |
| `test-api.sh` | curl smoke tests: health, model list, chat completion, reasoning, code generation |
## How it works
The `vllm/vllm-openai:cu130-nightly` image includes native `qwen3_next` support.
**Architecture**: Hybrid of:
- **DeltaNet** selective linear attention layers (36 of 48 layers): subquadratic in sequence length, no KV cache
- **Full attention** layers (12 of 48): standard transformer attention with KV cache
- **Latent MoE** (80B total, 3B active per token; same throughput profile as Nemotron-3-Nano)
Quantization uses `llmcompressor` with the `compressed-tensors` format and `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`, so all 512 MoE experts are calibrated rather than only a sampled subset.
vLLM auto-detects quantization from `quantization_config` in `config.json`; no `--quantization` flag needed.
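
To see what vLLM would pick up, the quantization method can be read straight from a downloaded `config.json`. The sketch below assumes the Hugging Face convention of a `quant_method` key inside `quantization_config`. The `quant_method` shell function is a hypothetical helper, and the sample file stands in for the real config.

```bash
# Hypothetical helper: print the quant_method recorded in a config.json.
quant_method() {
  sed -n 's/.*"quant_method"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Demo on a stand-in file; point the function at the real config.json instead.
sample=$(mktemp)
printf '%s\n' '{"quantization_config": {"quant_method": "compressed-tensors"}}' > "$sample"
detected=$(quant_method "$sample")
rm -f "$sample"

echo "$detected"   # prints: compressed-tensors
```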
### Key environment variables
| Variable | Reason |
|----------|--------|
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | SM12.1 (GB10) has no native CUTLASS FP4 kernel; Marlin is 15% faster for 512 experts |
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Forces FP8 Marlin path on GB10 SM12.1 |
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | FlashInfer MoE FP4 path not supported on GB10 SM12.1 |
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1 |
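
One way to pass these variables to the container is as `-e` flags on `docker run`. The sketch below only builds the flag string. The variable names `vllm_env` and `env_flags` are illustrative, and the rest of the command line is left to `docker-run.sh`.

```bash
# Variables from the table above, one NAME=value pair per line.
vllm_env="VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_MARLIN_USE_ATOMIC_ADD=1"

# Turn each pair into a "-e NAME=value" argument for docker run.
# Word splitting is intentional here: none of the values contain spaces.
env_flags=""
for pair in $vllm_env; do
  env_flags="$env_flags -e $pair"
done
env_flags=${env_flags# }   # drop the leading space

echo "$env_flags"
```

The resulting string can be expanded unquoted inside a `docker run` invocation so that each flag becomes its own argument.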
### Key flags
| Flag | Reason |
|------|--------|
| `--dtype auto` | BF16 for non-quantized layers (DeltaNet, router gates, lm_head) |
| `--kv-cache-dtype fp8` | FP8 KV cache; applies to 12 full-attention layers only |
| `--gpu-memory-utilization 0.90` | 0.90 × 128 GB ≈ 115 GB; covers ~43 GB weights + ~72 GB KV cache (0.93 is risky) |
| `--max-model-len 262144` | Full native context; tested by saricles with FP8 KV cache |
| `--max-num-seqs 64` | Max concurrent requests |
| `--max-num-batched-tokens 8192` | Prevents OOM on long contexts |
| `--attention-backend flashinfer` | Required for FP8 KV cache + chunked prefill on GB10 |
| `--enable-prefix-caching` | Reuses KV cache for repeated prompt prefixes (system prompts) |
| `--enable-chunked-prefill` | Reduces memory spikes during long-prompt processing |
| `--tool-call-parser qwen3_coder` | OpenAI-compatible tool calling |
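
The arithmetic behind the `--gpu-memory-utilization` row can be checked with a quick sketch. The inputs come from the table above. Treating everything left after the weights as KV cache ignores activations and runtime overhead, so the second figure is an upper bound.

```bash
# Inputs taken from the flags table.
total_gb=128        # unified memory on the DGX Spark
utilization=0.90    # --gpu-memory-utilization
weights_gb=43       # approximate size of the loaded weights

# awk handles the floating-point arithmetic that plain shell lacks.
budget_gb=$(awk -v t="$total_gb" -v u="$utilization" 'BEGIN { printf "%.1f", t * u }')
kv_gb=$(awk -v b="$budget_gb" -v w="$weights_gb" 'BEGIN { printf "%.1f", b - w }')

echo "vLLM budget: ${budget_gb} GB; left after weights: ${kv_gb} GB"
```

Changing `utilization` to 0.93 shows how little memory would remain outside vLLM's budget on a 128 GB unified-memory system, which is why the table calls that setting risky.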
## Requirements
- Docker with `nvidia-container-toolkit`
- Image: `vllm/vllm-openai:cu130-nightly`
- HF token with access to `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` (gated; accept the license first)
- Model weights cached locally (auto-downloaded on first run, ~45 GB):
  `~/.cache/huggingface/hub/models--gdubicki--Qwen3-Coder-Next-NVFP4-GB10/`