Switch to saricles/Qwen3-Coder-Next-NVFP4-GB10 with GB10-optimized settings
- README.md +152 -0
- docker-run.sh +100 -0
- start-qwen3-coder-next.sh +200 -0
- test-api.sh +149 -0
README.md
ADDED
|
@@ -0,0 +1,152 @@
| 1 |
+
# saricles/Qwen3-Coder-Next-NVFP4-GB10 on DGX Spark (GB10)
|
| 2 |
+
|
| 3 |
+
Runs [`saricles/Qwen3-Coder-Next-NVFP4-GB10`](https://huggingface.co/saricles/Qwen3-Coder-Next-NVFP4-GB10) via vLLM with an OpenAI-compatible API endpoint.
|
| 4 |
+
Tested on DGX Spark (GB10 Blackwell, SM12.1, 128 GB unified LPDDR5X).
|
| 5 |
+
|
| 6 |
+
## Model overview
|
| 7 |
+
|
| 8 |
+
| Property | Value |
|
| 9 |
+
|----------|-------|
|
| 10 |
+
| Architecture | `qwen3_next` (Hybrid DeltaNet linear attention + full attention + latent MoE) |
|
| 11 |
+
| Base model | `Qwen/Qwen3-Coder-Next` |
|
| 12 |
+
| Parameters | 80B total, **3B active** per token (512 experts, 10 active + 1 shared) |
|
| 13 |
+
| Layers | 48 (pattern: 3× DeltaNet linear → 1× full attention, repeating → 12 full attention) |
|
| 14 |
+
| Quantization | NVFP4 via `llmcompressor` + `compressed-tensors`; all 512 MoE experts calibrated |
|
| 15 |
+
| Kept in BF16 | `lm_head`, `embed_tokens`, `linear_attn` layers, `mlp.gate`, `mlp.shared_expert_gate` |
|
| 16 |
+
| KV cache | FP8 (only for the 12 full-attention layers) |
|
| 17 |
+
| Model size | ~45 GB (70% reduction from ~149 GB BF16) |
|
| 18 |
+
| Max context | 262,144 tokens (native; tested with FP8 KV cache by saricles) |
|
| 19 |
+
| Reasoning | Built-in chain-of-thought (`<think>` tags), ON by default |
|
| 20 |
+
|
| 21 |
+
## Model features
|
| 22 |
+
|
| 23 |
+
| Feature | Support |
|
| 24 |
+
|---------|---------|
|
| 25 |
+
| Tool calling | ✅ Yes (`--tool-call-parser qwen3_coder`) |
|
| 26 |
+
| Reasoning / thinking mode | ✅ Yes (ON by default, toggleable via `enable_thinking`) |
|
| 27 |
+
| Languages | multilingual (code-focused) |
|
| 28 |
+
|
| 29 |
+
## Performance
|
| 30 |
+
|
| 31 |
+
Measured on DGX Spark (GB10, SM12.1, 128 GB unified LPDDR5X):
|
| 32 |
+
|
| 33 |
+
| Metric | Value |
|
| 34 |
+
|--------|-------|
|
| 35 |
+
| Throughput | ~61 tok/s |
|
| 36 |
+
| Max context | 262,144 tokens |
|
| 37 |
+
| KV cache concurrency | **31.65×** at 262K tokens (DeltaNet has no KV cache) |
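
As a rough reading of the concurrency figure, the sketch below estimates the KV cache capacity it implies. It assumes the number follows the convention of vLLM's startup log (KV cache token capacity divided by the per-request context length); the hybrid DeltaNet layout may make the real accounting differ. `kv_capacity_tokens` is a hypothetical helper, not part of this repository.

```bash
# Hypothetical helper: estimate KV cache capacity in tokens from a
# concurrency multiplier (passed in hundredths, e.g. 3165 for 31.65x)
# and the per-request context length. Uses integer arithmetic only.
kv_capacity_tokens() {
  local concurrency_x100="$1" context_len="$2"
  echo $(( concurrency_x100 * context_len / 100 ))
}

kv_capacity_tokens 3165 262144   # roughly 8.3 million tokens
```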
|
| 38 |
+
|
| 39 |
+
Comparison across models on GB10:
|
| 40 |
+
|
| 41 |
+
| Model | Active params | tok/s |
|
| 42 |
+
|-------|--------------|-------|
|
| 43 |
+
| Gemma-4-31B-IT-NVFP4 (dense) | 31B | ~7 |
|
| 44 |
+
| Qwen3-32B-NVFP4 (dense) | 32.8B | ~11 |
|
| 45 |
+
| Nemotron-3-Super-120B-A12B-NVFP4 | 12B | ~16 |
|
| 46 |
+
| **Qwen3-Coder-Next-NVFP4-GB10** | **3B** | **~61** |
|
| 47 |
+
| Nemotron-3-Nano-30B-A3B-NVFP4 | 3B | ~61 |
|
| 48 |
+
|
| 49 |
+
Qwen3-Coder-Next is 80B total but 3B active — same throughput as Nemotron-3-Nano with full 262K context.
|
| 50 |
+
|
| 51 |
+
## Quick start
|
| 52 |
+
|
| 53 |
+
```bash
|
| 54 |
+
# Required — model is gated on Hugging Face (accept license at saricles/Qwen3-Coder-Next-NVFP4-GB10 first):
|
| 55 |
+
export HF_TOKEN=hf_xxxx
|
| 56 |
+
|
| 57 |
+
bash start-qwen3-coder-next.sh
|
| 58 |
+
```
|
| 59 |
+
|
| 60 |
+
The script will:
|
| 61 |
+
1. Stop and remove any existing `qwen3-coder-next-vllm` container
|
| 62 |
+
2. Flush the system page cache (frees unified memory before vLLM starts)
|
| 63 |
+
3. Start the container in detached mode
|
| 64 |
+
4. Poll `http://localhost:8000/health` until the API is ready (~45 GB download on first run)
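
The polling in step 4 can be sketched as a small reusable function. This is a simplified illustration rather than the launcher's exact code: `wait_for` is a hypothetical name, and unlike the real script it does not check whether the container is still running.

```bash
# Hypothetical helper: retry a command until it succeeds or a time
# budget (in seconds) is used up.
# Usage: wait_for MAX_WAIT INTERVAL COMMAND [ARGS...]
wait_for() {
  local max_wait="$1" interval="$2"
  shift 2
  local elapsed=0
  until "$@" > /dev/null 2>&1; do
    if (( elapsed >= max_wait )); then
      return 1   # budget exhausted, command never succeeded
    fi
    sleep "${interval}"
    elapsed=$(( elapsed + interval ))
  done
  return 0
}

# Example (not executed here):
#   wait_for 900 10 curl -sf http://localhost:8000/health
```

Passing the check as a command keeps the loop independent of `curl`, so the same function could poll any readiness probe.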
|
| 65 |
+
|
| 66 |
+
## Test the API
|
| 67 |
+
|
| 68 |
+
```bash
|
| 69 |
+
bash test-api.sh # localhost
|
| 70 |
+
bash test-api.sh 192.168.x.x # remote host
|
| 71 |
+
```
|
| 72 |
+
|
| 73 |
+
## Reasoning (chain-of-thought)
|
| 74 |
+
|
| 75 |
+
Reasoning is **ON by default**. Toggle per request:
|
| 76 |
+
|
| 77 |
+
```bash
|
| 78 |
+
# Reasoning OFF
|
| 79 |
+
curl -s -X POST http://localhost:8000/v1/chat/completions \
|
| 80 |
+
-H "Content-Type: application/json" \
|
| 81 |
+
-d '{"model":"saricles/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":60,"chat_template_kwargs":{"enable_thinking":false}}'
|
| 82 |
+
|
| 83 |
+
# Reasoning ON (default)
|
| 84 |
+
curl -s -X POST http://localhost:8000/v1/chat/completions \
|
| 85 |
+
-H "Content-Type: application/json" \
|
| 86 |
+
-d '{"model":"saricles/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a binary search implementation in Python."}],"max_tokens":2000,"chat_template_kwargs":{"enable_thinking":true}}'
|
| 87 |
+
```
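
The launcher's comments note that the server runs without `--reasoning-parser`, so `<think>` blocks arrive inside `content`. A client that only wants the final answer could strip them along the lines of the sketch below. `strip_think` is a hypothetical helper and assumes at most one think block at the start of the text.

```bash
# Hypothetical helper: drop everything up to and including the first
# </think> tag, then trim leading whitespace from what remains.
strip_think() {
  local text="$1"
  if [[ "${text}" == *"</think>"* ]]; then
    text="${text#*"</think>"}"
  fi
  # Remove leading whitespace (spaces, tabs, newlines).
  text="${text#"${text%%[![:space:]]*}"}"
  printf '%s\n' "${text}"
}

strip_think $'<think>\nFrance... the capital is Paris.\n</think>\n\nParis.'
```

Text without a `</think>` tag passes through unchanged apart from the whitespace trim.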
|
| 88 |
+
|
| 89 |
+
## Cline configuration
|
| 90 |
+
|
| 91 |
+
1. Open Cline settings (sidebar icon → gear icon)
|
| 92 |
+
2. Fill in the fields:
|
| 93 |
+
|
| 94 |
+
| Field | Value |
|
| 95 |
+
|-------|-------|
|
| 96 |
+
| Provider | OpenAI Compatible |
|
| 97 |
+
| Base URL | `http://<spark-ip>:8000/v1` |
|
| 98 |
+
| Model ID | `saricles/Qwen3-Coder-Next-NVFP4-GB10` |
|
| 99 |
+
| API Key | `dummy` (any non-empty string) |
|
| 100 |
+
|
| 101 |
+
## Files
|
| 102 |
+
|
| 103 |
+
| File | Purpose |
|
| 104 |
+
|------|---------|
|
| 105 |
+
| `start-qwen3-coder-next.sh` | Full launcher: stop, cache flush, docker run, health poll |
|
| 106 |
+
| `docker-run.sh` | Bare `docker run` command with comments, for reference |
|
| 107 |
+
| `test-api.sh` | curl smoke tests: health, model list, chat completion, reasoning, code generation |
|
| 108 |
+
|
| 109 |
+
## How it works
|
| 110 |
+
|
| 111 |
+
The `vllm/vllm-openai:cu130-nightly` image includes native `qwen3_next` support.
|
| 112 |
+
|
| 113 |
+
**Architecture**: Hybrid of:
|
| 114 |
+
- **DeltaNet** selective linear attention layers (36 of 48 layers) — subquadratic in sequence length, no KV cache
|
| 115 |
+
- **Full attention** layers (12 of 48) — standard transformer attention with KV cache
|
| 116 |
+
- **Latent MoE** (80B total, 3B active per token — same throughput profile as Nano)
|
| 117 |
+
|
| 118 |
+
Quantization is via `compressed-tensors` (llmcompressor) with `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`
|
| 119 |
+
— all 512 MoE experts are calibrated, not just sampled.
|
| 120 |
+
vLLM auto-detects quantization from `quantization_config` in `config.json`; no `--quantization` flag needed.
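
To see what vLLM will pick up, the `quant_method` field can be read from a local copy of `config.json`. The sketch below is a plain-text check, not a real JSON parser; `quant_method` as a function name and the file path in the usage comment are hypothetical.

```bash
# Hypothetical helper: print the first "quant_method" value found in a
# config.json file. Relies on grep/sed pattern matching, so it assumes
# the key and its string value sit on one line.
quant_method() {
  local config_path="$1"
  grep -o '"quant_method"[[:space:]]*:[[:space:]]*"[^"]*"' "${config_path}" \
    | head -n1 \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# Example (path is an assumption, not executed here):
#   quant_method /path/to/snapshot/config.json
```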
|
| 121 |
+
|
| 122 |
+
### Key environment variables
|
| 123 |
+
|
| 124 |
+
| Variable | Reason |
|
| 125 |
+
|----------|--------|
|
| 126 |
+
| `VLLM_NVFP4_GEMM_BACKEND=marlin` | SM12.1 (GB10) has no native CUTLASS FP4 kernel; Marlin is 15% faster for 512 experts |
|
| 127 |
+
| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Forces FP8 Marlin path on GB10 SM12.1 |
|
| 128 |
+
| `VLLM_USE_FLASHINFER_MOE_FP4=0` | FlashInfer MoE FP4 path not supported on GB10 SM12.1 |
|
| 129 |
+
| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1 |
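
One way to pass these variables to `docker run` is to collect them in a bash array, which avoids relying on word splitting of an unquoted string. This is a sketch of an alternative, not what the scripts in this commit do; `build_env_flags` is a hypothetical helper.

```bash
# Hypothetical helper: print the docker environment arguments, one per line.
build_env_flags() {
  local -a flags=(
    -e VLLM_NVFP4_GEMM_BACKEND=marlin
    -e VLLM_TEST_FORCE_FP8_MARLIN=1
    -e VLLM_USE_FLASHINFER_MOE_FP4=0
    -e VLLM_MARLIN_USE_ATOMIC_ADD=1
  )
  # Forward HF_TOKEN only when it is set and non-empty.
  if [[ -n "${HF_TOKEN:-}" ]]; then
    flags+=(-e "HF_TOKEN=${HF_TOKEN}")
  fi
  printf '%s\n' "${flags[@]}"
}

# Example (not executed here):
#   mapfile -t env_flags < <(build_env_flags)
#   docker run "${env_flags[@]}" vllm/vllm-openai:cu130-nightly ...
```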
|
| 130 |
+
|
| 131 |
+
### Key flags
|
| 132 |
+
|
| 133 |
+
| Flag | Reason |
|
| 134 |
+
|------|--------|
|
| 135 |
+
| `--dtype auto` | BF16 for non-quantized layers (DeltaNet, router gates, lm_head) |
|
| 136 |
+
| `--kv-cache-dtype fp8` | FP8 KV cache; applies to 12 full-attention layers only |
|
| 137 |
+
| `--gpu-memory-utilization 0.90` | 0.90 × 128 GB = 115 GB; covers ~43 GB weights + ~72 GB KV cache (0.93 is risky) |
|
| 138 |
+
| `--max-model-len 262144` | Full native context; tested by saricles with FP8 KV cache |
|
| 139 |
+
| `--max-num-seqs 64` | Max concurrent requests |
|
| 140 |
+
| `--max-num-batched-tokens 8192` | Prevents OOM on long contexts |
|
| 141 |
+
| `--attention-backend flashinfer` | Required for FP8 KV cache + chunked prefill on GB10 |
|
| 142 |
+
| `--enable-prefix-caching` | Reuses KV cache for repeated prompt prefixes (system prompts) |
|
| 143 |
+
| `--enable-chunked-prefill` | Reduces memory spikes during long-prompt processing |
|
| 144 |
+
| `--tool-call-parser qwen3_coder` | OpenAI-compatible tool calling |
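
The memory split behind `--gpu-memory-utilization 0.90` can be reproduced with integer arithmetic. The sketch below uses the figures from the table (128 GB pool, ~43 GB of weights); `mem_budget_gb` is a hypothetical helper, and the KV cache value is simply what is left after weights, ignoring activations and other overhead.

```bash
# Hypothetical helper: split a unified-memory budget into weights and
# KV cache. Arguments: utilization in percent, total GB, weights GB.
mem_budget_gb() {
  local util_pct="$1" total_gb="$2" weights_gb="$3"
  local budget=$(( util_pct * total_gb / 100 ))
  local kv_cache=$(( budget - weights_gb ))
  echo "budget=${budget}GB kv_cache=${kv_cache}GB"
}

mem_budget_gb 90 128 43   # prints: budget=115GB kv_cache=72GB
```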
|
| 145 |
+
|
| 146 |
+
## Requirements
|
| 147 |
+
|
| 148 |
+
- Docker with `nvidia-container-toolkit`
|
| 149 |
+
- Image: `vllm/vllm-openai:cu130-nightly`
|
| 150 |
+
- HF token with access to `saricles/Qwen3-Coder-Next-NVFP4-GB10` (gated — accept license first)
|
| 151 |
+
- Model weights cached locally (auto-downloaded on first run, ~45 GB):
|
| 152 |
+
`~/.cache/huggingface/hub/models--saricles--Qwen3-Coder-Next-NVFP4-GB10/`
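
The cache directory name follows the Hugging Face Hub convention of replacing `/` in the repository ID with `--` and prefixing `models--`. A small sketch of that mapping, with `hf_cache_dir` as a hypothetical helper name:

```bash
# Hypothetical helper: derive the local Hub cache directory for a model
# repository. The second argument (cache root) is optional.
hf_cache_dir() {
  local repo="$1"
  local cache_root="${2:-${HOME}/.cache/huggingface}"
  printf '%s/hub/models--%s\n' "${cache_root}" "${repo//\//--}"
}

hf_cache_dir saricles/Qwen3-Coder-Next-NVFP4-GB10
```

Checking whether this directory exists is a quick way to tell if the first-run download has already happened.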
|
docker-run.sh
ADDED
|
@@ -0,0 +1,100 @@
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# docker-run.sh — bare docker run command (without start-qwen3-coder-next.sh lifecycle logic)
|
| 3 |
+
# Useful for manual testing or embedding in other scripts.
|
| 4 |
+
#
|
| 5 |
+
# Usage: bash docker-run.sh
|
| 6 |
+
#
|
| 7 |
+
# Environment variables:
|
| 8 |
+
# HF_TOKEN — optional Hugging Face token (required for gated models)
|
| 9 |
+
# HF_CACHE — local weight cache path (default: ~/.cache/huggingface)
|
| 10 |
+
|
| 11 |
+
set -euo pipefail
|
| 12 |
+
|
| 13 |
+
HF_CACHE="${HF_CACHE:-${HOME}/.cache/huggingface}"
|
| 14 |
+
mkdir -p "${HF_CACHE}"
|
| 15 |
+
|
| 16 |
+
docker run \
|
| 17 |
+
--name qwen3-coder-next-vllm \
|
| 18 |
+
--rm \
|
| 19 |
+
--runtime=nvidia \
|
| 20 |
+
--gpus all \
|
| 21 |
+
-p 0.0.0.0:8000:8000 \
|
| 22 |
+
-v "${HF_CACHE}:/root/.cache/huggingface" \
|
| 23 |
+
--shm-size=32g \
|
| 24 |
+
-e VLLM_NVFP4_GEMM_BACKEND=marlin \
|
| 25 |
+
-e VLLM_TEST_FORCE_FP8_MARLIN=1 \
|
| 26 |
+
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
|
| 27 |
+
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
|
| 28 |
+
${HF_TOKEN:+-e HF_TOKEN="${HF_TOKEN}"} \
|
| 29 |
+
vllm/vllm-openai:cu130-nightly \
|
| 30 |
+
saricles/Qwen3-Coder-Next-NVFP4-GB10 \
|
| 31 |
+
--dtype auto \
|
| 32 |
+
--gpu-memory-utilization 0.90 \
|
| 33 |
+
--kv-cache-dtype fp8 \
|
| 34 |
+
--max-model-len 262144 \
|
| 35 |
+
--max-num-seqs 64 \
|
| 36 |
+
--max-num-batched-tokens 8192 \
|
| 37 |
+
--attention-backend flashinfer \
|
| 38 |
+
--enable-prefix-caching \
|
| 39 |
+
--enable-chunked-prefill \
|
| 40 |
+
--enable-auto-tool-choice \
|
| 41 |
+
--tool-call-parser qwen3_coder \
|
| 42 |
+
--host 0.0.0.0 \
|
| 43 |
+
--port 8000
|
| 44 |
+
|
| 45 |
+
# ---------------------------------------------------------------------------
|
| 46 |
+
# Flag reference:
|
| 47 |
+
#
|
| 48 |
+
# vllm/vllm-openai:cu130-nightly
|
| 49 |
+
# Native qwen3_next support (vLLM 0.19+).
|
| 50 |
+
#
|
| 51 |
+
# VLLM_NVFP4_GEMM_BACKEND=marlin
|
| 52 |
+
# SM12.1 (GB10) has no native CUTLASS FP4 kernel.
|
| 53 |
+
# Marlin handles NVFP4 W4A16 GEMM — 15% faster than CUTLASS for 512 experts.
|
| 54 |
+
#
|
| 55 |
+
# VLLM_TEST_FORCE_FP8_MARLIN=1
|
| 56 |
+
# Forces FP8 Marlin path on GB10 SM12.1.
|
| 57 |
+
#
|
| 58 |
+
# VLLM_USE_FLASHINFER_MOE_FP4=0
|
| 59 |
+
# FlashInfer MoE FP4 path not supported on GB10 SM12.1.
|
| 60 |
+
#
|
| 61 |
+
# VLLM_MARLIN_USE_ATOMIC_ADD=1
|
| 62 |
+
# GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1.
|
| 63 |
+
#
|
| 64 |
+
# No --quantization flag
|
| 65 |
+
# compressed-tensors format is auto-detected from config.json.
|
| 66 |
+
#
|
| 67 |
+
# --dtype auto
|
| 68 |
+
# BF16 for non-quantized layers (DeltaNet linear_attn, router gates, lm_head).
|
| 69 |
+
#
|
| 70 |
+
# --gpu-memory-utilization 0.90
|
| 71 |
+
# 0.90 × 128 GB = 115 GB for vLLM. Weights: ~43 GB. KV cache: ~72 GB.
|
| 72 |
+
# Safe limit per saricles testing (0.93 is risky).
|
| 73 |
+
#
|
| 74 |
+
# --kv-cache-dtype fp8
|
| 75 |
+
# FP8 KV cache. Only applies to the 12 full-attention layers (not DeltaNet).
|
| 76 |
+
#
|
| 77 |
+
# --max-model-len 262144
|
| 78 |
+
# Full native context. Tested with FP8 KV cache by saricles.
|
| 79 |
+
#
|
| 80 |
+
# --max-num-seqs 64
|
| 81 |
+
# Max concurrent requests.
|
| 82 |
+
#
|
| 83 |
+
# --max-num-batched-tokens 8192
|
| 84 |
+
# Limits tokens per batch — prevents OOM on long contexts.
|
| 85 |
+
#
|
| 86 |
+
# --attention-backend flashinfer
|
| 87 |
+
# Required for FP8 KV cache + chunked prefill on GB10.
|
| 88 |
+
#
|
| 89 |
+
# --enable-prefix-caching
|
| 90 |
+
# Reuses KV cache for repeated prompt prefixes (system prompts, etc.).
|
| 91 |
+
#
|
| 92 |
+
# --enable-chunked-prefill
|
| 93 |
+
# Reduces memory spikes during long-prompt processing.
|
| 94 |
+
#
|
| 95 |
+
# --enable-auto-tool-choice --tool-call-parser qwen3_coder
|
| 96 |
+
# Enables OpenAI-compatible tool calling for this model.
|
| 97 |
+
#
|
| 98 |
+
# --host 0.0.0.0 --port 8000
|
| 99 |
+
# OpenAI-compatible REST API, reachable from LAN.
|
| 100 |
+
# ---------------------------------------------------------------------------
|
start-qwen3-coder-next.sh
ADDED
|
@@ -0,0 +1,200 @@
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# start-qwen3-coder-next.sh: launch saricles/Qwen3-Coder-Next-NVFP4-GB10 via vLLM on DGX Spark (GB10)
|
| 3 |
+
# Requirements: Docker with nvidia-container-toolkit, vllm/vllm-openai:cu130-nightly
|
| 4 |
+
#
|
| 5 |
+
# Usage:
|
| 6 |
+
# bash start-qwen3-coder-next.sh
|
| 7 |
+
# HF_TOKEN=hf_xxxx bash start-qwen3-coder-next.sh
|
| 8 |
+
#
|
| 9 |
+
# Environment variables:
|
| 10 |
+
# HF_TOKEN — HF token (required when the model is gated on huggingface.co)
|
| 11 |
+
# HF_CACHE_DIR — local weight cache directory (default: ~/.cache/huggingface)
|
| 12 |
+
# MAX_MODEL_LEN: context length (default: 262144, the model's native maximum)
|
| 13 |
+
|
| 14 |
+
set -euo pipefail
|
| 15 |
+
|
| 16 |
+
# ---------------------------------------------------------------------------
|
| 17 |
+
# Configuration
|
| 18 |
+
# ---------------------------------------------------------------------------
|
| 19 |
+
CONTAINER_NAME="qwen3-coder-next-vllm"
|
| 20 |
+
IMAGE="vllm/vllm-openai:cu130-nightly"
|
| 21 |
+
MODEL="saricles/Qwen3-Coder-Next-NVFP4-GB10"
|
| 22 |
+
PORT=8000
|
| 23 |
+
MAX_MODEL_LEN="${MAX_MODEL_LEN:-262144}"
|
| 24 |
+
|
| 25 |
+
HF_CACHE_DIR="${HF_CACHE_DIR:-${HOME}/.cache/huggingface}"
|
| 26 |
+
HF_TOKEN="${HF_TOKEN:-}"
|
| 27 |
+
|
| 28 |
+
# ---------------------------------------------------------------------------
|
| 29 |
+
# 1. Stop existing container (if running or stopped)
|
| 30 |
+
# ---------------------------------------------------------------------------
|
| 31 |
+
if docker ps --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
|
| 32 |
+
echo "[INFO] Container '${CONTAINER_NAME}' is running — stopping..."
|
| 33 |
+
docker stop "${CONTAINER_NAME}"
|
| 34 |
+
docker rm "${CONTAINER_NAME}"
|
| 35 |
+
echo "[INFO] Container stopped and removed."
|
| 36 |
+
elif docker ps -a --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
|
| 37 |
+
echo "[INFO] Container '${CONTAINER_NAME}' exists (stopped) — removing..."
|
| 38 |
+
docker rm "${CONTAINER_NAME}"
|
| 39 |
+
fi
|
| 40 |
+
|
| 41 |
+
# ---------------------------------------------------------------------------
|
| 42 |
+
# 2. Flush system page cache
|
| 43 |
+
# Frees unified LPDDR5X memory before vLLM starts.
|
| 44 |
+
# ---------------------------------------------------------------------------
|
| 45 |
+
echo "[INFO] Flushing system page cache..."
|
| 46 |
+
sync
|
| 47 |
+
if echo 3 | sudo -n tee /proc/sys/vm/drop_caches > /dev/null 2>&1; then
|
| 48 |
+
echo "[INFO] Page cache flushed."
|
| 49 |
+
else
|
| 50 |
+
echo "[WARN] No sudo access — skipping drop_caches (run manually: echo 3 | sudo tee /proc/sys/vm/drop_caches)."
|
| 51 |
+
fi
|
| 52 |
+
|
| 53 |
+
# ---------------------------------------------------------------------------
|
| 54 |
+
# 3. Ensure HF cache directory exists
|
| 55 |
+
# ---------------------------------------------------------------------------
|
| 56 |
+
mkdir -p "${HF_CACHE_DIR}"
|
| 57 |
+
|
| 58 |
+
# ---------------------------------------------------------------------------
|
| 59 |
+
# 4. Build optional HF_TOKEN env flag
|
| 60 |
+
# ---------------------------------------------------------------------------
|
| 61 |
+
HF_TOKEN_FLAG=""
|
| 62 |
+
if [[ -n "${HF_TOKEN}" ]]; then
|
| 63 |
+
HF_TOKEN_FLAG="-e HF_TOKEN=${HF_TOKEN}"
|
| 64 |
+
fi
|
| 65 |
+
|
| 66 |
+
# ---------------------------------------------------------------------------
|
| 67 |
+
# 5. Start the vLLM container
|
| 68 |
+
#
|
| 69 |
+
# Key decisions:
|
| 70 |
+
# vllm/vllm-openai:cu130-nightly
|
| 71 |
+
# Includes native qwen3_next support.
|
| 72 |
+
#
|
| 73 |
+
# No --reasoning-parser
|
| 74 |
+
# Qwen3 reasoning parser puts thinking in "reasoning" field and returns
|
| 75 |
+
# content=null when max_tokens is exhausted before thinking ends.
|
| 76 |
+
# Clients like Cline don't send chat_template_kwargs to disable thinking,
|
| 77 |
+
# so they receive content=null and fail. Without the parser, all output
|
| 78 |
+
# (including <think> blocks) goes into "content" — Cline works correctly.
|
| 79 |
+
#
|
| 80 |
+
# VLLM_NVFP4_GEMM_BACKEND=marlin
|
| 81 |
+
# SM12.1 (GB10) has no native CUTLASS FP4 kernel.
|
| 82 |
+
# Marlin handles NVFP4 W4A16 GEMM — required for correct operation.
|
| 83 |
+
#
|
| 84 |
+
# VLLM_USE_FLASHINFER_MOE_FP4=0
|
| 85 |
+
# FlashInfer MoE FP4 path is not supported on GB10 SM12.1.
|
| 86 |
+
#
|
| 87 |
+
# Quantization (auto-detected from config.json)
|
| 88 |
+
# quant_method: compressed-tensors, format: nvfp4-pack-quantized.
|
| 89 |
+
# vLLM reads this automatically — no --quantization flag needed.
|
| 90 |
+
# All Linear layers quantized to NVFP4 except: DeltaNet linear_attn,
|
| 91 |
+
# MoE router gates, shared_expert_gate, lm_head (all kept in BF16).
|
| 92 |
+
#
|
| 93 |
+
# --kv-cache-dtype fp8
|
| 94 |
+
# FP8 KV cache; the README reports saricles testing it at the full 262144-token context.
|
| 95 |
+
# Only applies to the 12 full-attention layers (not DeltaNet layers).
|
| 96 |
+
#
|
| 97 |
+
# --gpu-memory-utilization 0.90
# Unified memory (128 GB pool). 0.90 * 128 GB = 115 GB for vLLM.
# Weights: ~43 GB. KV cache: ~72 GB.
# Safe limit per saricles testing (0.93 is risky). Lower this when another
# model shares the pool (the earlier 0.55 setting was sized for such combination runs).
|
| 101 |
+
#
|
| 102 |
+
# --max-model-len ${MAX_MODEL_LEN} (default: 262144)
|
| 103 |
+
# Tested by the model author. Model supports up to 262144 natively
|
| 104 |
+
# (large rope_theta, no rope_scaling needed).
|
| 105 |
+
#
|
| 106 |
+
# --max-num-seqs 64
|
| 107 |
+
# Max concurrent requests (3B active params — latent MoE).
|
| 108 |
+
#
|
| 109 |
+
# --max-cudagraph-capture-size 128 (optional; not passed in the command below)
|
| 110 |
+
# Limits CUDA graph capture batch sizes for stability on GB10.
|
| 111 |
+
#
|
| 112 |
+
# Thinking mode (no --reasoning-parser is passed; see the note above)
|
| 113 |
+
# Qwen3-Next uses <think>...</think> chain-of-thought (same as Qwen3).
|
| 114 |
+
# Reasoning is ON by default; disable per-request:
|
| 115 |
+
# chat_template_kwargs={"enable_thinking": false}
|
| 116 |
+
#
|
| 117 |
+
# --host 0.0.0.0
|
| 118 |
+
# Required for LAN access (e.g. Cline running on a different machine).
|
| 119 |
+
# ---------------------------------------------------------------------------
|
| 120 |
+
echo "[INFO] Starting container '${CONTAINER_NAME}'..."
|
| 121 |
+
|
| 122 |
+
# shellcheck disable=SC2086
|
| 123 |
+
docker run -d \
|
| 124 |
+
--name "${CONTAINER_NAME}" \
|
| 125 |
+
--runtime=nvidia \
|
| 126 |
+
--gpus all \
|
| 127 |
+
-p 0.0.0.0:"${PORT}":"${PORT}" \
|
| 128 |
+
-v "${HF_CACHE_DIR}:/root/.cache/huggingface" \
|
| 129 |
+
--shm-size=16g \
|
| 130 |
+
-e VLLM_NVFP4_GEMM_BACKEND=marlin \
|
| 131 |
+
-e VLLM_TEST_FORCE_FP8_MARLIN=1 \
|
| 132 |
+
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
|
| 133 |
+
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
|
| 134 |
+
${HF_TOKEN_FLAG} \
|
| 135 |
+
"${IMAGE}" \
|
| 136 |
+
"${MODEL}" \
|
| 137 |
+
--dtype auto \
|
| 138 |
+
--gpu-memory-utilization 0.90 \
|
| 139 |
+
--kv-cache-dtype fp8 \
|
| 140 |
+
--max-model-len "${MAX_MODEL_LEN}" \
|
| 141 |
+
--max-num-seqs 64 \
|
| 142 |
+
--max-num-batched-tokens 8192 \
|
| 143 |
+
--attention-backend flashinfer \
|
| 144 |
+
--enable-prefix-caching \
|
| 145 |
+
--enable-chunked-prefill \
|
| 146 |
+
--enable-auto-tool-choice \
|
| 147 |
+
--tool-call-parser qwen3_coder \
|
| 148 |
+
--host 0.0.0.0 \
|
| 149 |
+
--port "${PORT}"
|
| 150 |
+
|
| 151 |
+
echo "[INFO] Container started (detached). Waiting for API to become ready..."
|
| 152 |
+
echo "[INFO] Follow logs: docker logs -f ${CONTAINER_NAME}"
|
| 153 |
+
echo ""
|
| 154 |
+
|
| 155 |
+
# ---------------------------------------------------------------------------
|
| 156 |
+
# 6. Wait for API readiness (up to 15 minutes — large model download)
|
| 157 |
+
# ---------------------------------------------------------------------------
|
| 158 |
+
HEALTH_URL="http://localhost:${PORT}/health"
|
| 159 |
+
MAX_WAIT=900
|
| 160 |
+
INTERVAL=10
|
| 161 |
+
elapsed=0
|
| 162 |
+
|
| 163 |
+
while true; do
|
| 164 |
+
if curl -sf "${HEALTH_URL}" > /dev/null 2>&1; then
|
| 165 |
+
echo ""
|
| 166 |
+
echo "[OK] vLLM API is ready!"
|
| 167 |
+
echo "[OK] OpenAI-compatible endpoint: http://0.0.0.0:${PORT}/v1"
|
| 168 |
+
echo "[OK] Cline configuration:"
|
| 169 |
+
echo " Base URL : http://<spark-ip>:${PORT}/v1"
|
| 170 |
+
echo " Model ID : ${MODEL}"
|
| 171 |
+
echo " API Key : dummy (any non-empty string)"
|
| 172 |
+
break
|
| 173 |
+
fi
|
| 174 |
+
|
| 175 |
+
if ! docker ps --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
|
| 176 |
+
echo ""
|
| 177 |
+
echo "[ERROR] Container crashed during model load!"
|
| 178 |
+
echo " Last logs:"
|
| 179 |
+
docker logs --tail 50 "${CONTAINER_NAME}" 2>/dev/null || true
|
| 180 |
+
exit 1
|
| 181 |
+
fi
|
| 182 |
+
|
| 183 |
+
if [[ ${elapsed} -ge ${MAX_WAIT} ]]; then
|
| 184 |
+
echo ""
|
| 185 |
+
echo "[ERROR] API did not respond within ${MAX_WAIT}s."
|
| 186 |
+
echo " Check logs: docker logs ${CONTAINER_NAME}"
|
| 187 |
+
exit 1
|
| 188 |
+
fi
|
| 189 |
+
|
| 190 |
+
printf "."
|
| 191 |
+
sleep "${INTERVAL}"
|
| 192 |
+
elapsed=$(( elapsed + INTERVAL ))
|
| 193 |
+
done
|
| 194 |
+
|
| 195 |
+
# ---------------------------------------------------------------------------
|
| 196 |
+
# 7. Print recent logs
|
| 197 |
+
# ---------------------------------------------------------------------------
|
| 198 |
+
echo ""
|
| 199 |
+
echo "[INFO] Recent container logs:"
|
| 200 |
+
docker logs --tail 20 "${CONTAINER_NAME}"
|
test-api.sh
ADDED
|
@@ -0,0 +1,149 @@
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# test-api.sh — smoke tests for the vLLM API
|
| 3 |
+
# Usage:
|
| 4 |
+
# bash test-api.sh # test localhost:8000
|
| 5 |
+
# bash test-api.sh 192.168.1.50 # test remote host
|
| 6 |
+
# bash test-api.sh 192.168.1.50 8080 # remote host, custom port
|
| 7 |
+
|
| 8 |
+
set -euo pipefail
|
| 9 |
+
|
| 10 |
+
HOST="${1:-localhost}"
|
| 11 |
+
PORT="${2:-8000}"
|
| 12 |
+
BASE_URL="http://${HOST}:${PORT}/v1"
|
| 13 |
+
|
| 14 |
+
if [[ -t 1 ]]; then
|
| 15 |
+
GREEN="\033[0;32m"; RED="\033[0;31m"; YELLOW="\033[0;33m"; NC="\033[0m"
|
| 16 |
+
else
|
| 17 |
+
GREEN=""; RED=""; YELLOW=""; NC=""
|
| 18 |
+
fi
|
| 19 |
+
|
| 20 |
+
ok() { echo -e "${GREEN}[OK]${NC} $*"; }
|
| 21 |
+
fail() { echo -e "${RED}[FAIL]${NC} $*"; }
|
| 22 |
+
info() { echo -e "${YELLOW}[INFO]${NC} $*"; }
|
| 23 |
+
|
| 24 |
+
echo "============================================================"
|
| 25 |
+
echo " vLLM API smoke tests — ${BASE_URL}"
|
| 26 |
+
echo "============================================================"
|
| 27 |
+
echo ""
|
| 28 |
+
|
| 29 |
+
# ---------------------------------------------------------------------------
|
| 30 |
+
# Test 1: Health endpoint
|
| 31 |
+
# ---------------------------------------------------------------------------
|
| 32 |
+
info "Test 1: /health"
|
| 33 |
+
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "${BASE_URL%/v1}/health")
|
| 34 |
+
if [[ "${HTTP_CODE}" == "200" ]]; then
|
| 35 |
+
ok "/health returned HTTP 200"
|
| 36 |
+
else
|
| 37 |
+
fail "/health returned HTTP ${HTTP_CODE} (server may still be loading)"
|
| 38 |
+
fi
|
| 39 |
+
echo ""
|
| 40 |
+
|
| 41 |
+
# ---------------------------------------------------------------------------
|
| 42 |
+
# Test 2: Model list
|
| 43 |
+
# ---------------------------------------------------------------------------
|
| 44 |
+
info "Test 2: GET /v1/models"
|
| 45 |
+
MODELS_RESPONSE=$(curl -s "${BASE_URL}/models")
|
| 46 |
+
echo "${MODELS_RESPONSE}" | python3 -m json.tool 2>/dev/null || echo "${MODELS_RESPONSE}"
|
| 47 |
+
|
| 48 |
+
MODEL_ID=$(echo "${MODELS_RESPONSE}" | python3 -c \
|
| 49 |
+
"import sys,json; data=json.load(sys.stdin); print(data['data'][0]['id'])" 2>/dev/null || echo "")
|
| 50 |
+
|
| 51 |
+
if [[ -n "${MODEL_ID}" ]]; then
|
| 52 |
+
ok "Model loaded: ${MODEL_ID}"
|
| 53 |
+
else
|
| 54 |
+
fail "Could not parse model list"
|
| 55 |
+
MODEL_ID="saricles/Qwen3-Coder-Next-NVFP4-GB10"
|
| 56 |
+
fi
|
| 57 |
+
echo ""
|
| 58 |
+
|
| 59 |
+
# ---------------------------------------------------------------------------
|
| 60 |
+
# Test 3: Chat completion (reasoning off)
|
| 61 |
+
# ---------------------------------------------------------------------------
|
| 62 |
+
info "Test 3: POST /v1/chat/completions (reasoning off)"
|
| 63 |
+
RESPONSE=$(curl -s \
|
| 64 |
+
-X POST "${BASE_URL}/chat/completions" \
|
| 65 |
+
-H "Content-Type: application/json" \
|
| 66 |
+
-d "{
|
| 67 |
+
\"model\": \"${MODEL_ID}\",
|
| 68 |
+
\"messages\": [{\"role\": \"user\", \"content\": \"Reply in one sentence: what is the capital of France?\"}],
|
| 69 |
+
\"max_tokens\": 60,
|
| 70 |
+
\"temperature\": 0.1,
|
| 71 |
+
\"chat_template_kwargs\": {\"enable_thinking\": false}
|
| 72 |
+
}")
|
| 73 |
+
|
| 74 |
+
CONTENT=$(echo "${RESPONSE}" | python3 -c \
|
| 75 |
+
"import sys,json; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])" 2>/dev/null || echo "")
|
| 76 |
+
|
| 77 |
+
if [[ -n "${CONTENT}" ]]; then
|
| 78 |
+
ok "Chat completion works."
|
| 79 |
+
echo " >> ${CONTENT}"
|
| 80 |
+
else
|
| 81 |
+
fail "No response"
|
| 82 |
+
echo "${RESPONSE}" | python3 -m json.tool 2>/dev/null || echo "${RESPONSE}"
|
| 83 |
+
fi
|
| 84 |
+
echo ""
|
| 85 |
+
|
| 86 |
+
# ---------------------------------------------------------------------------
|
| 87 |
+
# Test 4: Chat completion (reasoning on)
|
| 88 |
+
# ---------------------------------------------------------------------------
|
| 89 |
+
info "Test 4: POST /v1/chat/completions (reasoning on)"
|
| 90 |
+
RESPONSE=$(curl -s \
|
| 91 |
+
-X POST "${BASE_URL}/chat/completions" \
|
| 92 |
+
-H "Content-Type: application/json" \
|
| 93 |
+
-d "{
|
| 94 |
+
\"model\": \"${MODEL_ID}\",
|
| 95 |
+
\"messages\": [{\"role\": \"user\", \"content\": \"What is 17 * 23? Show your work.\"}],
|
| 96 |
+
\"max_tokens\": 1000,
|
| 97 |
+
\"temperature\": 0.1,
|
| 98 |
+
\"chat_template_kwargs\": {\"enable_thinking\": true}
|
| 99 |
+
}")
|
| 100 |
+
|
| 101 |
+
CONTENT=$(echo "${RESPONSE}" | python3 -c \
|
| 102 |
+
"import sys,json; r=json.load(sys.stdin); m=r['choices'][0]['message']; thinking=m.get('reasoning_content') or m.get('reasoning',''); print('thinking:', repr(thinking)[:80], '\nanswer:', m.get('content',''))" \
|
| 103 |
+
2>/dev/null || echo "")
|
| 104 |
+
|
| 105 |
+
if [[ -n "${CONTENT}" ]]; then
|
| 106 |
+
ok "Reasoning mode works."
|
| 107 |
+
echo "${CONTENT}"
|
| 108 |
+
else
|
| 109 |
+
fail "No response from reasoning mode"
|
| 110 |
+
fi
|
| 111 |
+
echo ""
|
| 112 |
+
|
| 113 |
+
# ---------------------------------------------------------------------------
|
| 114 |
+
# Test 5: Code generation
|
| 115 |
+
# ---------------------------------------------------------------------------
|
| 116 |
+
info "Test 5: Code generation"
|
| 117 |
+
RESPONSE=$(curl -s \
|
| 118 |
+
-X POST "${BASE_URL}/chat/completions" \
|
| 119 |
+
-H "Content-Type: application/json" \
|
| 120 |
+
-d "{
|
| 121 |
+
\"model\": \"${MODEL_ID}\",
|
| 122 |
+
\"messages\": [{\"role\": \"user\", \"content\": \"Write a Python function that returns the nth Fibonacci number using memoization.\"}],
|
| 123 |
+
\"max_tokens\": 300,
|
| 124 |
+
\"temperature\": 0.1,
|
| 125 |
+
\"chat_template_kwargs\": {\"enable_thinking\": false}
|
| 126 |
+
}")
|
| 127 |
+
|
| 128 |
+
CODE=$(echo "${RESPONSE}" | python3 -c \
|
| 129 |
+
"import sys,json; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])" 2>/dev/null || echo "")
|
| 130 |
+
|
| 131 |
+
if [[ -n "${CODE}" ]]; then
|
| 132 |
+
ok "Code generation works."
|
| 133 |
+
echo "${CODE}" | head -10
|
| 134 |
+
echo " ..."
|
| 135 |
+
else
|
| 136 |
+
fail "No code response"
|
| 137 |
+
fi
|
| 138 |
+
echo ""
|
| 139 |
+
|
| 140 |
+
# ---------------------------------------------------------------------------
|
| 141 |
+
# Summary
|
| 142 |
+
# ---------------------------------------------------------------------------
|
| 143 |
+
echo "============================================================"
|
| 144 |
+
echo " Cline configuration (OpenAI Compatible provider):"
|
| 145 |
+
echo ""
|
| 146 |
+
echo " Base URL : ${BASE_URL}"
|
| 147 |
+
echo " Model ID : ${MODEL_ID}"
|
| 148 |
+
echo " API Key : dummy (any non-empty string)"
|
| 149 |
+
echo "============================================================"
|