Commit 327350d by gdubicki · verified · 1 parent: 6f5e2e4

Switch to saricles/Qwen3-Coder-Next-NVFP4-GB10 with GB10-optimized settings

Files changed (4)
  1. README.md +152 -0
  2. docker-run.sh +100 -0
  3. start-qwen3-coder-next.sh +200 -0
  4. test-api.sh +149 -0
README.md ADDED
@@ -0,0 +1,152 @@
+ # saricles/Qwen3-Coder-Next-NVFP4-GB10 on DGX Spark (GB10)
+
+ Runs [`saricles/Qwen3-Coder-Next-NVFP4-GB10`](https://huggingface.co/saricles/Qwen3-Coder-Next-NVFP4-GB10) via vLLM with an OpenAI-compatible API endpoint.
+ Tested on DGX Spark (GB10 Blackwell, SM12.1, 128 GB unified LPDDR5X).
+
+ ## Model overview
+
+ | Property | Value |
+ |----------|-------|
+ | Architecture | `qwen3_next` (Hybrid DeltaNet linear attention + full attention + latent MoE) |
+ | Base model | `Qwen/Qwen3-Coder-Next` |
+ | Parameters | 80B total, **3B active** per token (512 experts, 10 active + 1 shared) |
+ | Layers | 48 (pattern: 3× DeltaNet linear → 1× full attention, repeating → 12 full attention) |
+ | Quantization | NVFP4 via `llmcompressor` + `compressed-tensors`; all 512 MoE experts calibrated |
+ | Kept in BF16 | `lm_head`, `embed_tokens`, `linear_attn` layers, `mlp.gate`, `mlp.shared_expert_gate` |
+ | KV cache | FP8 (only for the 12 full-attention layers) |
+ | Model size | ~45 GB (70% reduction from ~149 GB BF16) |
+ | Max context | 262,144 tokens (native; tested with FP8 KV cache by saricles) |
+ | Reasoning | Built-in chain-of-thought (`<think>` tags), ON by default |
+
+ ## Model features
+
+ | Feature | Support |
+ |---------|---------|
+ | Tool calling | ✅ Yes (`--tool-call-parser qwen3_coder`) |
+ | Reasoning / thinking mode | ✅ Yes (ON by default, toggleable via `enable_thinking`) |
+ | Languages | multilingual (code-focused) |
+
+ ## Performance
+
+ Measured on DGX Spark (GB10, SM12.1, 128 GB unified LPDDR5X):
+
+ | Metric | Value |
+ |--------|-------|
+ | Throughput | ~61 tok/s |
+ | Max context | 262,144 tokens |
+ | KV cache concurrency | **31.65×** at 262K tokens (DeltaNet has no KV cache) |
+
+ Comparison across models on GB10:
+
+ | Model | Active params | tok/s |
+ |-------|--------------|-------|
+ | Gemma-4-31B-IT-NVFP4 (dense) | 31B | ~7 |
+ | Qwen3-32B-NVFP4 (dense) | 32.8B | ~11 |
+ | Nemotron-3-Super-120B-A12B-NVFP4 | 12B | ~16 |
+ | **Qwen3-Coder-Next-NVFP4-GB10** | **3B** | **~61** |
+ | Nemotron-3-Nano-30B-A3B-NVFP4 | 3B | ~61 |
+
+ Qwen3-Coder-Next is 80B total but only 3B active per token, so it matches Nemotron-3-Nano's throughput while offering the full 262K context.
+
+ ## Quick start
+
+ ```bash
+ # Required: the model is gated on Hugging Face (accept the license at saricles/Qwen3-Coder-Next-NVFP4-GB10 first):
+ export HF_TOKEN=hf_xxxx
+
+ bash start-qwen3-coder-next.sh
+ ```
+
+ The script will:
+ 1. Stop and remove any existing `qwen3-coder-next-vllm` container
+ 2. Flush the system page cache (frees unified memory before vLLM starts)
+ 3. Start the container in detached mode
+ 4. Poll `http://localhost:8000/health` until the API is ready (~45 GB download on first run)
+
+ ## Test the API
+
+ ```bash
+ bash test-api.sh               # localhost
+ bash test-api.sh 192.168.x.x   # remote host
+ ```
+
+ ## Reasoning (chain-of-thought)
+
+ Reasoning is **ON by default**. Toggle per request:
+
+ ```bash
+ # Reasoning OFF
+ curl -s -X POST http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"model":"saricles/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":60,"chat_template_kwargs":{"enable_thinking":false}}'
+
+ # Reasoning ON (default)
+ curl -s -X POST http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"model":"saricles/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a binary search implementation in Python."}],"max_tokens":2000,"chat_template_kwargs":{"enable_thinking":true}}'
+ ```
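
The launcher in this commit deliberately omits `--reasoning-parser`, so any `<think>` block arrives inside `content` rather than in a separate field. A client that only wants the final answer could strip the block itself. The sketch below is one way to do that with `awk`; `strip_think` is a hypothetical helper name, not part of this repository or of vLLM, and it assumes the tags appear literally as `<think>` and `</think>`.

```shell
#!/usr/bin/env bash
# Sketch: remove <think>...</think> blocks from text read on stdin.
# strip_think is a hypothetical helper, not part of the repository.
strip_think() {
  awk '
    { buf = buf $0 "\n" }                      # slurp all input lines
    END {
      while ((s = index(buf, "<think>")) > 0) {
        rest = substr(buf, s + 7)              # text after "<think>" (7 chars)
        e = index(rest, "</think>")
        if (e == 0) {                          # unterminated block: drop the tail
          buf = substr(buf, 1, s - 1)
          break
        }
        buf = substr(buf, 1, s - 1) substr(rest, e + 8)   # skip "</think>" (8 chars)
      }
      sub(/^[ \t\n]+/, "", buf)                # trim leading whitespace left behind
      printf "%s", buf
    }'
}
```

It could be applied to the `content` string after extracting it from the JSON response. An unterminated block (for example when `max_tokens` runs out mid-thought) yields only the text before `<think>`, which may be empty.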
+
+ ## Cline configuration
+
+ 1. Open Cline settings (sidebar icon → gear icon)
+ 2. Fill in the fields:
+
+ | Field | Value |
+ |-------|-------|
+ | Provider | OpenAI Compatible |
+ | Base URL | `http://<spark-ip>:8000/v1` |
+ | Model ID | `saricles/Qwen3-Coder-Next-NVFP4-GB10` |
+ | API Key | `dummy` (any non-empty string) |
+
+ ## Files
+
+ | File | Purpose |
+ |------|---------|
+ | `start-qwen3-coder-next.sh` | Full launcher: stop, cache flush, docker run, health poll |
+ | `docker-run.sh` | Bare `docker run` command with comments, for reference |
+ | `test-api.sh` | curl smoke tests: health, model list, chat completion, reasoning, code generation |
+
+ ## How it works
+
+ The `vllm/vllm-openai:cu130-nightly` image includes native `qwen3_next` support.
+
+ **Architecture**: Hybrid of:
+ - **DeltaNet** selective linear attention layers (36 of 48 layers): subquadratic in sequence length, no KV cache
+ - **Full attention** layers (12 of 48): standard transformer attention with KV cache
+ - **Latent MoE** (80B total, 3B active per token; same throughput profile as Nemotron-3-Nano)
+
+ Quantization is via `compressed-tensors` (llmcompressor) with `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`,
+ so all 512 MoE experts are calibrated, not just sampled.
+ vLLM auto-detects quantization from `quantization_config` in `config.json`; no `--quantization` flag needed.
+
+ ### Key environment variables
+
+ | Variable | Reason |
+ |----------|--------|
+ | `VLLM_NVFP4_GEMM_BACKEND=marlin` | SM12.1 (GB10) has no native CUTLASS FP4 kernel; Marlin is 15% faster for 512 experts |
+ | `VLLM_TEST_FORCE_FP8_MARLIN=1` | Forces FP8 Marlin path on GB10 SM12.1 |
+ | `VLLM_USE_FLASHINFER_MOE_FP4=0` | FlashInfer MoE FP4 path not supported on GB10 SM12.1 |
+ | `VLLM_MARLIN_USE_ATOMIC_ADD=1` | GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1 |
+
+ ### Key flags
+
+ | Flag | Reason |
+ |------|--------|
+ | `--dtype auto` | BF16 for non-quantized layers (DeltaNet, router gates, lm_head) |
+ | `--kv-cache-dtype fp8` | FP8 KV cache; applies to 12 full-attention layers only |
+ | `--gpu-memory-utilization 0.90` | 0.90 × 128 GB = 115 GB; covers ~43 GB weights + ~72 GB KV cache (0.93 is risky) |
+ | `--max-model-len 262144` | Full native context; tested by saricles with FP8 KV cache |
+ | `--max-num-seqs 64` | Max concurrent requests |
+ | `--max-num-batched-tokens 8192` | Prevents OOM on long contexts |
+ | `--attention-backend flashinfer` | Required for FP8 KV cache + chunked prefill on GB10 |
+ | `--enable-prefix-caching` | Reuses KV cache for repeated prompt prefixes (system prompts) |
+ | `--enable-chunked-prefill` | Reduces memory spikes during long-prompt processing |
+ | `--tool-call-parser qwen3_coder` | OpenAI-compatible tool calling |
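
The memory split quoted for `--gpu-memory-utilization` is simple arithmetic: the pool size times the utilization fraction gives vLLM's budget, and subtracting the weights leaves roughly what remains for KV cache. The sketch below reproduces it from the README's own approximate figures; `kv_budget_gb` is a hypothetical helper name. It also treats everything left after the weights as KV cache, which is an upper bound, since activations and other runtime buffers take a share too.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the memory split quoted for --gpu-memory-utilization.
# Inputs are the README's approximate figures, not measurements.

# kv_budget_gb <total_gb> <utilization> <weights_gb>
# Prints "<vllm_budget_gb> <remaining_gb>", each rounded to one decimal place.
kv_budget_gb() {
  awk -v total="$1" -v util="$2" -v weights="$3" 'BEGIN {
    budget = total * util
    printf "%.1f %.1f\n", budget, budget - weights
  }'
}

kv_budget_gb 128 0.90 43   # the setting used in this commit
kv_budget_gb 128 0.93 43   # the setting the README calls risky
```

On a unified-memory machine the same 128 GB pool also serves the OS and the page cache, which is presumably why the higher fraction leaves too little headroom.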
+
+ ## Requirements
+
+ - Docker with `nvidia-container-toolkit`
+ - Image: `vllm/vllm-openai:cu130-nightly`
+ - HF token with access to `saricles/Qwen3-Coder-Next-NVFP4-GB10` (gated; accept the license first)
+ - Model weights cached locally (auto-downloaded on first run, ~45 GB):
+   `~/.cache/huggingface/hub/models--saricles--Qwen3-Coder-Next-NVFP4-GB10/`
docker-run.sh ADDED
@@ -0,0 +1,100 @@
+ #!/usr/bin/env bash
+ # docker-run.sh - bare docker run command (without start-qwen3-coder-next.sh lifecycle logic)
+ # Useful for manual testing or embedding in other scripts.
+ #
+ # Usage: bash docker-run.sh
+ #
+ # Environment variables:
+ #   HF_TOKEN - Hugging Face token (needed here: this model is gated)
+ #   HF_CACHE - local weight cache path (default: ~/.cache/huggingface)
+
+ set -euo pipefail
+
+ HF_CACHE="${HF_CACHE:-${HOME}/.cache/huggingface}"
+ mkdir -p "${HF_CACHE}"
+
+ docker run \
+   --name qwen3-coder-next-vllm \
+   --rm \
+   --runtime=nvidia \
+   --gpus all \
+   -p 0.0.0.0:8000:8000 \
+   -v "${HF_CACHE}:/root/.cache/huggingface" \
+   --shm-size=32g \
+   -e VLLM_NVFP4_GEMM_BACKEND=marlin \
+   -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
+   -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
+   -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
+   ${HF_TOKEN:+-e HF_TOKEN="${HF_TOKEN}"} \
+   vllm/vllm-openai:cu130-nightly \
+   saricles/Qwen3-Coder-Next-NVFP4-GB10 \
+   --dtype auto \
+   --gpu-memory-utilization 0.90 \
+   --kv-cache-dtype fp8 \
+   --max-model-len 262144 \
+   --max-num-seqs 64 \
+   --max-num-batched-tokens 8192 \
+   --attention-backend flashinfer \
+   --enable-prefix-caching \
+   --enable-chunked-prefill \
+   --enable-auto-tool-choice \
+   --tool-call-parser qwen3_coder \
+   --host 0.0.0.0 \
+   --port 8000
+
+ # ---------------------------------------------------------------------------
+ # Flag reference:
+ #
+ # vllm/vllm-openai:cu130-nightly
+ #   Native qwen3_next support (vLLM 0.19+).
+ #
+ # VLLM_NVFP4_GEMM_BACKEND=marlin
+ #   SM12.1 (GB10) has no native CUTLASS FP4 kernel.
+ #   Marlin handles NVFP4 W4A16 GEMM; 15% faster than CUTLASS for 512 experts.
+ #
+ # VLLM_TEST_FORCE_FP8_MARLIN=1
+ #   Forces FP8 Marlin path on GB10 SM12.1.
+ #
+ # VLLM_USE_FLASHINFER_MOE_FP4=0
+ #   FlashInfer MoE FP4 path not supported on GB10 SM12.1.
+ #
+ # VLLM_MARLIN_USE_ATOMIC_ADD=1
+ #   GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1.
+ #
+ # No --quantization flag
+ #   compressed-tensors format is auto-detected from config.json.
+ #
+ # --dtype auto
+ #   BF16 for non-quantized layers (DeltaNet linear_attn, router gates, lm_head).
+ #
+ # --gpu-memory-utilization 0.90
+ #   0.90 × 128 GB = 115 GB for vLLM. Weights: ~43 GB. KV cache: ~72 GB.
+ #   Safe limit per saricles testing (0.93 is risky).
+ #
+ # --kv-cache-dtype fp8
+ #   FP8 KV cache. Only applies to the 12 full-attention layers (not DeltaNet).
+ #
+ # --max-model-len 262144
+ #   Full native context. Tested with FP8 KV cache by saricles.
+ #
+ # --max-num-seqs 64
+ #   Max concurrent requests.
+ #
+ # --max-num-batched-tokens 8192
+ #   Limits tokens per batch; prevents OOM on long contexts.
+ #
+ # --attention-backend flashinfer
+ #   Required for FP8 KV cache + chunked prefill on GB10.
+ #
+ # --enable-prefix-caching
+ #   Reuses KV cache for repeated prompt prefixes (system prompts, etc.).
+ #
+ # --enable-chunked-prefill
+ #   Reduces memory spikes during long-prompt processing.
+ #
+ # --enable-auto-tool-choice --tool-call-parser qwen3_coder
+ #   Enables OpenAI-compatible tool calling for this model.
+ #
+ # --host 0.0.0.0 --port 8000
+ #   OpenAI-compatible REST API, reachable from LAN.
+ # ---------------------------------------------------------------------------
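
Both scripts pass the optional `HF_TOKEN` flag through an unquoted expansion, which is why the launcher needs a `shellcheck disable=SC2086`. A bash array is a common alternative that keeps each argument intact even if a value contains spaces. The sketch below shows the idea for the `-e` flags only; `build_env_args` and `ENV_ARGS` are hypothetical names, not part of this commit.

```shell
#!/usr/bin/env bash
# Sketch: collect docker -e flags in an array instead of an unquoted string.
# build_env_args and ENV_ARGS are hypothetical names, not part of the repository.

# Fills the global array ENV_ARGS with the -e flags used in this commit,
# adding HF_TOKEN only when the variable is set and non-empty.
build_env_args() {
  ENV_ARGS=(
    -e VLLM_NVFP4_GEMM_BACKEND=marlin
    -e VLLM_TEST_FORCE_FP8_MARLIN=1
    -e VLLM_USE_FLASHINFER_MOE_FP4=0
    -e VLLM_MARLIN_USE_ATOMIC_ADD=1
  )
  if [[ -n "${HF_TOKEN:-}" ]]; then
    ENV_ARGS+=(-e "HF_TOKEN=${HF_TOKEN}")
  fi
}

# Usage (sketch):
#   build_env_args
#   docker run ... "${ENV_ARGS[@]}" vllm/vllm-openai:cu130-nightly ...
```

Expanding with `"${ENV_ARGS[@]}"` yields one word per element, so no word splitting is involved and the shellcheck suppression would no longer be needed.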
start-qwen3-coder-next.sh ADDED
@@ -0,0 +1,200 @@
+ #!/usr/bin/env bash
+ # start-qwen3-coder-next.sh - launch saricles/Qwen3-Coder-Next-NVFP4-GB10 via vLLM on DGX Spark (GB10)
+ # Requirements: Docker with nvidia-container-toolkit, vllm/vllm-openai:cu130-nightly
+ #
+ # Usage:
+ #   bash start-qwen3-coder-next.sh
+ #   HF_TOKEN=hf_xxxx bash start-qwen3-coder-next.sh
+ #
+ # Environment variables:
+ #   HF_TOKEN      - HF token (required when the model is gated on huggingface.co)
+ #   HF_CACHE_DIR  - local weight cache directory (default: ~/.cache/huggingface)
+ #   MAX_MODEL_LEN - context length (default: 262144, the model's native maximum)
+
+ set -euo pipefail
+
+ # ---------------------------------------------------------------------------
+ # Configuration
+ # ---------------------------------------------------------------------------
+ CONTAINER_NAME="qwen3-coder-next-vllm"
+ IMAGE="vllm/vllm-openai:cu130-nightly"
+ MODEL="saricles/Qwen3-Coder-Next-NVFP4-GB10"
+ PORT=8000
+ MAX_MODEL_LEN="${MAX_MODEL_LEN:-262144}"
+
+ HF_CACHE_DIR="${HF_CACHE_DIR:-${HOME}/.cache/huggingface}"
+ HF_TOKEN="${HF_TOKEN:-}"
+
+ # ---------------------------------------------------------------------------
+ # 1. Stop existing container (if running or stopped)
+ # ---------------------------------------------------------------------------
+ if docker ps --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
+   echo "[INFO] Container '${CONTAINER_NAME}' is running; stopping..."
+   docker stop "${CONTAINER_NAME}"
+   docker rm "${CONTAINER_NAME}"
+   echo "[INFO] Container stopped and removed."
+ elif docker ps -a --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
+   echo "[INFO] Container '${CONTAINER_NAME}' exists (stopped); removing..."
+   docker rm "${CONTAINER_NAME}"
+ fi
+
+ # ---------------------------------------------------------------------------
+ # 2. Flush system page cache
+ #    Frees unified LPDDR5X memory before vLLM starts.
+ # ---------------------------------------------------------------------------
+ echo "[INFO] Flushing system page cache..."
+ sync
+ if echo 3 | sudo -n tee /proc/sys/vm/drop_caches > /dev/null 2>&1; then
+   echo "[INFO] Page cache flushed."
+ else
+   echo "[WARN] No sudo access; skipping drop_caches (run manually: echo 3 | sudo tee /proc/sys/vm/drop_caches)."
+ fi
+
+ # ---------------------------------------------------------------------------
+ # 3. Ensure HF cache directory exists
+ # ---------------------------------------------------------------------------
+ mkdir -p "${HF_CACHE_DIR}"
+
+ # ---------------------------------------------------------------------------
+ # 4. Build optional HF_TOKEN env flag
+ # ---------------------------------------------------------------------------
+ HF_TOKEN_FLAG=""
+ if [[ -n "${HF_TOKEN}" ]]; then
+   HF_TOKEN_FLAG="-e HF_TOKEN=${HF_TOKEN}"
+ fi
+
+ # ---------------------------------------------------------------------------
+ # 5. Start the vLLM container
+ #
+ # Key decisions:
+ #   vllm/vllm-openai:cu130-nightly
+ #     Includes native qwen3_next support.
+ #
+ #   No --reasoning-parser
+ #     The Qwen3 reasoning parser puts thinking in the "reasoning" field and
+ #     returns content=null when max_tokens is exhausted before thinking ends.
+ #     Clients like Cline don't send chat_template_kwargs to disable thinking,
+ #     so they receive content=null and fail. Without the parser, all output
+ #     (including <think> blocks) goes into "content" and Cline works correctly.
+ #     Reasoning is ON by default; disable per request with:
+ #       chat_template_kwargs={"enable_thinking": false}
+ #
+ #   VLLM_NVFP4_GEMM_BACKEND=marlin
+ #     SM12.1 (GB10) has no native CUTLASS FP4 kernel.
+ #     Marlin handles NVFP4 W4A16 GEMM; required for correct operation.
+ #
+ #   VLLM_TEST_FORCE_FP8_MARLIN=1
+ #     Forces FP8 Marlin path on GB10 SM12.1.
+ #
+ #   VLLM_USE_FLASHINFER_MOE_FP4=0
+ #     FlashInfer MoE FP4 path is not supported on GB10 SM12.1.
+ #
+ #   VLLM_MARLIN_USE_ATOMIC_ADD=1
+ #     GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1.
+ #
+ #   Quantization (auto-detected from config.json)
+ #     quant_method: compressed-tensors, format: nvfp4-pack-quantized.
+ #     vLLM reads this automatically; no --quantization flag needed.
+ #     All Linear layers quantized to NVFP4 except: DeltaNet linear_attn,
+ #     MoE router gates, shared_expert_gate, lm_head (all kept in BF16).
+ #
+ #   --kv-cache-dtype fp8
+ #     FP8 KV cache; tested by saricles at the full 262144-token context.
+ #     Only applies to the 12 full-attention layers (not DeltaNet layers).
+ #
+ #   --gpu-memory-utilization 0.90
+ #     Unified memory (128 GB pool). 0.90 * 128 GB = 115 GB for vLLM.
+ #     Weights: ~43 GB. KV cache: ~72 GB. 0.93 is risky per saricles testing.
+ #
+ #   --max-model-len ${MAX_MODEL_LEN} (default: 262144)
+ #     Full native context (large rope_theta, no rope_scaling needed).
+ #
+ #   --max-num-seqs 64
+ #     Max concurrent requests (3B active params, latent MoE).
+ #
+ #   --max-num-batched-tokens 8192
+ #     Limits tokens per batch; prevents OOM on long contexts.
+ #
+ #   --attention-backend flashinfer
+ #     Required for FP8 KV cache + chunked prefill on GB10.
+ #
+ #   --host 0.0.0.0
+ #     Required for LAN access (e.g. Cline running on a different machine).
+ # ---------------------------------------------------------------------------
+ echo "[INFO] Starting container '${CONTAINER_NAME}'..."
+
+ # shellcheck disable=SC2086
+ docker run -d \
+   --name "${CONTAINER_NAME}" \
+   --runtime=nvidia \
+   --gpus all \
+   -p 0.0.0.0:"${PORT}":"${PORT}" \
+   -v "${HF_CACHE_DIR}:/root/.cache/huggingface" \
+   --shm-size=16g \
+   -e VLLM_NVFP4_GEMM_BACKEND=marlin \
+   -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
+   -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
+   -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
+   ${HF_TOKEN_FLAG} \
+   "${IMAGE}" \
+   "${MODEL}" \
+   --dtype auto \
+   --gpu-memory-utilization 0.90 \
+   --kv-cache-dtype fp8 \
+   --max-model-len "${MAX_MODEL_LEN}" \
+   --max-num-seqs 64 \
+   --max-num-batched-tokens 8192 \
+   --attention-backend flashinfer \
+   --enable-prefix-caching \
+   --enable-chunked-prefill \
+   --enable-auto-tool-choice \
+   --tool-call-parser qwen3_coder \
+   --host 0.0.0.0 \
+   --port "${PORT}"
+
+ echo "[INFO] Container started (detached). Waiting for API to become ready..."
+ echo "[INFO] Follow logs: docker logs -f ${CONTAINER_NAME}"
+ echo ""
+
+ # ---------------------------------------------------------------------------
+ # 6. Wait for API readiness (up to 15 minutes; large model download)
+ # ---------------------------------------------------------------------------
+ HEALTH_URL="http://localhost:${PORT}/health"
+ MAX_WAIT=900
+ INTERVAL=10
+ elapsed=0
+
+ while true; do
+   if curl -sf "${HEALTH_URL}" > /dev/null 2>&1; then
+     echo ""
+     echo "[OK] vLLM API is ready!"
+     echo "[OK] OpenAI-compatible endpoint: http://0.0.0.0:${PORT}/v1"
+     echo "[OK] Cline configuration:"
+     echo "     Base URL : http://<spark-ip>:${PORT}/v1"
+     echo "     Model ID : ${MODEL}"
+     echo "     API Key  : dummy (any non-empty string)"
+     break
+   fi
+
+   if ! docker ps --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
+     echo ""
+     echo "[ERROR] Container crashed during model load!"
+     echo "        Last logs:"
+     docker logs --tail 50 "${CONTAINER_NAME}" 2>/dev/null || true
+     exit 1
+   fi
+
+   if [[ ${elapsed} -ge ${MAX_WAIT} ]]; then
+     echo ""
+     echo "[ERROR] API did not respond within ${MAX_WAIT}s."
+     echo "        Check logs: docker logs ${CONTAINER_NAME}"
+     exit 1
+   fi
+
+   printf "."
+   sleep "${INTERVAL}"
+   elapsed=$(( elapsed + INTERVAL ))
+ done
+
+ # ---------------------------------------------------------------------------
+ # 7. Print recent logs
+ # ---------------------------------------------------------------------------
+ echo ""
+ echo "[INFO] Recent container logs:"
+ docker logs --tail 20 "${CONTAINER_NAME}"
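
The readiness loop in step 6 follows a common pattern: retry a check at a fixed interval and give up after a deadline. The sketch below factors that pattern into a helper that could be reused for other checks. `wait_until` is a hypothetical name. Unlike the launcher's loop, this version does not check whether the container is still running, so a crashed container would only be noticed at the timeout.

```shell
#!/usr/bin/env bash
# Sketch: a reusable "poll until ready or give up" helper.
# wait_until is a hypothetical name; the commit inlines this logic instead.

# wait_until <max_wait_s> <interval_s> <command...>
# Re-runs <command...> until it succeeds; returns 1 once <max_wait_s> has elapsed.
wait_until() {
  local max_wait="$1" interval="$2" elapsed=0
  shift 2
  until "$@" > /dev/null 2>&1; do
    if (( elapsed >= max_wait )); then
      return 1
    fi
    sleep "${interval}"
    elapsed=$(( elapsed + interval ))
  done
  return 0
}

# Usage (sketch, mirroring the launcher's 900 s / 10 s settings):
#   wait_until 900 10 curl -sf "http://localhost:8000/health" || echo "not ready"
```

The elapsed time is tracked by summing sleep intervals, as in the original loop, so time spent inside the check itself is not counted against the deadline.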
test-api.sh ADDED
@@ -0,0 +1,149 @@
+ #!/usr/bin/env bash
+ # test-api.sh - smoke tests for the vLLM API
+ # Usage:
+ #   bash test-api.sh                    # test localhost:8000
+ #   bash test-api.sh 192.168.1.50       # test remote host
+ #   bash test-api.sh 192.168.1.50 8080  # remote host, custom port
+
+ set -euo pipefail
+
+ HOST="${1:-localhost}"
+ PORT="${2:-8000}"
+ BASE_URL="http://${HOST}:${PORT}/v1"
+
+ if [[ -t 1 ]]; then
+   GREEN="\033[0;32m"; RED="\033[0;31m"; YELLOW="\033[0;33m"; NC="\033[0m"
+ else
+   GREEN=""; RED=""; YELLOW=""; NC=""
+ fi
+
+ ok()   { echo -e "${GREEN}[OK]${NC} $*"; }
+ fail() { echo -e "${RED}[FAIL]${NC} $*"; }
+ info() { echo -e "${YELLOW}[INFO]${NC} $*"; }
+
+ echo "============================================================"
+ echo " vLLM API smoke tests - ${BASE_URL}"
+ echo "============================================================"
+ echo ""
+
+ # ---------------------------------------------------------------------------
+ # Test 1: Health endpoint
+ # ---------------------------------------------------------------------------
+ info "Test 1: /health"
+ # "|| true" keeps set -e from aborting the script when the server is unreachable;
+ # curl still prints 000 as the status code in that case.
+ HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "${BASE_URL%/v1}/health" || true)
+ if [[ "${HTTP_CODE}" == "200" ]]; then
+   ok "/health returned HTTP 200"
+ else
+   fail "/health returned HTTP ${HTTP_CODE} (server may still be loading)"
+ fi
+ echo ""
+
+ # ---------------------------------------------------------------------------
+ # Test 2: Model list
+ # ---------------------------------------------------------------------------
+ info "Test 2: GET /v1/models"
+ MODELS_RESPONSE=$(curl -s "${BASE_URL}/models" || true)
+ echo "${MODELS_RESPONSE}" | python3 -m json.tool 2>/dev/null || echo "${MODELS_RESPONSE}"
+
+ MODEL_ID=$(echo "${MODELS_RESPONSE}" | python3 -c \
+   "import sys,json; data=json.load(sys.stdin); print(data['data'][0]['id'])" 2>/dev/null || echo "")
+
+ if [[ -n "${MODEL_ID}" ]]; then
+   ok "Model loaded: ${MODEL_ID}"
+ else
+   fail "Could not parse model list"
+   MODEL_ID="saricles/Qwen3-Coder-Next-NVFP4-GB10"
+ fi
+ echo ""
+
+ # ---------------------------------------------------------------------------
+ # Test 3: Chat completion (reasoning off)
+ # ---------------------------------------------------------------------------
+ info "Test 3: POST /v1/chat/completions (reasoning off)"
+ RESPONSE=$(curl -s \
+   -X POST "${BASE_URL}/chat/completions" \
+   -H "Content-Type: application/json" \
+   -d "{
+     \"model\": \"${MODEL_ID}\",
+     \"messages\": [{\"role\": \"user\", \"content\": \"Reply in one sentence: what is the capital of France?\"}],
+     \"max_tokens\": 60,
+     \"temperature\": 0.1,
+     \"chat_template_kwargs\": {\"enable_thinking\": false}
+   }" || true)
+
+ CONTENT=$(echo "${RESPONSE}" | python3 -c \
+   "import sys,json; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])" 2>/dev/null || echo "")
+
+ if [[ -n "${CONTENT}" ]]; then
+   ok "Chat completion works."
+   echo " >> ${CONTENT}"
+ else
+   fail "No response"
+   echo "${RESPONSE}" | python3 -m json.tool 2>/dev/null || echo "${RESPONSE}"
+ fi
+ echo ""
+
+ # ---------------------------------------------------------------------------
+ # Test 4: Chat completion (reasoning on)
+ # ---------------------------------------------------------------------------
+ info "Test 4: POST /v1/chat/completions (reasoning on)"
+ RESPONSE=$(curl -s \
+   -X POST "${BASE_URL}/chat/completions" \
+   -H "Content-Type: application/json" \
+   -d "{
+     \"model\": \"${MODEL_ID}\",
+     \"messages\": [{\"role\": \"user\", \"content\": \"What is 17 * 23? Show your work.\"}],
+     \"max_tokens\": 1000,
+     \"temperature\": 0.1,
+     \"chat_template_kwargs\": {\"enable_thinking\": true}
+   }" || true)
+
+ CONTENT=$(echo "${RESPONSE}" | python3 -c \
+   "import sys,json; r=json.load(sys.stdin); m=r['choices'][0]['message']; thinking=m.get('reasoning_content') or m.get('reasoning',''); print('thinking:', repr(thinking)[:80], '\nanswer:', m.get('content',''))" \
+   2>/dev/null || echo "")
+
+ if [[ -n "${CONTENT}" ]]; then
+   ok "Reasoning mode works."
+   echo "${CONTENT}"
+ else
+   fail "No response from reasoning mode"
+ fi
+ echo ""
+
+ # ---------------------------------------------------------------------------
+ # Test 5: Code generation
+ # ---------------------------------------------------------------------------
+ info "Test 5: Code generation"
+ RESPONSE=$(curl -s \
+   -X POST "${BASE_URL}/chat/completions" \
+   -H "Content-Type: application/json" \
+   -d "{
+     \"model\": \"${MODEL_ID}\",
+     \"messages\": [{\"role\": \"user\", \"content\": \"Write a Python function that returns the nth Fibonacci number using memoization.\"}],
+     \"max_tokens\": 300,
+     \"temperature\": 0.1,
+     \"chat_template_kwargs\": {\"enable_thinking\": false}
+   }" || true)
+
+ CODE=$(echo "${RESPONSE}" | python3 -c \
+   "import sys,json; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])" 2>/dev/null || echo "")
+
+ if [[ -n "${CODE}" ]]; then
+   ok "Code generation works."
+   echo "${CODE}" | head -10
+   echo " ..."
+ else
+   fail "No code response"
+ fi
+ echo ""
+
+ # ---------------------------------------------------------------------------
+ # Summary
+ # ---------------------------------------------------------------------------
+ echo "============================================================"
+ echo " Cline configuration (OpenAI Compatible provider):"
+ echo ""
+ echo " Base URL : ${BASE_URL}"
+ echo " Model ID : ${MODEL_ID}"
+ echo " API Key  : none (any non-empty string)"
+ echo "============================================================"
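
The smoke tests build their JSON payloads by interpolating shell variables into a double-quoted string. That works for the fixed prompts above, but a prompt or model ID containing a quote or backslash would produce invalid JSON. The sketch below shows one hedged way to escape a value first, using only bash parameter expansion. `json_escape` is a hypothetical helper, and it covers only backslash, double quote, newline, carriage return and tab, not every control character.

```shell
#!/usr/bin/env bash
# Sketch: escape a string before embedding it in a hand-built JSON payload.
# json_escape is a hypothetical helper, not part of the repository.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}       # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}       # double quote
  s=${s//$'\n'/\\n}     # newline
  s=${s//$'\r'/\\r}     # carriage return
  s=${s//$'\t'/\\t}     # tab
  printf '%s' "$s"
}

# Example: a prompt containing quotes and a backslash.
prompt='Print "hello" and a\b path'
printf '{"role": "user", "content": "%s"}\n' "$(json_escape "$prompt")"
```

For arbitrary input, generating the payload with a JSON-aware tool (for example `python3`, which the script already depends on) would be more robust than escaping by hand.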