Switch to saricles/Qwen3-Coder-Next-NVFP4-GB10 with GB10-optimized settings

Files changed (4)

README.md +152 -0
docker-run.sh +100 -0
start-qwen3-coder-next.sh +200 -0
test-api.sh +149 -0

README.md ADDED

	@@ -0,0 +1,152 @@

+# saricles/Qwen3-Coder-Next-NVFP4-GB10 on DGX Spark (GB10)
+Runs [`saricles/Qwen3-Coder-Next-NVFP4-GB10`](https://huggingface.co/saricles/Qwen3-Coder-Next-NVFP4-GB10) via vLLM with an OpenAI-compatible API endpoint.
+Tested on DGX Spark (GB10 Blackwell, SM12.1, 128 GB unified LPDDR5X).
+## Model overview
+| Property | Value |
+|----------|-------|
+| Architecture | `qwen3_next` (Hybrid DeltaNet linear attention + full attention + latent MoE) |
+| Base model | `Qwen/Qwen3-Coder-Next` |
+| Parameters | 80B total, **3B active** per token (512 experts, 10 active + 1 shared) |
+| Layers | 48 (pattern: 3× DeltaNet linear → 1× full attention, repeating → 12 full attention) |
+| Quantization | NVFP4 via `llmcompressor` + `compressed-tensors`; all 512 MoE experts calibrated |
+| Kept in BF16 | `lm_head`, `embed_tokens`, `linear_attn` layers, `mlp.gate`, `mlp.shared_expert_gate` |
+| KV cache | FP8 (only for the 12 full-attention layers) |
+| Model size | ~45 GB (70% reduction from ~149 GB BF16) |
+| Max context | 262,144 tokens (native; tested with FP8 KV cache by saricles) |
+| Reasoning | Built-in chain-of-thought (`<think>` tags), ON by default |
+## Model features
+| Feature | Support |
+|---------|---------|
+| Tool calling | ✅ Yes (`--tool-call-parser qwen3_coder`) |
+| Reasoning / thinking mode | ✅ Yes (ON by default, toggleable via `enable_thinking`) |
+| Languages | multilingual (code-focused) |
+## Performance
+Measured on DGX Spark (GB10, SM12.1, 128 GB unified LPDDR5X):
+| Metric | Value |
+|--------|-------|
+| Throughput | ~61 tok/s |
+| Max context | 262,144 tokens |
+| KV cache concurrency | **31.65×** at 262K tokens (DeltaNet has no KV cache) |
+Comparison across models on GB10:
+| Model | Active params | tok/s |
+|-------|--------------|-------|
+| Gemma-4-31B-IT-NVFP4 (dense) | 31B | ~7 |
+| Qwen3-32B-NVFP4 (dense) | 32.8B | ~11 |
+| Nemotron-3-Super-120B-A12B-NVFP4 | 12B | ~16 |
+| **Qwen3-Coder-Next-NVFP4-GB10** | **3B** | **~61** |
+| Nemotron-3-Nano-30B-A3B-NVFP4 | 3B | ~61 |
+Qwen3-Coder-Next is 80B total but 3B active — same throughput as Nemotron-3-Nano with full 262K context.
+## Quick start
+```bash
+# Required — model is gated on Hugging Face (accept license at saricles/Qwen3-Coder-Next-NVFP4-GB10 first):
+export HF_TOKEN=hf_xxxx
+bash start-qwen3-coder-next.sh
+```
+The script will:
+1. Stop and remove any existing `qwen3-coder-next-vllm` container
+2. Flush the system page cache (frees unified memory before vLLM starts)
+3. Start the container in detached mode
+4. Poll `http://localhost:8000/health` until the API is ready (~45 GB download on first run)
+## Test the API
+```bash
+bash test-api.sh              # localhost
+bash test-api.sh 192.168.x.x  # remote host
+```
+## Reasoning (chain-of-thought)
+Reasoning is **ON by default**. Toggle per request:
+```bash
+# Reasoning OFF
+curl -s -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"saricles/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":60,"chat_template_kwargs":{"enable_thinking":false}}'
+# Reasoning ON (default)
+curl -s -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"saricles/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a binary search implementation in Python."}],"max_tokens":2000,"chat_template_kwargs":{"enable_thinking":true}}'
+```
+## Cline configuration
+1. Open Cline settings (sidebar icon → gear icon)
+2. Fill in the fields:
+| Field | Value |
+|-------|-------|
+| Provider | OpenAI Compatible |
+| Base URL | `http://<spark-ip>:8000/v1` |
+| Model ID | `saricles/Qwen3-Coder-Next-NVFP4-GB10` |
+| API Key | `dummy` (any non-empty string) |
+## Files
+| File | Purpose |
+|------|---------|
+| `start-qwen3-coder-next.sh` | Full launcher: stop, cache flush, docker run, health poll |
+| `docker-run.sh` | Bare `docker run` command with comments, for reference |
+| `test-api.sh` | curl smoke tests: health, model list, chat completion, reasoning, code generation |
+## How it works
+The `vllm/vllm-openai:cu130-nightly` image includes native `qwen3_next` support.
+**Architecture**: Hybrid of:
+- **DeltaNet** selective linear attention layers (36 of 48 layers) — subquadratic in sequence length, no KV cache
+- **Full attention** layers (12 of 48) — standard transformer attention with KV cache
+- **Latent MoE** (80B total, 3B active per token; same throughput profile as Nemotron-3-Nano)
+Quantization is via `compressed-tensors` (llmcompressor) with `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1`
+— all 512 MoE experts are calibrated, not just sampled.
+vLLM auto-detects quantization from `quantization_config` in `config.json`; no `--quantization` flag needed.
+### Key environment variables
+| Variable | Reason |
+|----------|--------|
+| `VLLM_NVFP4_GEMM_BACKEND=marlin` | SM12.1 (GB10) has no native CUTLASS FP4 kernel; Marlin is 15% faster than CUTLASS for 512 experts |
+| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Forces FP8 Marlin path on GB10 SM12.1 |
+| `VLLM_USE_FLASHINFER_MOE_FP4=0` | FlashInfer MoE FP4 path not supported on GB10 SM12.1 |
+| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1 |
+### Key flags
+| Flag | Reason |
+|------|--------|
+| `--dtype auto` | BF16 for non-quantized layers (DeltaNet, router gates, lm_head) |
+| `--kv-cache-dtype fp8` | FP8 KV cache; applies to 12 full-attention layers only |
+| `--gpu-memory-utilization 0.90` | 0.90 × 128 GB = 115 GB; covers ~43 GB weights + ~72 GB KV cache (0.93 is risky) |
+| `--max-model-len 262144` | Full native context; tested by saricles with FP8 KV cache |
+| `--max-num-seqs 64` | Max concurrent requests |
+| `--max-num-batched-tokens 8192` | Prevents OOM on long contexts |
+| `--attention-backend flashinfer` | Required for FP8 KV cache + chunked prefill on GB10 |
+| `--enable-prefix-caching` | Reuses KV cache for repeated prompt prefixes (system prompts) |
+| `--enable-chunked-prefill` | Reduces memory spikes during long-prompt processing |
+| `--tool-call-parser qwen3_coder` | OpenAI-compatible tool calling |
+## Requirements
+- Docker with `nvidia-container-toolkit`
+- Image: `vllm/vllm-openai:cu130-nightly`
+- HF token with access to `saricles/Qwen3-Coder-Next-NVFP4-GB10` (gated — accept license first)
+- Model weights cached locally (auto-downloaded on first run, ~45 GB):
+  `~/.cache/huggingface/hub/models--saricles--Qwen3-Coder-Next-NVFP4-GB10/`
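
The memory budget quoted in the README's key-flags table (0.90 × 128 GB ≈ 115 GB, of which ~43 GB is weights) can be checked with integer shell arithmetic. This is a minimal sketch: `vllm_budget_gb` and `kv_headroom_gb` are hypothetical helpers, not part of this change, and results are truncated to whole gigabytes.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the PR) for sanity-checking
# --gpu-memory-utilization against the unified memory pool.

# vllm_budget_gb TOTAL_GB UTIL_PERCENT
# Prints the memory vLLM may claim, truncated to whole GB.
vllm_budget_gb() {
    local total_gb="$1" util_pct="$2"
    echo $(( total_gb * util_pct / 100 ))
}

# kv_headroom_gb TOTAL_GB UTIL_PERCENT WEIGHTS_GB
# Prints what is left for the KV cache after loading the weights.
kv_headroom_gb() {
    local budget
    budget="$(vllm_budget_gb "$1" "$2")"
    echo $(( budget - $3 ))
}

echo "budget:   $(vllm_budget_gb 128 90) GB"
echo "KV cache: $(kv_headroom_gb 128 90 43) GB"
```

With the README's figures this prints 115 GB and 72 GB, matching the quoted split. The calculation ignores activation and other runtime overhead, so the KV cache figure is best read as an upper bound.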

docker-run.sh ADDED

	@@ -0,0 +1,100 @@

+#!/usr/bin/env bash
+# docker-run.sh — bare docker run command (without start-qwen3-coder-next.sh lifecycle logic)
+# Useful for manual testing or embedding in other scripts.
+#
+# Usage: bash docker-run.sh
+#
+# Environment variables:
+#   HF_TOKEN   — optional Hugging Face token (required for gated models)
+#   HF_CACHE   — local weight cache path (default: ~/.cache/huggingface)
+set -euo pipefail
+HF_CACHE="${HF_CACHE:-${HOME}/.cache/huggingface}"
+mkdir -p "${HF_CACHE}"
+docker run \
+    --name qwen3-coder-next-vllm \
+    --rm \
+    --runtime=nvidia \
+    --gpus all \
+    -p 0.0.0.0:8000:8000 \
+    -v "${HF_CACHE}:/root/.cache/huggingface" \
+    --shm-size=32g \
+    -e VLLM_NVFP4_GEMM_BACKEND=marlin \
+    -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
+    -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
+    -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
+    ${HF_TOKEN:+-e HF_TOKEN="${HF_TOKEN}"} \
+    vllm/vllm-openai:cu130-nightly \
+    saricles/Qwen3-Coder-Next-NVFP4-GB10 \
+    --dtype auto \
+    --gpu-memory-utilization 0.90 \
+    --kv-cache-dtype fp8 \
+    --max-model-len 262144 \
+    --max-num-seqs 64 \
+    --max-num-batched-tokens 8192 \
+    --attention-backend flashinfer \
+    --enable-prefix-caching \
+    --enable-chunked-prefill \
+    --enable-auto-tool-choice \
+    --tool-call-parser qwen3_coder \
+    --host 0.0.0.0 \
+    --port 8000
+# ---------------------------------------------------------------------------
+# Flag reference:
+#
+#  vllm/vllm-openai:cu130-nightly
+#       Native qwen3_next support (vLLM 0.19+).
+#
+#  VLLM_NVFP4_GEMM_BACKEND=marlin
+#       SM12.1 (GB10) has no native CUTLASS FP4 kernel.
+#       Marlin handles NVFP4 W4A16 GEMM — 15% faster than CUTLASS for 512 experts.
+#
+#  VLLM_TEST_FORCE_FP8_MARLIN=1
+#       Forces FP8 Marlin path on GB10 SM12.1.
+#
+#  VLLM_USE_FLASHINFER_MOE_FP4=0
+#       FlashInfer MoE FP4 path not supported on GB10 SM12.1.
+#
+#  VLLM_MARLIN_USE_ATOMIC_ADD=1
+#       GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1.
+#
+#  No --quantization flag
+#       compressed-tensors format is auto-detected from config.json.
+#
+#  --dtype auto
+#       BF16 for non-quantized layers (DeltaNet linear_attn, router gates, lm_head).
+#
+#  --gpu-memory-utilization 0.90
+#       0.90 × 128 GB = 115 GB for vLLM. Weights: ~43 GB. KV cache: ~72 GB.
+#       Safe limit per saricles testing (0.93 is risky).
+#
+#  --kv-cache-dtype fp8
+#       FP8 KV cache. Only applies to the 12 full-attention layers (not DeltaNet).
+#
+#  --max-model-len 262144
+#       Full native context. Tested with FP8 KV cache by saricles.
+#
+#  --max-num-seqs 64
+#       Max concurrent requests.
+#
+#  --max-num-batched-tokens 8192
+#       Limits tokens per batch — prevents OOM on long contexts.
+#
+#  --attention-backend flashinfer
+#       Required for FP8 KV cache + chunked prefill on GB10.
+#
+#  --enable-prefix-caching
+#       Reuses KV cache for repeated prompt prefixes (system prompts, etc.).
+#
+#  --enable-chunked-prefill
+#       Reduces memory spikes during long-prompt processing.
+#
+#  --enable-auto-tool-choice --tool-call-parser qwen3_coder
+#       Enables OpenAI-compatible tool calling for this model.
+#
+#  --host 0.0.0.0 --port 8000
+#       OpenAI-compatible REST API, reachable from LAN.
+# ---------------------------------------------------------------------------
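
Since the command enables `--enable-auto-tool-choice --tool-call-parser qwen3_coder`, a request can carry an OpenAI-style `tools` array. The following is a sketch of such a payload. The `get_weather` tool and its schema are made up for illustration, and the commented curl line assumes the server from this script is listening on localhost:8000.

```bash
#!/usr/bin/env bash
# Prints a chat-completions request body that offers the model one tool.
# get_weather is a hypothetical tool used only for illustration.
tool_call_payload() {
    cat <<'JSON'
{
  "model": "saricles/Qwen3-Coder-Next-NVFP4-GB10",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto",
  "max_tokens": 200
}
JSON
}

tool_call_payload

# To send it (assumes the container is up on localhost:8000):
#   tool_call_payload | curl -s -X POST http://localhost:8000/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```

If the model decides to call the tool, the response's `choices[0].message.tool_calls` should carry the function name and JSON arguments instead of plain `content`.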

start-qwen3-coder-next.sh ADDED

	@@ -0,0 +1,200 @@

+#!/usr/bin/env bash
+# start-qwen3-coder-next.sh: launch saricles/Qwen3-Coder-Next-NVFP4-GB10 via vLLM on DGX Spark (GB10)
+# Requirements: Docker with nvidia-container-toolkit, vllm/vllm-openai:cu130-nightly
+#
+# Usage:
+#   bash start-qwen3-coder-next.sh
+#   HF_TOKEN=hf_xxxx bash start-qwen3-coder-next.sh
+#
+# Environment variables:
+#   HF_TOKEN      - HF token (required when the model is gated on huggingface.co)
+#   HF_CACHE_DIR  - local weight cache directory (default: ~/.cache/huggingface)
+#   MAX_MODEL_LEN - context length (default: 262144, the model's native maximum)
+set -euo pipefail
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+CONTAINER_NAME="qwen3-coder-next-vllm"
+IMAGE="vllm/vllm-openai:cu130-nightly"
+MODEL="saricles/Qwen3-Coder-Next-NVFP4-GB10"
+PORT=8000
+MAX_MODEL_LEN="${MAX_MODEL_LEN:-262144}"
+HF_CACHE_DIR="${HF_CACHE_DIR:-${HOME}/.cache/huggingface}"
+HF_TOKEN="${HF_TOKEN:-}"
+# ---------------------------------------------------------------------------
+# 1. Stop existing container (if running or stopped)
+# ---------------------------------------------------------------------------
+if docker ps --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
+    echo "[INFO] Container '${CONTAINER_NAME}' is running — stopping..."
+    docker stop "${CONTAINER_NAME}"
+    docker rm "${CONTAINER_NAME}"
+    echo "[INFO] Container stopped and removed."
+elif docker ps -a --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
+    echo "[INFO] Container '${CONTAINER_NAME}' exists (stopped) — removing..."
+    docker rm "${CONTAINER_NAME}"
+fi
+# ---------------------------------------------------------------------------
+# 2. Flush system page cache
+#    Frees unified LPDDR5X memory before vLLM starts.
+# ---------------------------------------------------------------------------
+echo "[INFO] Flushing system page cache..."
+sync
+if echo 3 | sudo -n tee /proc/sys/vm/drop_caches > /dev/null 2>&1; then
+    echo "[INFO] Page cache flushed."
+else
+    echo "[WARN] No sudo access — skipping drop_caches (run manually: echo 3 | sudo tee /proc/sys/vm/drop_caches)."
+fi
+# ---------------------------------------------------------------------------
+# 3. Ensure HF cache directory exists
+# ---------------------------------------------------------------------------
+mkdir -p "${HF_CACHE_DIR}"
+# ---------------------------------------------------------------------------
+# 4. Build optional HF_TOKEN env flag
+# ---------------------------------------------------------------------------
+HF_TOKEN_FLAG=""
+if [[ -n "${HF_TOKEN}" ]]; then
+    HF_TOKEN_FLAG="-e HF_TOKEN=${HF_TOKEN}"
+fi
+# ---------------------------------------------------------------------------
+# 5. Start the vLLM container
+#
+# Key decisions:
+#   vllm/vllm-openai:cu130-nightly
+#       Includes native qwen3_next support.
+#
+#   No --reasoning-parser
+#       Qwen3 reasoning parser puts thinking in "reasoning" field and returns
+#       content=null when max_tokens is exhausted before thinking ends.
+#       Clients like Cline don't send chat_template_kwargs to disable thinking,
+#       so they receive content=null and fail. Without the parser, all output
+#       (including <think> blocks) goes into "content" — Cline works correctly.
+#
+#   VLLM_NVFP4_GEMM_BACKEND=marlin
+#       SM12.1 (GB10) has no native CUTLASS FP4 kernel.
+#       Marlin handles NVFP4 W4A16 GEMM — required for correct operation.
+#
+#   VLLM_USE_FLASHINFER_MOE_FP4=0
+#       FlashInfer MoE FP4 path is not supported on GB10 SM12.1.
+#
+#   Quantization (auto-detected from config.json)
+#       quant_method: compressed-tensors, format: nvfp4-pack-quantized.
+#       vLLM reads this automatically — no --quantization flag needed.
+#       All Linear layers quantized to NVFP4 except: DeltaNet linear_attn,
+#       MoE router gates, shared_expert_gate, lm_head (all kept in BF16).
+#
+#   --kv-cache-dtype fp8
+#       FP8 KV cache, tested by the model author (saricles) at the full 262144 context.
+#       Only applies to the 12 full-attention layers (not DeltaNet layers).
+#
+#   --gpu-memory-utilization 0.90
+#       Unified memory (128 GB pool). 0.90 * 128 GB = 115 GB for vLLM.
+#       Weights: ~43 GB. KV cache: ~72 GB. 0.93 is risky per saricles testing.
+#
+#   --max-model-len ${MAX_MODEL_LEN} (default 262144)
+#       Full native context (large rope_theta, no rope_scaling needed).
+#       Set MAX_MODEL_LEN to a smaller value to shorten the context.
+#
+#   --max-num-seqs 64
+#       Max concurrent requests (3B active params, latent MoE).
+#
+#   --max-num-batched-tokens 8192
+#       Limits tokens per batch; prevents OOM on long contexts.
+#
+#   Reasoning
+#       Qwen3-Next uses <think>...</think> chain-of-thought (same as Qwen3).
+#       Reasoning is ON by default; disable per-request:
+#       chat_template_kwargs={"enable_thinking": false}
+#       (No --reasoning-parser is passed; see above.)
+#   --host 0.0.0.0
+#       Required for LAN access (e.g. Cline running on a different machine).
+# ---------------------------------------------------------------------------
+echo "[INFO] Starting container '${CONTAINER_NAME}'..."
+# shellcheck disable=SC2086
+docker run -d \
+    --name "${CONTAINER_NAME}" \
+    --runtime=nvidia \
+    --gpus all \
+    -p 0.0.0.0:"${PORT}":"${PORT}" \
+    -v "${HF_CACHE_DIR}:/root/.cache/huggingface" \
+    --shm-size=16g \
+    -e VLLM_NVFP4_GEMM_BACKEND=marlin \
+    -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
+    -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
+    -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
+    ${HF_TOKEN_FLAG} \
+    "${IMAGE}" \
+    "${MODEL}" \
+    --dtype auto \
+    --gpu-memory-utilization 0.90 \
+    --kv-cache-dtype fp8 \
+    --max-model-len "${MAX_MODEL_LEN}" \
+    --max-num-seqs 64 \
+    --max-num-batched-tokens 8192 \
+    --attention-backend flashinfer \
+    --enable-prefix-caching \
+    --enable-chunked-prefill \
+    --enable-auto-tool-choice \
+    --tool-call-parser qwen3_coder \
+    --host 0.0.0.0 \
+    --port "${PORT}"
+echo "[INFO] Container started (detached). Waiting for API to become ready..."
+echo "[INFO] Follow logs: docker logs -f ${CONTAINER_NAME}"
+echo ""
+# ---------------------------------------------------------------------------
+# 6. Wait for API readiness (up to 15 minutes — large model download)
+# ---------------------------------------------------------------------------
+HEALTH_URL="http://localhost:${PORT}/health"
+MAX_WAIT=900
+INTERVAL=10
+elapsed=0
+while true; do
+    if curl -sf "${HEALTH_URL}" > /dev/null 2>&1; then
+        echo ""
+        echo "[OK] vLLM API is ready!"
+        echo "[OK] OpenAI-compatible endpoint: http://localhost:${PORT}/v1 (bound to 0.0.0.0)"
+        echo "[OK] Cline configuration:"
+        echo "       Base URL : http://<spark-ip>:${PORT}/v1"
+        echo "       Model ID : ${MODEL}"
+        echo "       API Key  : dummy (any non-empty string)"
+        break
+    fi
+    if ! docker ps --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
+        echo ""
+        echo "[ERROR] Container crashed during model load!"
+        echo "        Last logs:"
+        docker logs --tail 50 "${CONTAINER_NAME}" 2>/dev/null || true
+        exit 1
+    fi
+    if [[ ${elapsed} -ge ${MAX_WAIT} ]]; then
+        echo ""
+        echo "[ERROR] API did not respond within ${MAX_WAIT}s."
+        echo "        Check logs: docker logs ${CONTAINER_NAME}"
+        exit 1
+    fi
+    printf "."
+    sleep "${INTERVAL}"
+    elapsed=$(( elapsed + INTERVAL ))
+done
+# ---------------------------------------------------------------------------
+# 7. Print recent logs
+# ---------------------------------------------------------------------------
+echo ""
+echo "[INFO] Recent container logs:"
+docker logs --tail 20 "${CONTAINER_NAME}"
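
The optional `HF_TOKEN` flag in step 4 is built as a plain string and expanded unquoted, which is why the `shellcheck disable=SC2086` line is needed. One alternative, sketched here on the assumption that Bash arrays are acceptable for this launcher, collects the optional arguments in an array. The helper name `build_token_args` and the `TOKEN_ARGS` variable are hypothetical and do not appear in the script.

```bash
#!/usr/bin/env bash
# build_token_args TOKEN
# Fills the global array TOKEN_ARGS with the docker arguments needed to
# pass the token, or leaves it empty when TOKEN is empty.
# (Hypothetical helper, shown only to illustrate the array pattern.)
build_token_args() {
    local token="${1:-}"
    TOKEN_ARGS=()
    if [ -n "${token}" ]; then
        # Two separate words: the -e flag and its KEY=VALUE argument.
        TOKEN_ARGS=(-e "HF_TOKEN=${token}")
    fi
}

build_token_args "${HF_TOKEN:-}"
echo "extra docker args: ${#TOKEN_ARGS[@]}"

# Inside the launcher the array would be expanded quoted, for example:
#   docker run -d "${TOKEN_ARGS[@]}" ... "${IMAGE}" "${MODEL}" ...
```

A quoted array expansion keeps each argument as one word, so no ShellCheck exemption is needed. Under `set -u`, Bash versions before 4.4 treat an empty array expansion as an unbound variable; there the form `${TOKEN_ARGS[@]+"${TOKEN_ARGS[@]}"}` avoids the error.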

test-api.sh ADDED

	@@ -0,0 +1,149 @@

+#!/usr/bin/env bash
+# test-api.sh — smoke tests for the vLLM API
+# Usage:
+#   bash test-api.sh                      # test localhost:8000
+#   bash test-api.sh 192.168.1.50         # test remote host
+#   bash test-api.sh 192.168.1.50 8080    # remote host, custom port
+set -euo pipefail
+HOST="${1:-localhost}"
+PORT="${2:-8000}"
+BASE_URL="http://${HOST}:${PORT}/v1"
+if [[ -t 1 ]]; then
+    GREEN="\033[0;32m"; RED="\033[0;31m"; YELLOW="\033[0;33m"; NC="\033[0m"
+else
+    GREEN=""; RED=""; YELLOW=""; NC=""
+fi
+ok()   { echo -e "${GREEN}[OK]${NC}  $*"; }
+fail() { echo -e "${RED}[FAIL]${NC} $*"; }
+info() { echo -e "${YELLOW}[INFO]${NC} $*"; }
+echo "============================================================"
+echo "  vLLM API smoke tests — ${BASE_URL}"
+echo "============================================================"
+echo ""
+# ---------------------------------------------------------------------------
+# Test 1: Health endpoint
+# ---------------------------------------------------------------------------
+info "Test 1: /health"
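+# "|| true" keeps "set -e" from aborting when the server is unreachable; curl then reports 000.
+HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "${BASE_URL%/v1}/health" || true)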
+if [[ "${HTTP_CODE}" == "200" ]]; then
+    ok "/health returned HTTP 200"
+else
+    fail "/health returned HTTP ${HTTP_CODE} (server may still be loading)"
+fi
+echo ""
+# ---------------------------------------------------------------------------
+# Test 2: Model list
+# ---------------------------------------------------------------------------
+info "Test 2: GET /v1/models"
+MODELS_RESPONSE=$(curl -s "${BASE_URL}/models" || true)
+echo "${MODELS_RESPONSE}" | python3 -m json.tool 2>/dev/null || echo "${MODELS_RESPONSE}"
+MODEL_ID=$(echo "${MODELS_RESPONSE}" | python3 -c \
+    "import sys,json; data=json.load(sys.stdin); print(data['data'][0]['id'])" 2>/dev/null || echo "")
+if [[ -n "${MODEL_ID}" ]]; then
+    ok "Model loaded: ${MODEL_ID}"
+else
+    fail "Could not parse model list"
+    MODEL_ID="saricles/Qwen3-Coder-Next-NVFP4-GB10"
+fi
+echo ""
+# ---------------------------------------------------------------------------
+# Test 3: Chat completion (reasoning off)
+# ---------------------------------------------------------------------------
+info "Test 3: POST /v1/chat/completions (reasoning off)"
+RESPONSE=$(curl -s \
+    -X POST "${BASE_URL}/chat/completions" \
+    -H "Content-Type: application/json" \
+    -d "{
+        \"model\": \"${MODEL_ID}\",
+        \"messages\": [{\"role\": \"user\", \"content\": \"Reply in one sentence: what is the capital of France?\"}],
+        \"max_tokens\": 60,
+        \"temperature\": 0.1,
+        \"chat_template_kwargs\": {\"enable_thinking\": false}
+    }" || true)
+CONTENT=$(echo "${RESPONSE}" | python3 -c \
+    "import sys,json; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])" 2>/dev/null || echo "")
+if [[ -n "${CONTENT}" ]]; then
+    ok "Chat completion works."
+    echo "  >> ${CONTENT}"
+else
+    fail "No response"
+    echo "${RESPONSE}" | python3 -m json.tool 2>/dev/null || echo "${RESPONSE}"
+fi
+echo ""
+# ---------------------------------------------------------------------------
+# Test 4: Chat completion (reasoning on)
+# ---------------------------------------------------------------------------
+info "Test 4: POST /v1/chat/completions (reasoning on)"
+RESPONSE=$(curl -s \
+    -X POST "${BASE_URL}/chat/completions" \
+    -H "Content-Type: application/json" \
+    -d "{
+        \"model\": \"${MODEL_ID}\",
+        \"messages\": [{\"role\": \"user\", \"content\": \"What is 17 * 23? Show your work.\"}],
+        \"max_tokens\": 1000,
+        \"temperature\": 0.1,
+        \"chat_template_kwargs\": {\"enable_thinking\": true}
+    }" || true)
+CONTENT=$(echo "${RESPONSE}" | python3 -c \
+    "import sys,json; r=json.load(sys.stdin); m=r['choices'][0]['message']; thinking=m.get('reasoning_content') or m.get('reasoning',''); print('thinking:', repr(thinking)[:80], '\nanswer:', m.get('content',''))" \
+    2>/dev/null || echo "")
+if [[ -n "${CONTENT}" ]]; then
+    ok "Reasoning mode works."
+    echo "${CONTENT}"
+else
+    fail "No response from reasoning mode"
+fi
+echo ""
+# ---------------------------------------------------------------------------
+# Test 5: Code generation
+# ---------------------------------------------------------------------------
+info "Test 5: Code generation"
+RESPONSE=$(curl -s \
+    -X POST "${BASE_URL}/chat/completions" \
+    -H "Content-Type: application/json" \
+    -d "{
+        \"model\": \"${MODEL_ID}\",
+        \"messages\": [{\"role\": \"user\", \"content\": \"Write a Python function that returns the nth Fibonacci number using memoization.\"}],
+        \"max_tokens\": 300,
+        \"temperature\": 0.1,
+        \"chat_template_kwargs\": {\"enable_thinking\": false}
+    }" || true)
+CODE=$(echo "${RESPONSE}" | python3 -c \
+    "import sys,json; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])" 2>/dev/null || echo "")
+if [[ -n "${CODE}" ]]; then
+    ok "Code generation works."
+    echo "${CODE}" | head -10
+    echo "  ..."
+else
+    fail "No code response"
+fi
+echo ""
+# ---------------------------------------------------------------------------
+# Summary
+# ---------------------------------------------------------------------------
+echo "============================================================"
+echo "  Cline configuration (OpenAI Compatible provider):"
+echo ""
+echo "    Base URL : ${BASE_URL}"
+echo "    Model ID : ${MODEL_ID}"
+echo "    API Key  : dummy  (any non-empty string)"
+echo "============================================================"
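
Because the launcher passes no `--reasoning-parser`, any `<think>...</think>` block arrives inside `content`. A client or test that wants only the final answer could strip it with plain parameter expansion. This is a sketch: `strip_think` is a hypothetical helper, and it assumes the thinking block (if any) precedes the answer and is properly closed.

```bash
#!/usr/bin/env bash
# strip_think TEXT
# Prints TEXT with everything up to the last </think> tag removed,
# then trims leading whitespace. Text without the tag passes through.
strip_think() {
    local text="$1"
    # Drop the longest prefix ending in </think> (no-op if the tag is absent).
    text="${text##*</think>}"
    # Remove leading whitespace left behind after the closing tag.
    text="${text#"${text%%[![:space:]]*}"}"
    printf '%s\n' "${text}"
}

strip_think '<think>France, capital city...</think>

Paris.'
```

If generation stops before the closing tag (for example when `max_tokens` runs out mid-thought), the text passes through unchanged, so a caller may want to treat a leftover `<think>` as an incomplete answer.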