	# gdubicki/Qwen3-Coder-Next-NVFP4-GB10 on DGX Spark (GB10)

	Runs [`gdubicki/Qwen3-Coder-Next-NVFP4-GB10`](https://huggingface.co/gdubicki/Qwen3-Coder-Next-NVFP4-GB10) (quantized by [saricles](https://huggingface.co/saricles/Qwen3-Coder-Next-NVFP4-GB10)) via vLLM with an OpenAI-compatible API endpoint.
	Tested on DGX Spark (GB10 Blackwell, SM12.1, 128 GB unified LPDDR5X).

	## Model overview

	| Property | Value |
	|----------|-------|
	| Architecture | `qwen3_next` (hybrid DeltaNet linear attention + full attention + latent MoE) |
	| Base model | `Qwen/Qwen3-Coder-Next` |
	| Parameters | 80B total, 3B active per token (512 experts, 10 active + 1 shared) |
	| Layers | 48 (repeating pattern of 3× DeltaNet linear attention followed by 1× full attention, giving 12 full-attention layers) |
	| Quantization | NVFP4 via `llmcompressor` + `compressed-tensors`; all 512 MoE experts calibrated |
	| Kept in BF16 | `lm_head`, `embed_tokens`, `linear_attn` layers, `mlp.gate`, `mlp.shared_expert_gate` |
	| KV cache | FP8 (only for the 12 full-attention layers) |
	| Model size | ~45 GB (70% reduction from ~149 GB BF16) |
	| Max context | 262,144 tokens (native; tested with FP8 KV cache by saricles) |
	| Reasoning | Built-in chain-of-thought (`<think>` tags), ON by default |

	## Model features

	| Feature | Support |
	|---------|---------|
	| Tool calling | ✅ Yes (`--tool-call-parser qwen3_coder`) |
	| Reasoning / thinking mode | ✅ Yes (ON by default, toggleable via `enable_thinking`) |
	| Languages | Multilingual (code-focused) |

	## Performance

	Measured on DGX Spark (GB10, SM12.1, 128 GB unified LPDDR5X):

	| Metric | Value |
	|--------|-------|
	| Throughput | ~61 tok/s |
	| Max context | 262,144 tokens |
	| KV cache concurrency | 31.65× at 262,144 tokens per request (DeltaNet layers keep no KV cache) |
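
To put the concurrency figure in perspective, it can be converted into an approximate KV-cache token capacity. This is a minimal sketch, assuming the 31.65× value is the capacity of the KV cache divided by the per-request context length; `kv_token_capacity` is a hypothetical helper, not part of the repository scripts.

```shell
# Hypothetical helper: approximate KV-cache capacity in tokens.
# Usage: kv_token_capacity <concurrency> <context_length>
kv_token_capacity() {
  awk -v c="$1" -v n="$2" 'BEGIN { printf "%.0f\n", c * n }'
}

# Figures taken from the table above.
kv_token_capacity 31.65 262144
```

Under that assumption the cache holds roughly 8.3 million tokens, shared among all concurrent requests.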

	Comparison across models on GB10:

	| Model | Active params | tok/s |
	|-------|---------------|-------|
	| Gemma-4-31B-IT-NVFP4 (dense) | 31B | ~7 |
	| Qwen3-32B-NVFP4 (dense) | 32.8B | ~11 |
	| Nemotron-3-Super-120B-A12B-NVFP4 | 12B | ~16 |
	| Qwen3-Coder-Next-NVFP4-GB10 | 3B | ~61 |
	| Nemotron-3-Nano-30B-A3B-NVFP4 | 3B | ~61 |

	Qwen3-Coder-Next has 80B total parameters but only 3B active per token, so it matches the throughput of Nemotron-3-Nano (~61 tok/s) while serving the full 262K context.

	## Quick start

	```bash
	# Required — model is gated on Hugging Face (accept license at gdubicki/Qwen3-Coder-Next-NVFP4-GB10 first):
	export HF_TOKEN=hf_xxxx

	bash start-qwen3-coder-next.sh
	```

	The script will:
	1. Stop and remove any existing `qwen3-coder-next-vllm` container
	2. Flush the system page cache (frees unified memory before vLLM starts)
	3. Start the container in detached mode
	4. Poll `http://localhost:8000/health` until the API is ready (~45 GB download on first run)
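
The health-polling step can be sketched as a small retry loop. This is an illustration only, not the contents of `start-qwen3-coder-next.sh`; the helper name `wait_until` and its timeout and interval arguments are assumptions.

```shell
# Hypothetical helper: retry a probe command until it succeeds or a deadline passes.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout="$1"
  local interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  # Re-run the probe, discarding its output, until it exits with status 0.
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1  # deadline reached without a successful probe
    fi
    sleep "$interval"
  done
  return 0
}
```

A launcher could then call `wait_until 1800 5 curl -sf http://localhost:8000/health`, choosing a timeout long enough to cover the ~45 GB first-run download.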

	## Test the API

	```bash
	bash test-api.sh # localhost
	bash test-api.sh 192.168.x.x # remote host
	```
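
For orientation, here is a minimal sketch of how a smoke test with an optional host argument might be structured. It is not the actual `test-api.sh`; the function names are hypothetical, and it assumes the server listens on port 8000 as in the rest of this README.

```shell
# Hypothetical helper: build an API URL; the host defaults to localhost.
# Usage: api_url [host] [path]
api_url() {
  local host="${1:-localhost}"
  local path="${2:-/health}"
  printf 'http://%s:8000%s\n' "$host" "$path"
}

# Hypothetical smoke test: check liveness, then list the served models.
smoke_test() {
  local host="${1:-localhost}"
  if ! curl -sf "$(api_url "$host" /health)" >/dev/null; then
    echo "health check failed for $host" >&2
    return 1
  fi
  curl -sf "$(api_url "$host" /v1/models)"
}
```

Calling `smoke_test` with no argument targets localhost, and `smoke_test 192.168.x.x` targets a remote host, mirroring the two invocations above.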

	## Reasoning (chain-of-thought)

	Reasoning is ON by default. Toggle per request:

	```bash
	# Reasoning OFF
	curl -s -X POST http://localhost:8000/v1/chat/completions \
	-H "Content-Type: application/json" \
	-d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"What is the capital of France?"}],"max_tokens":60,"chat_template_kwargs":{"enable_thinking":false}}'

	# Reasoning ON (default)
	curl -s -X POST http://localhost:8000/v1/chat/completions \
	-H "Content-Type: application/json" \
	-d '{"model":"gdubicki/Qwen3-Coder-Next-NVFP4-GB10","messages":[{"role":"user","content":"Write a binary search implementation in Python."}],"max_tokens":2000,"chat_template_kwargs":{"enable_thinking":true}}'
	```

	## Cline configuration

	1. Open Cline settings (sidebar icon → gear icon)
	2. Fill in the fields:

	| Field | Value |
	|-------|-------|
	| Provider | OpenAI Compatible |
	| Base URL | `http://<spark-ip>:8000/v1` |
	| Model ID | `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` |
	| API Key | `dummy` (any non-empty string) |

	## Files

	| File | Purpose |
	|------|---------|
	| `start-qwen3-coder-next.sh` | Full launcher: stop, cache flush, docker run, health poll |
	| `docker-run.sh` | Bare `docker run` command with comments, for reference |
	| `test-api.sh` | curl smoke tests: health, model list, chat completion, reasoning, code generation |

	## How it works

	The `vllm/vllm-openai:cu130-nightly` image includes native `qwen3_next` support.

	The architecture is a hybrid of three components:
	- DeltaNet selective linear attention layers (36 of 48): subquadratic in sequence length, no KV cache
	- Full attention layers (12 of 48): standard transformer attention with a KV cache
	- Latent MoE: 80B total parameters, 3B active per token, which gives the same throughput profile as Nemotron-3-Nano

	The checkpoint was quantized with `llmcompressor` into the `compressed-tensors` format, with `LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1` set so that all 512 MoE experts are calibrated rather than only a sample.
	vLLM auto-detects the quantization scheme from `quantization_config` in `config.json`, so no `--quantization` flag is needed.
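
To see what vLLM will pick up, the `quant_method` field can be read straight from a downloaded `config.json`. This is a rough sketch using only `sed`; the helper name `quant_method_of` is hypothetical, and it assumes the key and its string value sit on one line, as in typically formatted Hugging Face configs.

```shell
# Hypothetical helper: print the first "quant_method" value found in a JSON file.
# Usage: quant_method_of <path/to/config.json>
quant_method_of() {
  sed -n 's/.*"quant_method"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}
```

Run it as `quant_method_of path/to/config.json`. For a checkpoint in the `compressed-tensors` format the expected value is `compressed-tensors`, though that should be confirmed against the actual file.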

	### Key environment variables

	| Variable | Reason |
	|----------|--------|
	| `VLLM_NVFP4_GEMM_BACKEND=marlin` | SM12.1 (GB10) has no native CUTLASS FP4 kernel; Marlin is 15% faster for 512 experts |
	| `VLLM_TEST_FORCE_FP8_MARLIN=1` | Forces FP8 Marlin path on GB10 SM12.1 |
	| `VLLM_USE_FLASHINFER_MOE_FP4=0` | FlashInfer MoE FP4 path not supported on GB10 SM12.1 |
	| `VLLM_MARLIN_USE_ATOMIC_ADD=1` | GB10-specific Marlin optimization for correct FP4 GEMM on SM12.1 |

	### Key flags

	| Flag | Reason |
	|------|--------|
	| `--dtype auto` | BF16 for non-quantized layers (DeltaNet, router gates, lm_head) |
	| `--kv-cache-dtype fp8` | FP8 KV cache; applies to the 12 full-attention layers only |
	| `--gpu-memory-utilization 0.90` | 0.90 × 128 GB ≈ 115 GB; covers ~43 GB weights + ~72 GB KV cache (0.93 is risky) |
	| `--max-model-len 262144` | Full native context; tested by saricles with FP8 KV cache |
	| `--max-num-seqs 64` | Max concurrent requests |
	| `--max-num-batched-tokens 8192` | Prevents OOM on long contexts |
	| `--attention-backend flashinfer` | Required for FP8 KV cache + chunked prefill on GB10 |
	| `--enable-prefix-caching` | Reuses KV cache for repeated prompt prefixes (system prompts) |
	| `--enable-chunked-prefill` | Reduces memory spikes during long-prompt processing |
	| `--tool-call-parser qwen3_coder` | OpenAI-compatible tool calling |
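
Put together, the environment variables and flags above might be assembled into a command like the one printed by this dry-run sketch. It is not a copy of `docker-run.sh`: `--gpus all`, `--ipc=host`, the port mapping, the cache mount, and passing the model as a positional argument are assumptions based on common vLLM Docker usage, and `print_docker_cmd` is a hypothetical name.

```shell
# Dry-run sketch: prints a docker run command, nothing is executed.
# Everything before the image name except the four VLLM_* variables and the
# container name is an assumption, not taken from the repository scripts.
print_docker_cmd() {
  cat <<'EOF'
docker run -d --name qwen3-coder-next-vllm \
  --gpus all --ipc=host -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_TOKEN \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  vllm/vllm-openai:cu130-nightly \
  gdubicki/Qwen3-Coder-Next-NVFP4-GB10 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --tool-call-parser qwen3_coder
EOF
}

print_docker_cmd
```

Printing the command rather than running it makes it easy to compare against `docker-run.sh` before changing anything on the host.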

	## Requirements

	- Docker with `nvidia-container-toolkit`
	- Image: `vllm/vllm-openai:cu130-nightly`
	- HF token with access to `gdubicki/Qwen3-Coder-Next-NVFP4-GB10` (gated — accept license first)
	- Model weights cached locally (auto-downloaded on first run, ~45 GB):
	`~/.cache/huggingface/hub/models--gdubicki--Qwen3-Coder-Next-NVFP4-GB10/`
