Instructions to use saricles/Qwen3-Coder-Next-NVFP4-GB10 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use saricles/Qwen3-Coder-Next-NVFP4-GB10 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="saricles/Qwen3-Coder-Next-NVFP4-GB10")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("saricles/Qwen3-Coder-Next-NVFP4-GB10")
model = AutoModelForCausalLM.from_pretrained("saricles/Qwen3-Coder-Next-NVFP4-GB10")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use saricles/Qwen3-Coder-Next-NVFP4-GB10 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "saricles/Qwen3-Coder-Next-NVFP4-GB10"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "saricles/Qwen3-Coder-Next-NVFP4-GB10",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/saricles/Qwen3-Coder-Next-NVFP4-GB10

SGLang

How to use saricles/Qwen3-Coder-Next-NVFP4-GB10 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "saricles/Qwen3-Coder-Next-NVFP4-GB10" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "saricles/Qwen3-Coder-Next-NVFP4-GB10",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "saricles/Qwen3-Coder-Next-NVFP4-GB10" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "saricles/Qwen3-Coder-Next-NVFP4-GB10",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Qwen3-Coder-Next-NVFP4-GB10

NVFP4 quantization of Qwen/Qwen3-Coder-Next for NVIDIA DGX Spark (GB10).

Qwen3-Coder-Next is a 79.7B-parameter MoE coding model (512 experts, 10 active per token) with a hybrid DeltaNet+attention architecture. This quantization uses a GB10-tuned ignore list that keeps fewer layers in BF16, and so quantizes more aggressively, than standard NVFP4 configurations.

Model Details


Property	Value
Base Model	Qwen/Qwen3-Coder-Next
Architecture	Qwen3NextForCausalLM (Hybrid MoE: DeltaNet + attention)
Total Parameters	79.7B
Active Parameters	~3B per token (512 experts, 10 active)
Quantization	NVFP4 (4-bit floating point) via LLM Compressor
Format	compressed-tensors (safetensors), 10 shards
Size on Disk	45.9 GB
Context Length	262,144 tokens (262K)
License	Apache 2.0

Quantization Details

Method: Post-training quantization via LLM Compressor
Calibration Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
Calibration Samples: 64
Max Sequence Length: 2048 tokens
Environment: LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1

Ignore List (layers kept in BF16)

lm_head
model.embed_tokens
re:.*linear_attn.conv1d
re:.*linear_attn.in_proj_ba
re:.*mlp.gate$
re:.*mlp.shared_expert_gate$

Everything else, including in_proj_qkvz, is quantized to FP4. At GB10's 221 GB/s memory bandwidth, the bandwidth saved by quantizing these layers outweighs the FP4 kernel dispatch overhead.
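To make the effect of the ignore list concrete, here is a minimal sketch of how such a list could be applied to module names. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name and that all other entries must equal the name exactly; the actual matching rules are defined by LLM Compressor and may differ in detail. The module names in `examples` are hypothetical and only illustrate the naming pattern.

```python
import re

# Ignore list copied from the section above.
IGNORE = [
    "lm_head",
    "model.embed_tokens",
    "re:.*linear_attn.conv1d",
    "re:.*linear_attn.in_proj_ba",
    "re:.*mlp.gate$",
    "re:.*mlp.shared_expert_gate$",
]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if the module would be left in BF16.

    Assumption: "re:" entries are regexes anchored at the start of the name
    (re.match); every other entry must match the module name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.layers.0.linear_attn.in_proj_ba",
    "model.layers.0.linear_attn.in_proj_qkvz",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.shared_expert_gate",
    "model.layers.0.mlp.experts.7.gate_proj",
]
for name in examples:
    # "quantized" assumes the module is a Linear layer targeted by the scheme.
    status = "kept in BF16" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```

Note that the trailing `$` on the two gate patterns is what keeps the router gates in BF16 while still letting the per-expert `gate_proj` projections be quantized.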

Performance (Single NVIDIA DGX Spark — GB10, 128 GB)

Benchmarked with llama-benchy v0.3.3, 3 runs per config.

Prompt tokens (PP)	Generated tokens (TG)	Prefill (tok/s)	Decode (tok/s)	TTFT (ms)
512	128	2,024	62.0	285
512	256	2,528	62.1	206
1024	128	3,261	60.6	319
1024	256	3,350	61.8	309
4096	128	3,987	61.1	1,031
4096	256	3,971	61.1	1,035
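As a rough way to read the table, the sketch below compares the measured TTFT with the prefill time implied by the prefill throughput, and estimates end-to-end request time as TTFT plus generation time. It assumes PP is the number of prompt tokens and TG the number of generated tokens; how llama-benchy defines each metric is not stated here, so treat these formulas as approximations rather than the tool's own definitions.

```python
# Rows copied from the benchmark table above:
# (prompt tokens, generated tokens, prefill tok/s, decode tok/s, TTFT ms)
ROWS = [
    (512, 128, 2024, 62.0, 285),
    (512, 256, 2528, 62.1, 206),
    (1024, 128, 3261, 60.6, 319),
    (1024, 256, 3350, 61.8, 309),
    (4096, 128, 3987, 61.1, 1031),
    (4096, 256, 3971, 61.1, 1035),
]


def estimate_prefill_ms(prompt_tokens, prefill_tps):
    """Prefill time implied by throughput alone, in milliseconds."""
    return 1000.0 * prompt_tokens / prefill_tps


def estimate_total_s(ttft_ms, generated_tokens, decode_tps):
    """Approximate request time: time to first token plus decode time."""
    return ttft_ms / 1000.0 + generated_tokens / decode_tps


for pp, tg, prefill_tps, decode_tps, ttft_ms in ROWS:
    prefill_ms = estimate_prefill_ms(pp, prefill_tps)
    total_s = estimate_total_s(ttft_ms, tg, decode_tps)
    print(
        f"PP={pp:5d} TG={tg:4d} "
        f"prefill~{prefill_ms:7.1f} ms (TTFT {ttft_ms} ms) "
        f"total~{total_s:5.2f} s"
    )
```

Under these assumptions the implied prefill time sits slightly below the measured TTFT in every row, which is what one would expect if TTFT also includes some fixed per-request overhead.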

Metric	Value
Model memory	42.7 GiB
KV cache	61.7 GiB (1,346,432 tokens)
Concurrent sessions @ 262K	~5
Concurrent sessions @ 65K	~20
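The session counts follow from dividing the KV cache capacity by the context length. A minimal sketch of that arithmetic, assuming "65K" means 65,536 tokens, that every session fills its whole context, and that no cache is shared between sessions:

```python
# Figures copied from the table above.
KV_CACHE_TOKENS = 1_346_432
KV_CACHE_GIB = 61.7


def max_sessions(context_tokens, cache_tokens=KV_CACHE_TOKENS):
    """Whole sessions that fit if each one uses its full context."""
    return cache_tokens // context_tokens


def cache_bytes_per_token(cache_gib=KV_CACHE_GIB, cache_tokens=KV_CACHE_TOKENS):
    """Average cache footprint per token implied by the two reported figures."""
    return cache_gib * 1024**3 / cache_tokens


for context_tokens in (262_144, 65_536):
    print(f"{context_tokens} tokens per session: {max_sessions(context_tokens)} sessions")

print(f"~{cache_bytes_per_token() / 1024:.1f} KiB of cache per token")
```

With prefix caching enabled, sessions that share a long common prefix may fit more densely than this worst-case estimate suggests.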

Because of the hybrid DeltaNet+attention architecture, decode speed stays essentially constant as context grows (60.6 to 62.1 tok/s across the runs above): the DeltaNet layers keep a fixed-size state and do not use a KV cache.

Running on a Single DGX Spark

Docker image: avarok/dgx-vllm-nvfp4-kernel:v23 (vLLM 0.16.0-rc2, CUDA 13.0, SM 12.1)

Download the model:

huggingface-cli download saricles/Qwen3-Coder-Next-NVFP4-GB10 \
  --local-dir /opt/huggingface/models/Qwen3-Coder-Next-NVFP4-GB10

Launch:

docker run -d --name coder-next --gpus all --ipc=host --shm-size 32g \
  -v /opt/huggingface/models/Qwen3-Coder-Next-NVFP4-GB10:/models/Qwen3-Coder-Next-NVFP4-GB10 \
  -p 8000:8000 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e MODEL=/models/Qwen3-Coder-Next-NVFP4-GB10 \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=262144 \
  -e GPU_MEMORY_UTIL=0.90 \
  -e "VLLM_EXTRA_ARGS=--kv-cache-dtype fp8 --attention-backend flashinfer --enable-prefix-caching --enable-chunked-prefill --max-num-batched-tokens 8192 --max-num-seqs 64 --enable-auto-tool-choice --tool-call-parser qwen3_coder" \
  avarok/dgx-vllm-nvfp4-kernel:v23

Test it:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Coder-Next-NVFP4-GB10",
    "messages": [{"role": "user", "content": "Write a Python function to find the longest common subsequence"}],
    "temperature": 0.7,
    "max_tokens": 2048
  }'
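The same request can be sent from Python using only the standard library. This is a sketch, not an official client: `BASE_URL` assumes the port mapping from the launch command above, and the helper names (`build_chat_request`, `list_models`, `chat`, `demo`) are made up for this example. Because the model name the server expects depends on how the image starts vLLM, the sketch first asks the OpenAI-compatible `/v1/models` endpoint which names are served instead of hard-coding one.

```python
import json
import urllib.request

# Assumption: the server is reachable on the port published by the launch command.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(model, prompt, temperature=0.7, max_tokens=2048):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def list_models(base_url=BASE_URL):
    """Return the model ids the server reports on /v1/models."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        data = json.load(resp)
    return [entry["id"] for entry in data["data"]]


def chat(model, prompt, base_url=BASE_URL):
    """Send one chat request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


def demo():
    """Query the running server; requires the container to be up."""
    models = list_models()
    print("served models:", models)
    prompt = "Write a Python function to find the longest common subsequence"
    print(chat(models[0], prompt))
```

Call `demo()` once the container has finished loading the model; until then the connection will be refused.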

Notes

With 42.7 GiB of model weights and GPU memory utilization set to 0.90, about 62 GiB is left for the KV cache, enough for 5 concurrent 262K-token sessions.
gpu_memory_utilization=0.93 also works but leaves very little system headroom; 0.90 is safer.
Decode speed is constant across context lengths thanks to the DeltaNet hybrid architecture.
Marlin backend is 15% faster than VLLM_CUTLASS for this model's 512 experts.

Target Hardware

Quantized and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). It should also work on other Blackwell GPUs with NVFP4 support, though only GB10 was tested.

Acknowledgments

Base model by Qwen
Quantization tooling by vLLM / LLM Compressor
