Qwen3-Coder-30B-A3B-Instruct-NVFP4

Qwen/Qwen3‑Coder‑30B‑A3B‑Instruct quantized to NVFP4 using NVIDIA Model Optimizer.

Quantized by: OPENZEKA
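
For reference, the sketch below shows a typical ModelOpt post‑training quantization flow for producing an NVFP4 checkpoint like this one. It is a minimal illustration, not the exact recipe used here: the calibration prompts are placeholders, and it assumes the installed nvidia-modelopt release exposes `mtq.NVFP4_DEFAULT_CFG` and the Hugging Face checkpoint exporter.

```python
# Minimal sketch of the ModelOpt NVFP4 post-training quantization flow.
# Not the exact recipe used for this checkpoint: the calibration prompts are
# placeholders, and mtq.NVFP4_DEFAULT_CFG / export_hf_checkpoint are assumed
# to be available in the installed nvidia-modelopt release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

MODEL_ID = "Qwen/Qwen3-Coder-30B-A3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def forward_loop(m):
    # Calibration pass: run representative prompts through the model so ModelOpt
    # can collect the activation statistics used to pick NVFP4 scaling factors.
    for prompt in ["def quicksort(arr):", "SELECT name FROM users WHERE "]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="Qwen3-Coder-30B-A3B-Instruct-NVFP4")
```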

Qwen3‑Coder‑30B‑A3B‑Instruct: Full‑Precision vs. NVFP4‑Quantized Performance Comparison

The full‑precision (FP16/FP32) version of the Qwen/Qwen3‑Coder‑30B‑A3B‑Instruct model was compared against its NVFP4‑quantized version on the same hardware (DGX Spark) and inference engine (vLLM) under identical test conditions:

  • Concurrency levels: 1, 2, 4, 8, 16, 32
  • Prompt length: ≈128 tokens (64 different prompts)
  • Maximum output length: 128 tokens

This comparison clearly demonstrates the speed and efficiency advantages of the quantized model, especially under high concurrency.
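
One way to reproduce such measurements is to stream completions from a vLLM OpenAI‑compatible server and record per‑token timestamps. The sketch below is illustrative only: the endpoint URL and served model name are placeholders, and it assumes vLLM emits roughly one token per streamed chunk.

```python
# Sketch of a single timed request against a vLLM OpenAI-compatible server.
# The endpoint URL and served model name are placeholders; vLLM streams
# roughly one token per chunk, which is what the timing below assumes.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen3-Coder-30B-A3B-Instruct-NVFP4"  # placeholder served-model name

def time_request(prompt: str, max_tokens: int = 128) -> dict:
    t0 = time.perf_counter()
    stamps = []
    stream = client.completions.create(
        model=MODEL, prompt=prompt, max_tokens=max_tokens, stream=True
    )
    for _chunk in stream:
        stamps.append(time.perf_counter())
    return {
        "ttft": stamps[0] - t0,                               # time to first token (s)
        "gaps": [b - a for a, b in zip(stamps, stamps[1:])],  # inter-token intervals (s)
        "latency": stamps[-1] - t0,                           # total request time (s)
        "tokens": len(stamps),                                # generated tokens (approx.)
    }
```

Each concurrency level then corresponds to running that many such clients in parallel (threads or asyncio) over the 64 prompts.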


Main Findings (Summary)

  • The NVFP4‑quantized model is significantly faster at every concurrency level (the metric definitions are sketched in code after this summary):
    • TTFT (Time‑to‑First‑Token) is roughly 1.9–2.6× lower.
    • ITL (Inter‑Token Latency) is roughly 55–65 % lower.
    • TPS (Tokens‑Per‑Second) is 2.2–2.7× higher.
    • Total latency is 2.2–2.6× shorter.
    • Throughput (RPS) is 2.2–2.8× higher.
  • Quantization delivers a large advantage at low‑ and medium‑concurrency workloads and maintains its superiority even at high concurrency (16–32).
  • These results show that NVFP4 quantization is a very effective optimization on NVIDIA hardware (DGX Spark).
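
The metrics in the tables that follow can be derived from per‑request timings along these lines. This is a sketch using the fields returned by `time_request` above; the benchmark's exact aggregation may differ in detail.

```python
# Sketch: turning per-request timings (dicts from time_request above) into the
# table metrics. The benchmark's exact aggregation may differ in detail.
def summarize(results: list[dict], elapsed: float) -> dict:
    ttfts = sorted(r["ttft"] for r in results)
    gaps = [g for r in results for g in r["gaps"]]
    return {
        "ttft_mean_ms": 1e3 * sum(ttfts) / len(ttfts),
        "ttft_p90_ms": 1e3 * ttfts[int(0.9 * (len(ttfts) - 1))],  # nearest-rank p90
        "itl_mean_ms": 1e3 * sum(gaps) / len(gaps),
        # Per-request decode rate, averaged over requests:
        "tps_mean": sum(r["tokens"] / r["latency"] for r in results) / len(results),
        "latency_mean_s": sum(r["latency"] for r in results) / len(results),
        "throughput_rps": len(results) / elapsed,  # completed requests per second
    }
```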

Detailed Comparison Tables

Full‑Precision (FP) Model

| Concurrency | TTFT Mean (ms) | TTFT p90 (ms) | ITL Mean (ms) | TPS Mean (tokens/s) | Latency Mean (s) | Throughput (RPS) |
|---|---|---|---|---|---|---|
| 1 | 170.78 | 178.27 | 32.40 | 29.86 | 4.29 | 0.23 |
| 2 | 101.37 | 111.49 | 40.90 | 24.16 | 5.23 | 0.19 |
| 4 | 124.02 | 171.37 | 57.31 | 17.30 | 7.35 | 0.14 |
| 8 | 159.58 | 225.98 | 77.57 | 12.79 | 9.87 | 0.10 |
| 16 | 179.61 | 237.59 | 99.43 | 9.96 | 12.36 | 0.08 |
| 32 | 176.88 | 234.53 | 123.04 | 8.06 | 15.27 | 0.07 |

NVFP4‑Quantized Model

| Concurrency | TTFT Mean (ms) | TTFT p90 (ms) | ITL Mean (ms) | TPS Mean (tokens/s) | Latency Mean (s) | Throughput (RPS) |
|---|---|---|---|---|---|---|
| 1 | 66.47 | 70.55 | 14.98 | 64.99 | 1.97 | 0.51 |
| 2 | 48.79 | 55.39 | 18.07 | 53.75 | 2.28 | 0.44 |
| 4 | 59.03 | 70.68 | 23.27 | 42.44 | 2.98 | 0.34 |
| 8 | 76.21 | 93.75 | 29.38 | 33.59 | 3.72 | 0.27 |
| 16 | 78.39 | 98.40 | 36.21 | 27.29 | 4.63 | 0.22 |
| 32 | 92.31 | 138.62 | 45.40 | 21.75 | 5.89 | 0.17 |
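
The ratios quoted in the next section can be recomputed directly from the two tables; a quick sketch for the concurrency 1 and 32 rows:

```python
# Speedup ratios recomputed from the two tables (concurrency 1 and 32 shown).
# Columns per entry: TTFT mean, ITL mean, TPS mean, latency mean, RPS.
fp    = {1: (170.78, 32.40, 29.86, 4.29, 0.23), 32: (176.88, 123.04, 8.06, 15.27, 0.07)}
nvfp4 = {1: (66.47, 14.98, 64.99, 1.97, 0.51),  32: (92.31, 45.40, 21.75, 5.89, 0.17)}
for c in fp:
    t_fp, i_fp, s_fp, l_fp, r_fp = fp[c]
    t_q, i_q, s_q, l_q, r_q = nvfp4[c]
    print(f"c={c}: TTFT x{t_fp/t_q:.1f}, ITL x{i_fp/i_q:.1f}, "
          f"TPS x{s_q/s_fp:.1f}, latency x{l_fp/l_q:.1f}, RPS x{r_q/r_fp:.1f}")
# c=1:  TTFT x2.6, ITL x2.2, TPS x2.2, latency x2.2, RPS x2.2
# c=32: TTFT x1.9, ITL x2.7, TPS x2.7, latency x2.6, RPS x2.4
```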

Metric‑by‑Metric Analysis

  1. TTFT (Time‑to‑First‑Token)
    • Concurrency = 1: FP ≈ 171 ms → Quantized ≈ 66 ms (~2.6× faster).
    • Concurrency = 32: FP ≈ 177 ms → Quantized ≈ 92 ms (still ~1.9× faster).
    • The quantized model delivers the first token far more quickly, which dramatically improves user‑perceived latency, especially at low concurrency.
  2. ITL (Inter‑Token Latency)
    • In the FP model, ITL rises sharply with concurrency (up to 123 ms at 32).
    • In the quantized model the increase is much more modest (45 ms at 32), ~2.7× lower than FP.
    • This indicates that the quantized model utilizes memory bandwidth and compute resources far more efficiently.
  3. TPS (Tokens‑Per‑Second)
    • Concurrency = 1: FP 29.9 t/s → Quantized 65 t/s (2.2× increase).
    • Concurrency = 32: FP 8.1 t/s → Quantized 21.8 t/s (2.7× increase).
    • Even under heavy load, the quantized model maintains a far higher token‑generation rate.
  4. Total Latency (for a 128‑token output)
    • FP model ranges from 4.3 s (best) to 15.3 s (worst).
    • Quantized model stays within 2.0–5.9 s, i.e., roughly 2.2–2.6× faster (see the decomposition check after this list).
  5. Throughput (Requests‑Per‑Second)
    • FP model peaks at ≈ 0.23 RPS (single request).
    • Quantized model reaches 0.51 RPS for a single request and still delivers 0.17 RPS at concurrency = 32 (roughly 2.2–2.8× higher than FP across levels).
    • This enables a service to handle many more concurrent requests.
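
As a sanity check, the total latency for an N‑token response should be approximately TTFT + (N − 1) × ITL. Checking the concurrency‑32 rows (the FP estimate overshoots slightly, plausibly because not every request generates the full 128 tokens):

```python
# Sanity check: latency for an N-token response ≈ TTFT + (N - 1) * ITL.
# Concurrency-32 rows, values converted to seconds:
N = 128
fp_est = 0.17688 + (N - 1) * 0.12304  # ≈ 15.80 s vs. 15.27 s measured
q_est  = 0.09231 + (N - 1) * 0.04540  # ≈ 5.86 s  vs. 5.89 s measured
```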

Conclusion

NVFP4 quantization provides substantial performance gains for the Qwen3‑Coder‑30B‑A3B‑Instruct model. On NVIDIA DGX Spark hardware it roughly doubles to triples the speed; note that these benchmarks measure performance only, and output quality was not evaluated in this comparison. The benefits are evident across all concurrency levels, making the quantized version the clear choice for production deployments such as API services, chatbots, and other high‑traffic applications where low latency and high throughput are critical.

In short, the quantized model delivers much lower latency, higher token‑throughput, and considerably higher request‑throughput, confirming that NVFP4 quantization is a highly effective optimization for large language models on modern NVIDIA GPUs.
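
A minimal vLLM usage example follows. The repository id is a placeholder for this model's actual Hugging Face path; vLLM reads the quantization method from the checkpoint config, and NVFP4 requires a GPU and vLLM build with FP4 support.

```python
# Sketch: serving the quantized checkpoint with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(model="OPENZEKA/Qwen3-Coder-30B-A3B-Instruct-NVFP4")  # placeholder repo id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Write a Python function that reverses a linked list."], params)
print(outputs[0].outputs[0].text)
```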
