Qwen3-Coder-480B-A35B-Instruct Model NVFP4 Quantized

Qwen3‑Coder‑480B‑A35B‑Instruct Model Comparison: Full Precision vs NVFP4

Test Configuration

Parameter	Setting
Full‑Precision Model	DGX-B300 / 4 GPU
NVFP4 Quantized Model	DGX-B300 / 4 GPU
Inference Engine	TRT‑LLM (TensorRT‑LLM)
Tested Concurrency Levels	1, 2, 4, 8, 16, 32
Prompt Length	≈ 128 tokens (64 different prompts)
Maximum Response Length	128 tokens
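
The source does not state how each metric was measured. A minimal Python sketch of one common convention, deriving all per-request metrics from the arrival times of streamed tokens, is shown here. The names `RequestMetrics` and `compute_metrics` are hypothetical, and defining TPS as generated tokens divided by total latency is an assumption, not something the benchmark confirms.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestMetrics:
    ttft_ms: float          # request start to first token
    itl_ms: float           # mean gap between consecutive tokens
    total_latency_s: float  # request start to last token
    tps: float              # generated tokens / total latency (assumed definition)


def compute_metrics(start_s: float, token_times_s: Sequence[float]) -> RequestMetrics:
    """Derive per-request metrics from a start time and token arrival times (seconds)."""
    if not token_times_s:
        raise ValueError("at least one token timestamp is required")
    n = len(token_times_s)
    ttft_s = token_times_s[0] - start_s
    total_s = token_times_s[-1] - start_s
    # Mean inter-token gap; undefined for a single token, so report 0.0 in that case.
    itl_s = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return RequestMetrics(ttft_s * 1e3, itl_s * 1e3, total_s, n / total_s)


# Synthetic example: first token after 80 ms, then one token every 10 ms, 128 tokens.
times = [0.080 + 0.010 * i for i in range(128)]
m = compute_metrics(0.0, times)
print(f"TTFT {m.ttft_ms:.1f} ms, ITL {m.itl_ms:.2f} ms, "
      f"latency {m.total_latency_s:.2f} s, TPS {m.tps:.1f}")
```

In a real run, `token_times_s` would be recorded with a monotonic clock as each streamed chunk arrives, and the per-request values would be averaged over all requests at a given concurrency level.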

Performance Metrics Comparison

1. Time to First Token (TTFT) – milliseconds

Concurrency	Full Model	NVFP4 Model	Δ (ms)	Performance Loss
1	73.46	92.56	+19.10	+26.0 %
2	136.82	173.48	+36.66	+26.8 %
4	130.01	163.84	+33.83	+26.0 %
8	136.87	177.42	+40.55	+29.6 %
16	163.07	174.25	+11.18	+6.9 %
32	134.69	169.11	+34.42	+25.6 %

TTFT Analysis

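TTFT is higher for the NVFP4 model at every concurrency level, by about 23.5 % on average (mean of the six per‑level percentages in the table).
The largest increase occurs at concurrency 8 (+29.6 %).
The smallest increase is at concurrency 16 (+6.9 %), the level at which the full‑precision TTFT is at its highest (163.07 ms).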
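
The Δ and percentage columns follow from the raw TTFT values by simple arithmetic. The Python sketch below recomputes them, along with the mean and the extremes of the per-level percentages. The table values are copied in by hand, and `relative_change` is a hypothetical helper name.

```python
from statistics import mean

levels = [1, 2, 4, 8, 16, 32]
full_ms = [73.46, 136.82, 130.01, 136.87, 163.07, 134.69]
nvfp4_ms = [92.56, 173.48, 163.84, 177.42, 174.25, 169.11]


def relative_change(base: float, new: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


pct = [relative_change(f, q) for f, q in zip(full_ms, nvfp4_ms)]
for c, f, q, p in zip(levels, full_ms, nvfp4_ms, pct):
    print(f"{c:>2}  delta = {q - f:+6.2f} ms  change = {p:+5.1f} %")

worst = levels[pct.index(max(pct))]   # concurrency with the largest TTFT increase
best = levels[pct.index(min(pct))]    # concurrency with the smallest TTFT increase
print(f"mean change = {mean(pct):+.1f} %  (largest at {worst}, smallest at {best})")
```

Averaging the per-level percentages weights every concurrency level equally; comparing the sums of the two columns instead would give a slightly different figure.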

2. Inter‑Token Latency (ITL) – milliseconds

Concurrency	Full Model	NVFP4 Model	Δ (ms)	Performance Change
1	8.31	8.99	+0.68	+8.2 %
2	9.92	10.01	+0.09	+0.9 %
4	12.11	11.52	–0.59	–4.9 %
8	14.99	13.66	–1.33	–8.9 %
16	18.42	15.68	–2.74	–14.9 %
32	22.12	18.03	–4.09	–18.5 %

ITL Analysis

At low concurrency (1‑2) the NVFP4 model is slightly slower (+8.2 % and +0.9 %).
From concurrency 4 upward the NVFP4 model outperforms the full‑precision model, with inter‑token latency up to 18.5 % lower at concurrency 32.

3. Tokens Per Second (TPS) – tokens / s

Concurrency	Full Model	NVFP4 Model	Δ (tokens/s)	Performance Change
1	112.61	103.54	–9.07	–8.1 %
2	91.60	88.53	–3.07	–3.3 %
4	76.61	78.11	+1.50	+2.0 %
8	62.58	66.77	+4.19	+6.7 %
16	51.03	58.03	+7.00	+13.7 %
32	43.37	51.75	+8.38	+19.3 %

TPS Analysis

The full‑precision model is faster at low concurrency (1‑2).
From concurrency 4 upward, the NVFP4 model yields higher throughput, reaching +19.3 % at concurrency 32.
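
The crossover point can be read off the table programmatically. A small Python sketch follows, with `first_crossover` as a hypothetical helper: it returns the lowest concurrency from which the NVFP4 model stays ahead at every higher tested level, for metrics where either higher or lower is better.

```python
from typing import Optional, Sequence


def first_crossover(levels: Sequence[int], full: Sequence[float],
                    nvfp4: Sequence[float], higher_is_better: bool) -> Optional[int]:
    """Lowest concurrency from which NVFP4 is better at all remaining levels, else None."""
    result = None
    for c, f, q in zip(levels, full, nvfp4):
        better = q > f if higher_is_better else q < f
        if better:
            if result is None:
                result = c
        else:
            result = None  # the advantage did not hold; start over
    return result


levels = [1, 2, 4, 8, 16, 32]
tps_full = [112.61, 91.60, 76.61, 62.58, 51.03, 43.37]
tps_nvfp4 = [103.54, 88.53, 78.11, 66.77, 58.03, 51.75]
print(first_crossover(levels, tps_full, tps_nvfp4, higher_is_better=True))  # prints 4
```

The same function applies to the latency tables with `higher_is_better=False`.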

4. Total Latency – seconds

Concurrency	Full Model	NVFP4 Model	Δ (s)	Performance Change
1	1.12	1.23	+0.11	+9.8 %
2	1.40	1.45	+0.05	+3.6 %
4	1.66	1.61	–0.05	–3.0 %
8	2.03	1.90	–0.13	–6.4 %
16	2.49	2.15	–0.34	–13.7 %
32	2.94	2.43	–0.51	–17.3 %

Latency Analysis

The full‑precision model has lower total latency at concurrency 1‑2 (by 0.11 s and 0.05 s).
From concurrency 4 upward the NVFP4 model is faster, with total latency 17.3 % lower at concurrency 32.
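
The total latency figures are consistent with the TTFT and ITL tables if each response is assumed to reach the 128-token cap: total latency ≈ TTFT + 127 × ITL. That assumption is not stated in the source, but under it the estimates land within about 0.03 s of the reported values. It also shows why the NVFP4 model wins despite its higher TTFT: at concurrency 4, for example, the TTFT penalty of about 34 ms is outweighed by 127 × 0.59 ms ≈ 75 ms saved across the remaining tokens. A Python sketch of the check, with values copied from the tables:

```python
OUTPUT_TOKENS = 128  # assumption: every response reaches the 128-token cap

levels = [1, 2, 4, 8, 16, 32]
# Values copied from the TTFT, ITL and Total Latency tables.
ttft_ms = {"full": [73.46, 136.82, 130.01, 136.87, 163.07, 134.69],
           "nvfp4": [92.56, 173.48, 163.84, 177.42, 174.25, 169.11]}
itl_ms = {"full": [8.31, 9.92, 12.11, 14.99, 18.42, 22.12],
          "nvfp4": [8.99, 10.01, 11.52, 13.66, 15.68, 18.03]}
reported_s = {"full": [1.12, 1.40, 1.66, 2.03, 2.49, 2.94],
              "nvfp4": [1.23, 1.45, 1.61, 1.90, 2.15, 2.43]}


def estimated_latency_s(ttft: float, itl: float, n_tokens: int = OUTPUT_TOKENS) -> float:
    """Approximate total latency: TTFT plus one inter-token gap per remaining token."""
    return (ttft + (n_tokens - 1) * itl) / 1000.0


def net_change_ms(i: int) -> float:
    """NVFP4 minus full precision: TTFT penalty plus accumulated ITL difference."""
    ttft_penalty = ttft_ms["nvfp4"][i] - ttft_ms["full"][i]
    itl_total = (OUTPUT_TOKENS - 1) * (itl_ms["nvfp4"][i] - itl_ms["full"][i])
    return ttft_penalty + itl_total


for i, c in enumerate(levels):
    est_full = estimated_latency_s(ttft_ms["full"][i], itl_ms["full"][i])
    est_q = estimated_latency_s(ttft_ms["nvfp4"][i], itl_ms["nvfp4"][i])
    print(f"{c:>2}  full {est_full:.2f} s (reported {reported_s['full'][i]:.2f})  "
          f"nvfp4 {est_q:.2f} s (reported {reported_s['nvfp4'][i]:.2f})  "
          f"net {net_change_ms(i):+7.1f} ms")
```

A negative net value means the per-token savings outweigh the slower first token for a response of this length; shorter responses would shift the break-even point toward higher concurrency.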

5. Throughput (RPS) – requests / s

Concurrency	Full Model	NVFP4 Model	Δ (RPS)	Performance Change
1	0.90	0.81	–0.09	–10.0 %
2	0.72	0.69	–0.03	–4.2 %
4	0.60	0.62	+0.02	+3.3 %
8	0.49	0.53	+0.04	+8.2 %
16	0.40	0.46	+0.06	+15.0 %
32	0.34	0.41	+0.07	+20.6 %

Throughput Analysis

The full‑precision model has higher throughput at concurrency 1‑2.
The NVFP4 model surpasses it from concurrency 4 onward, reaching +20.6 % at concurrency 32.
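
The RPS values in the table track the reciprocal of total latency to within about 0.01, which suggests they are per-stream rates rather than totals across all concurrent requests. The source does not say so explicitly, so this reading is an assumption. Under it, and assuming every concurrency slot stays continuously busy, Little's law gives a rough aggregate rate of concurrency / latency. A Python sketch, with `aggregate_rps` as a hypothetical helper:

```python
levels = [1, 2, 4, 8, 16, 32]
# Total latency in seconds, copied from the Total Latency table.
latency_full_s = [1.12, 1.40, 1.66, 2.03, 2.49, 2.94]
latency_nvfp4_s = [1.23, 1.45, 1.61, 1.90, 2.15, 2.43]


def aggregate_rps(concurrency: int, latency_s: float) -> float:
    """Little's law for a closed loop: completed requests per second over all streams."""
    return concurrency / latency_s


for c, lf, lq in zip(levels, latency_full_s, latency_nvfp4_s):
    print(f"{c:>2}  per-stream {1 / lf:.2f} vs {1 / lq:.2f} req/s  "
          f"aggregate {aggregate_rps(c, lf):6.2f} vs {aggregate_rps(c, lq):6.2f} req/s")
```

The aggregate figures are estimates derived from the latency table, not measured results.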

Model size (Safetensors): 241B params
Tensor types: BF16, F8_E4M3

Model tree for OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4

Base model: Qwen/Qwen3-Coder-480B-A35B-Instruct
Quantized (34): this model