Qwen3-Coder-480B-A35B-Instruct Model NVFP4 Quantized

Qwen3‑Coder‑480B‑A35B‑Instruct Model Comparison: Full Precision vs NVFP4


Test Configuration

| Parameter | Setting |
| --- | --- |
| Full‑Precision Model | DGX-B300 / 4 GPU |
| NVFP4 Quantized Model | DGX-B300 / 4 GPU |
| Inference Engine | TRT‑LLM (TensorRT‑LLM) |
| Tested Concurrency Levels | 1, 2, 4, 8, 16, 32 |
| Prompt Length | ≈ 128 tokens (64 different prompts) |
| Maximum Response Length | 128 tokens |
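
For readers who want to reproduce this kind of measurement, the per-request metrics reported in the sections that follow can be derived from token arrival timestamps. The sketch below is illustrative only: `summarize_request`, `RequestMetrics`, and the synthetic timestamps are hypothetical names and values, not part of the benchmark harness used for this card, and the exact metric definitions used by that harness are not stated here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestMetrics:
    ttft_ms: float          # time from request start to first token
    itl_ms: float           # mean gap between consecutive tokens
    total_latency_s: float  # time from request start to last token
    tps: float              # generated tokens divided by total latency


def summarize_request(start_s: float, token_times_s: list[float]) -> RequestMetrics:
    """Derive latency metrics from one request's token arrival times (seconds)."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft_ms = (token_times_s[0] - start_s) * 1000.0
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    itl_ms = mean(gaps) * 1000.0 if gaps else 0.0
    total_s = token_times_s[-1] - start_s
    return RequestMetrics(ttft_ms, itl_ms, total_s, len(token_times_s) / total_s)


# Synthetic example (assumed values): first token after 75 ms, then one token every 8 ms.
synthetic_times = [0.075 + 0.008 * i for i in range(128)]
metrics = summarize_request(0.0, synthetic_times)
print(metrics)
```

Averaging these per-request values over all requests at a given concurrency level would give one row of each table, under the assumption that the harness aggregates by simple mean.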

Performance Metrics Comparison

1. Time to First Token (TTFT) – milliseconds

| Concurrency | Full Model | NVFP4 Model | Δ (ms) | Performance Loss |
| --- | --- | --- | --- | --- |
| 1 | 73.46 | 92.56 | +19.10 | +26.0 % |
| 2 | 136.82 | 173.48 | +36.66 | +26.8 % |
| 4 | 130.01 | 163.84 | +33.83 | +26.0 % |
| 8 | 136.87 | 177.42 | +40.55 | +29.6 % |
| 16 | 163.07 | 174.25 | +11.18 | +6.9 % |
| 32 | 134.69 | 169.11 | +34.42 | +25.6 % |

TTFT Analysis

  • The NVFP4 model shows higher TTFT at every concurrency level, about +23.5 % on average (mean of the six per‑level values in the table).
  • The greatest performance degradation occurs at concurrency 8 (+29.6 %).
  • The smallest degradation is at concurrency 16 (+6.9 %).
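
The percentage column and the summary figures in this section can be recomputed from the raw TTFT values. The following is a minimal sketch using only the numbers in the table; the helper name `pct_change` is illustrative.

```python
from statistics import mean

# TTFT in milliseconds: concurrency -> (full precision, NVFP4), copied from the table.
ttft_ms = {
    1: (73.46, 92.56),
    2: (136.82, 173.48),
    4: (130.01, 163.84),
    8: (136.87, 177.42),
    16: (163.07, 174.25),
    32: (134.69, 169.11),
}


def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


changes = {c: pct_change(full, nvfp4) for c, (full, nvfp4) in ttft_ms.items()}
mean_change = mean(list(changes.values()))
worst = max(changes, key=changes.get)   # concurrency with the largest slowdown
best = min(changes, key=changes.get)    # concurrency with the smallest slowdown

for c, pct in changes.items():
    print(f"concurrency {c:>2}: {pct:+.1f} %")
print(f"mean: {mean_change:+.1f} %, worst at {worst}, best at {best}")
```

Taking the mean of the per-level percentages is one of several reasonable ways to summarize the table; a ratio of the mean latencies would give a slightly different figure.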

2. Inter‑Token Latency (ITL) – milliseconds

| Concurrency | Full Model | NVFP4 Model | Δ (ms) | Performance Change |
| --- | --- | --- | --- | --- |
| 1 | 8.31 | 8.99 | +0.68 | +8.2 % |
| 2 | 9.92 | 10.01 | +0.09 | +0.9 % |
| 4 | 12.11 | 11.52 | –0.59 | –4.9 % |
| 8 | 14.99 | 13.66 | –1.33 | –8.9 % |
| 16 | 18.42 | 15.68 | –2.74 | –14.9 % |
| 32 | 22.12 | 18.03 | –4.09 | –18.5 % |

ITL Analysis

  • At low concurrency (1‑2) the NVFP4 model is slightly slower (+8.2 % and +0.9 %).
  • From concurrency 4 upward the NVFP4 model outperforms the full‑precision model, reaching 18.5 % lower latency at concurrency 32.
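
The crossover point can be read off the table programmatically. This is a small sketch over the ITL values above; `first_crossover` is a hypothetical helper, and the `lower_is_better` flag is included so the same function could be applied to throughput-style metrics.

```python
from typing import Optional

# ITL in milliseconds: concurrency -> (full precision, NVFP4), copied from the table.
itl_ms = {
    1: (8.31, 8.99),
    2: (9.92, 10.01),
    4: (12.11, 11.52),
    8: (14.99, 13.66),
    16: (18.42, 15.68),
    32: (22.12, 18.03),
}


def first_crossover(table: dict, lower_is_better: bool = True) -> Optional[int]:
    """Return the lowest concurrency at which NVFP4 beats full precision, if any."""
    for concurrency in sorted(table):
        full, nvfp4 = table[concurrency]
        nvfp4_wins = nvfp4 < full if lower_is_better else nvfp4 > full
        if nvfp4_wins:
            return concurrency
    return None


crossover = first_crossover(itl_ms)
print(f"NVFP4 first has lower ITL at concurrency {crossover}")
```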

3. Tokens Per Second (TPS) – tokens / s

| Concurrency | Full Model | NVFP4 Model | Δ (tokens/s) | Performance Change |
| --- | --- | --- | --- | --- |
| 1 | 112.61 | 103.54 | –9.07 | –8.1 % |
| 2 | 91.60 | 88.53 | –3.07 | –3.3 % |
| 4 | 76.61 | 78.11 | +1.50 | +2.0 % |
| 8 | 62.58 | 66.77 | +4.19 | +6.7 % |
| 16 | 51.03 | 58.03 | +7.00 | +13.7 % |
| 32 | 43.37 | 51.75 | +8.38 | +19.3 % |

TPS Analysis

  • The full‑precision model is faster at low concurrency (1‑2).
  • From concurrency 4 upward, the NVFP4 model yields higher throughput, reaching +19.3 % at concurrency 32.

4. Total Latency – seconds

| Concurrency | Full Model | NVFP4 Model | Δ (s) | Performance Change |
| --- | --- | --- | --- | --- |
| 1 | 1.12 | 1.23 | +0.11 | +9.8 % |
| 2 | 1.40 | 1.45 | +0.05 | +3.6 % |
| 4 | 1.66 | 1.61 | –0.05 | –3.0 % |
| 8 | 2.03 | 1.90 | –0.13 | –6.4 % |
| 16 | 2.49 | 2.15 | –0.34 | –13.7 % |
| 32 | 2.94 | 2.43 | –0.51 | –17.3 % |

Latency Analysis

  • The full‑precision model has lower total latency at low concurrency (1‑2).
  • From concurrency 4 upward the NVFP4 model is faster, reaching 17.3 % lower total latency at concurrency 32.
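
As a sanity check, the total latency figures are roughly consistent with the TTFT and ITL tables if one assumes every response runs to the full 128 tokens, so that total latency ≈ TTFT + 127 × ITL. That assumption is not stated in the test configuration; the sketch below only shows that the reported numbers agree with it to within a few hundredths of a second.

```python
MAX_TOKENS = 128  # assumed: every response reaches the maximum response length

# concurrency -> (TTFT ms, ITL ms, reported total latency s), copied from the tables.
full_precision = {
    1: (73.46, 8.31, 1.12), 2: (136.82, 9.92, 1.40), 4: (130.01, 12.11, 1.66),
    8: (136.87, 14.99, 2.03), 16: (163.07, 18.42, 2.49), 32: (134.69, 22.12, 2.94),
}
nvfp4 = {
    1: (92.56, 8.99, 1.23), 2: (173.48, 10.01, 1.45), 4: (163.84, 11.52, 1.61),
    8: (177.42, 13.66, 1.90), 16: (174.25, 15.68, 2.15), 32: (169.11, 18.03, 2.43),
}


def estimated_latency_s(ttft_ms: float, itl_ms: float, n_tokens: int = MAX_TOKENS) -> float:
    """First token after TTFT, then (n_tokens - 1) inter-token gaps."""
    return (ttft_ms + (n_tokens - 1) * itl_ms) / 1000.0


def max_deviation_s(table: dict) -> float:
    """Largest absolute gap between estimated and reported total latency."""
    return max(
        abs(estimated_latency_s(ttft, itl) - reported)
        for ttft, itl, reported in table.values()
    )


print(f"full precision: max deviation {max_deviation_s(full_precision):.3f} s")
print(f"NVFP4:          max deviation {max_deviation_s(nvfp4):.3f} s")
```

This also explains why the NVFP4 model wins on total latency at higher concurrency despite its higher TTFT: the 127 inter-token gaps dominate the total, and that is where NVFP4 is faster.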

5. Throughput (RPS) – requests / s

| Concurrency | Full Model | NVFP4 Model | Δ (RPS) | Performance Change |
| --- | --- | --- | --- | --- |
| 1 | 0.90 | 0.81 | –0.09 | –10.0 % |
| 2 | 0.72 | 0.69 | –0.03 | –4.2 % |
| 4 | 0.60 | 0.62 | +0.02 | +3.3 % |
| 8 | 0.49 | 0.53 | +0.04 | +8.2 % |
| 16 | 0.40 | 0.46 | +0.06 | +15.0 % |
| 32 | 0.34 | 0.41 | +0.07 | +20.6 % |

Throughput Analysis

  • The full‑precision model wins at low concurrency (1‑2).
  • NVFP4 model surpasses it from concurrency 4 onward, achieving +20.6 % at concurrency 32.
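
The RPS values decrease as concurrency grows, which suggests they are reported per concurrent stream rather than for the whole server. The sketch below checks that reading against the total latency table: each RPS value is close to the reciprocal of the corresponding latency. This is an inference from the numbers, not something stated in the test configuration; if it holds, aggregate server throughput would be roughly the concurrency level times the per-stream value.

```python
# concurrency -> (reported total latency s, reported RPS), copied from the tables.
full_precision = {
    1: (1.12, 0.90), 2: (1.40, 0.72), 4: (1.66, 0.60),
    8: (2.03, 0.49), 16: (2.49, 0.40), 32: (2.94, 0.34),
}
nvfp4 = {
    1: (1.23, 0.81), 2: (1.45, 0.69), 4: (1.61, 0.62),
    8: (1.90, 0.53), 16: (2.15, 0.46), 32: (2.43, 0.41),
}


def max_reciprocal_gap(table: dict) -> float:
    """Largest absolute difference between 1 / latency and the reported RPS."""
    return max(abs(1.0 / latency_s - rps) for latency_s, rps in table.values())


for name, table in (("full precision", full_precision), ("NVFP4", nvfp4)):
    print(f"{name}: max |1/latency - RPS| = {max_reciprocal_gap(table):.4f}")
```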

Model size (Safetensors): 241B params
Tensor types: BF16, F8_E4M3, U8