Qwen3-Coder-30B-A3B-Instruct-NVFP4

Qwen/Qwen3‑Coder‑30B‑A3B‑Instruct quantized to NVFP4 using NVIDIA Model Optimizer.

Quantized by: OPENZEKA
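
For reference, the sketch below shows a typical ModelOpt post‑training quantization flow for producing an NVFP4 checkpoint like this one. It is a minimal illustration, not the exact recipe used here: the calibration prompts are placeholders, and it assumes the installed nvidia-modelopt release exposes `mtq.NVFP4_DEFAULT_CFG` and the Hugging Face checkpoint exporter.

```python
# Minimal sketch of the ModelOpt NVFP4 post-training quantization flow.
# Not the exact recipe used for this checkpoint: the calibration prompts are
# placeholders, and mtq.NVFP4_DEFAULT_CFG / export_hf_checkpoint are assumed
# to be available in the installed nvidia-modelopt release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

MODEL_ID = "Qwen/Qwen3-Coder-30B-A3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def forward_loop(m):
    # Calibration pass: run representative prompts through the model so ModelOpt
    # can collect the activation statistics used to pick NVFP4 scaling factors.
    for prompt in ["def quicksort(arr):", "SELECT name FROM users WHERE "]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="Qwen3-Coder-30B-A3B-Instruct-NVFP4")
```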

Qwen3‑Coder‑30B‑A3B‑Instruct: Full‑Precision vs. NVFP4‑Quantized Performance Comparison

The full‑precision (FP16/FP32) version of the Qwen/Qwen3‑Coder‑30B‑A3B‑Instruct model was compared against its NVFP4‑quantized version on the same hardware (DGX Spark) and inference engine (vLLM) under identical test conditions:

  • Concurrency levels: 1, 2, 4, 8, 16, 32
  • Prompt length: ≈128 tokens (64 different prompts)
  • Maximum output length: 128 tokens

This comparison clearly demonstrates the speed and efficiency advantages of the quantized model, especially under high concurrency.
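
One way to reproduce such measurements is to stream completions from a vLLM OpenAI‑compatible server and record per‑token timestamps. The sketch below is illustrative only: the endpoint URL and served model name are placeholders, and it assumes vLLM emits roughly one token per streamed chunk.

```python
# Sketch of a single timed request against a vLLM OpenAI-compatible server.
# The endpoint URL and served model name are placeholders; vLLM streams
# roughly one token per chunk, which is what the timing below assumes.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen3-Coder-30B-A3B-Instruct-NVFP4"  # placeholder served-model name

def time_request(prompt: str, max_tokens: int = 128) -> dict:
    t0 = time.perf_counter()
    stamps = []
    stream = client.completions.create(
        model=MODEL, prompt=prompt, max_tokens=max_tokens, stream=True
    )
    for _chunk in stream:
        stamps.append(time.perf_counter())
    return {
        "ttft": stamps[0] - t0,                               # time to first token (s)
        "gaps": [b - a for a, b in zip(stamps, stamps[1:])],  # inter-token intervals (s)
        "latency": stamps[-1] - t0,                           # total request time (s)
        "tokens": len(stamps),                                # generated tokens (approx.)
    }
```

Each concurrency level then corresponds to running that many such clients in parallel (threads or asyncio) over the 64 prompts.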


Main Findings (Summary)

  • The NVFP4‑quantized model is significantly faster at every concurrency level (the metric definitions are sketched in code after this summary):
    • TTFT (Time‑to‑First‑Token) is roughly 1.9–2.6× lower.
    • ITL (Inter‑Token Latency) is roughly 55–65 % lower.
    • TPS (Tokens‑Per‑Second) is 2.2–2.7× higher.
    • Total latency is 2.2–2.6× shorter.
    • Throughput (RPS) is 2.2–2.8× higher.
  • Quantization delivers a large advantage at low‑ and medium‑concurrency workloads and maintains its superiority even at high concurrency (16–32).
  • These results show that NVFP4 quantization is a very effective optimization on NVIDIA hardware (DGX Spark).
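
The metrics in the tables that follow can be derived from per‑request timings along these lines. This is a sketch using the fields returned by `time_request` above; the benchmark's exact aggregation may differ in detail.

```python
# Sketch: turning per-request timings (dicts from time_request above) into the
# table metrics. The benchmark's exact aggregation may differ in detail.
def summarize(results: list[dict], elapsed: float) -> dict:
    ttfts = sorted(r["ttft"] for r in results)
    gaps = [g for r in results for g in r["gaps"]]
    return {
        "ttft_mean_ms": 1e3 * sum(ttfts) / len(ttfts),
        "ttft_p90_ms": 1e3 * ttfts[int(0.9 * (len(ttfts) - 1))],  # nearest-rank p90
        "itl_mean_ms": 1e3 * sum(gaps) / len(gaps),
        # Per-request decode rate, averaged over requests:
        "tps_mean": sum(r["tokens"] / r["latency"] for r in results) / len(results),
        "latency_mean_s": sum(r["latency"] for r in results) / len(results),
        "throughput_rps": len(results) / elapsed,  # completed requests per second
    }
```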

Detailed Comparison Tables

Full‑Precision (FP) Model

| Concurrency | TTFT Mean (ms) | TTFT p90 (ms) | ITL Mean (ms) | TPS Mean (tokens/s) | Latency Mean (s) | Throughput (RPS) |
|---|---|---|---|---|---|---|
| 1 | 170.78 | 178.27 | 32.40 | 29.86 | 4.29 | 0.23 |
| 2 | 101.37 | 111.49 | 40.90 | 24.16 | 5.23 | 0.19 |
| 4 | 124.02 | 171.37 | 57.31 | 17.30 | 7.35 | 0.14 |
| 8 | 159.58 | 225.98 | 77.57 | 12.79 | 9.87 | 0.10 |
| 16 | 179.61 | 237.59 | 99.43 | 9.96 | 12.36 | 0.08 |
| 32 | 176.88 | 234.53 | 123.04 | 8.06 | 15.27 | 0.07 |

NVFP4‑Quantized Model

| Concurrency | TTFT Mean (ms) | TTFT p90 (ms) | ITL Mean (ms) | TPS Mean (tokens/s) | Latency Mean (s) | Throughput (RPS) |
|---|---|---|---|---|---|---|
| 1 | 66.47 | 70.55 | 14.98 | 64.99 | 1.97 | 0.51 |
| 2 | 48.79 | 55.39 | 18.07 | 53.75 | 2.28 | 0.44 |
| 4 | 59.03 | 70.68 | 23.27 | 42.44 | 2.98 | 0.34 |
| 8 | 76.21 | 93.75 | 29.38 | 33.59 | 3.72 | 0.27 |
| 16 | 78.39 | 98.40 | 36.21 | 27.29 | 4.63 | 0.22 |
| 32 | 92.31 | 138.62 | 45.40 | 21.75 | 5.89 | 0.17 |
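
The ratios quoted in the next section can be recomputed directly from the two tables; a quick sketch for the concurrency 1 and 32 rows:

```python
# Speedup ratios recomputed from the two tables (concurrency 1 and 32 shown).
# Columns per entry: TTFT mean, ITL mean, TPS mean, latency mean, RPS.
fp    = {1: (170.78, 32.40, 29.86, 4.29, 0.23), 32: (176.88, 123.04, 8.06, 15.27, 0.07)}
nvfp4 = {1: (66.47, 14.98, 64.99, 1.97, 0.51),  32: (92.31, 45.40, 21.75, 5.89, 0.17)}
for c in fp:
    t_fp, i_fp, s_fp, l_fp, r_fp = fp[c]
    t_q, i_q, s_q, l_q, r_q = nvfp4[c]
    print(f"c={c}: TTFT x{t_fp/t_q:.1f}, ITL x{i_fp/i_q:.1f}, "
          f"TPS x{s_q/s_fp:.1f}, latency x{l_fp/l_q:.1f}, RPS x{r_q/r_fp:.1f}")
# c=1:  TTFT x2.6, ITL x2.2, TPS x2.2, latency x2.2, RPS x2.2
# c=32: TTFT x1.9, ITL x2.7, TPS x2.7, latency x2.6, RPS x2.4
```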

Metric‑by‑Metric Analysis

  1. TTFT (Time‑to‑First‑Token)
    • Concurrency = 1: FP ≈ 171 ms → Quantized ≈ 66 ms (~2.6× faster).
    • Concurrency = 32: FP ≈ 177 ms → Quantized ≈ 92 ms (still ~1.9× faster).
    • The quantized model delivers the first token far more quickly, which dramatically improves user‑perceived latency, especially at low concurrency.
  2. ITL (Inter‑Token Latency)
    • In the FP model, ITL rises sharply with concurrency (up to 123 ms at 32).
    • In the quantized model the increase is much more modest (45 ms at 32), ~2.7× lower than FP.
    • This indicates that the quantized model utilizes memory bandwidth and compute resources far more efficiently.
  3. TPS (Tokens‑Per‑Second)
    • Concurrency = 1: FP 29.9 t/s → Quantized 65 t/s (2.2× increase).
    • Concurrency = 32: FP 8.1 t/s → Quantized 21.8 t/s (2.7× increase).
    • Even under heavy load, the quantized model maintains a far higher token‑generation rate.
  4. Total Latency (for a 128‑token output)
    • FP model ranges from 4.3 s (best) to 15.3 s (worst).
    • Quantized model stays within 2.0–5.9 s, i.e., roughly 2.2–2.6× faster (see the decomposition check after this list).
  5. Throughput (Requests‑Per‑Second)
    • FP model peaks at ≈ 0.23 RPS (single request).
    • Quantized model reaches 0.51 RPS for a single request and still delivers 0.17 RPS at concurrency = 32 (roughly 2.2–2.8× higher than FP across levels).
    • This enables a service to handle many more concurrent requests.
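
As a sanity check, the total latency for an N‑token response should be approximately TTFT + (N − 1) × ITL. Checking the concurrency‑32 rows (the FP estimate overshoots slightly, plausibly because not every request generates the full 128 tokens):

```python
# Sanity check: latency for an N-token response ≈ TTFT + (N - 1) * ITL.
# Concurrency-32 rows, values converted to seconds:
N = 128
fp_est = 0.17688 + (N - 1) * 0.12304  # ≈ 15.80 s vs. 15.27 s measured
q_est  = 0.09231 + (N - 1) * 0.04540  # ≈ 5.86 s  vs. 5.89 s measured
```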

Conclusion

NVFP4 quantization provides substantial performance gains for the Qwen3‑Coder‑30B‑A3B‑Instruct model. On NVIDIA DGX Spark hardware it roughly doubles to triples the speed; note that these benchmarks measure performance only, and output quality was not evaluated in this comparison. The benefits are evident across all concurrency levels, making the quantized version the clear choice for production deployments such as API services, chatbots, and other high‑traffic applications where low latency and high throughput are critical.

In short, the quantized model delivers much lower latency, higher token‑throughput, and considerably higher request‑throughput, confirming that NVFP4 quantization is a highly effective optimization for large language models on modern NVIDIA GPUs.
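
A minimal vLLM usage example follows. The repository id is a placeholder for this model's actual Hugging Face path; vLLM reads the quantization method from the checkpoint config, and NVFP4 requires a GPU and vLLM build with FP4 support.

```python
# Sketch: serving the quantized checkpoint with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(model="OPENZEKA/Qwen3-Coder-30B-A3B-Instruct-NVFP4")  # placeholder repo id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Write a Python function that reverses a linked list."], params)
print(outputs[0].outputs[0].text)
```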
