Qwen3-Coder-480B-A35B-Instruct Model Comparison Analysis

Test Configuration

  • Full Precision Model: 4× NVIDIA B300 GPUs
  • NVFP4 Quantized Model: 2× NVIDIA B300 GPUs
  • Tested Concurrency Levels: 1, 2, 4, 8, 16, 32
  • Engine: TensorRT-LLM

Use it with TensorRT-LLM

```shell
docker run \
  --gpus all \
  --rm \
  --ipc=host \
  --ulimit memlock=-1:-1 \
  --ulimit stack=67108864 \
  --shm-size=64G \
  -p 8050:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -w /app/tensorrt_llm \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
  trtllm-serve \
  OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 128 \
  --tp_size 2 \
  --kv_cache_free_gpu_memory_fraction 0.9
```
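Once the container is up, `trtllm-serve` exposes an OpenAI-compatible API, reachable on the host at port 8050 (mapped from the container's 8000). A quick smoke test with curl (the prompt is illustrative; the model name must match the checkpoint passed to `trtllm-serve`):

```shell
curl http://localhost:8050/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4",
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "max_tokens": 256
      }'
```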

Performance Metrics Comparison

1) Time to First Token (TTFT) — ms

| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|------------:|-----------:|------------:|--------------------:|----------------:|
| 1 | 73.46 | 92.56 | +19.10 | +26.0% |
| 2 | 136.82 | 173.48 | +36.66 | +26.8% |
| 4 | 130.01 | 163.84 | +33.83 | +26.0% |
| 8 | 136.87 | 177.42 | +40.55 | +29.6% |
| 16 | 163.07 | 174.25 | +11.18 | +6.9% |
| 32 | 134.69 | 169.11 | +34.42 | +25.6% |

TTFT Insights

  • The NVFP4 model shows higher TTFT at every concurrency level, roughly 26% at most of them (about 23.5% averaged across all six levels).
  • The largest TTFT regression is at concurrency 8 with +29.6%.
  • The best relative TTFT result is at concurrency 16 with +6.9%.
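The Diff and percentage columns in these tables are simple derived values; a small sketch of the arithmetic, using rows from the TTFT table above:

```python
def relative_change(full: float, nvfp4: float) -> tuple[float, float]:
    """Absolute difference and percent change of NVFP4 vs. full precision."""
    diff = nvfp4 - full
    return round(diff, 2), round(diff / full * 100, 1)

# Concurrency 1: 73.46 ms (full) vs 92.56 ms (NVFP4)
print(relative_change(73.46, 92.56))    # (19.1, 26.0)
# Concurrency 16: 163.07 ms vs 174.25 ms
print(relative_change(163.07, 174.25))  # (11.18, 6.9)
```

The same formula produces the latency, TPS, and RPS columns; for latency metrics a positive change is a regression, while for throughput metrics a positive change is a gain.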

2) Inter-Token Latency (ITL) — ms

| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|------------:|-----------:|------------:|--------------------:|----------------:|
| 1 | 8.31 | 8.99 | +0.68 | +8.2% |
| 2 | 9.92 | 10.01 | +0.09 | +0.9% |
| 4 | 12.11 | 11.52 | -0.59 | -4.9% |
| 8 | 14.99 | 13.66 | -1.33 | -8.9% |
| 16 | 18.42 | 15.68 | -2.74 | -14.9% |
| 32 | 22.12 | 18.03 | -4.09 | -18.5% |

ITL Insights

  • At low concurrency (1–2), NVFP4 has slightly higher token-to-token latency.
  • At medium-to-high concurrency (4–32), NVFP4 delivers lower ITL.
  • At concurrency 32, NVFP4 achieves 18.5% lower ITL.

3) Tokens Per Second (TPS) — tokens/s

| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|------------:|-----------:|------------:|--------------------:|----------------:|
| 1 | 112.61 | 103.54 | -9.07 | -8.1% |
| 2 | 91.60 | 88.53 | -3.07 | -3.3% |
| 4 | 76.61 | 78.11 | +1.50 | +2.0% |
| 8 | 62.58 | 66.77 | +4.19 | +6.7% |
| 16 | 51.03 | 58.03 | +7.00 | +13.7% |
| 32 | 43.37 | 51.75 | +8.38 | +19.3% |

TPS Insights

  • At low concurrency (1–2), the full-precision model is better.
  • At medium-to-high concurrency (4–32), NVFP4 achieves higher TPS.
  • At concurrency 32, NVFP4 delivers 19.3% higher TPS.

4) Total Latency — seconds

| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|------------:|-----------:|------------:|--------------------:|----------------:|
| 1 | 1.12 | 1.23 | +0.11 | +9.8% |
| 2 | 1.40 | 1.45 | +0.05 | +3.6% |
| 4 | 1.66 | 1.61 | -0.05 | -3.0% |
| 8 | 2.03 | 1.90 | -0.13 | -6.4% |
| 16 | 2.49 | 2.15 | -0.34 | -13.7% |
| 32 | 2.94 | 2.43 | -0.51 | -17.3% |

Latency Insights

  • At low concurrency (1–2), the full-precision model has lower total latency.
  • At medium-to-high concurrency (4–32), NVFP4 provides lower total latency; at concurrency 32 it is 17.3% lower.

5) Throughput (RPS) — requests/s

| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|------------:|-----------:|------------:|--------------------:|----------------:|
| 1 | 0.90 | 0.81 | -0.09 | -10.0% |
| 2 | 0.72 | 0.69 | -0.03 | -4.2% |
| 4 | 0.60 | 0.62 | +0.02 | +3.3% |
| 8 | 0.49 | 0.53 | +0.04 | +8.2% |
| 16 | 0.40 | 0.46 | +0.06 | +15.0% |
| 32 | 0.34 | 0.41 | +0.07 | +20.6% |

Throughput Insights

  • At low concurrency (1–2), the full-precision model has higher throughput.
  • At medium-to-high concurrency (4–32), NVFP4 achieves higher throughput; at concurrency 32 it serves 20.6% more requests per second.
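Because the NVFP4 deployment runs on half the hardware (2× B300 vs. 4×, per the test configuration), the raw throughput numbers understate the efficiency gain. Normalizing the concurrency-32 TPS figures by GPU count gives a back-of-the-envelope comparison, not a measured result:

```python
# GPU counts from the test configuration
FULL_GPUS, NVFP4_GPUS = 4, 2

# Tokens/s at concurrency 32, from the TPS table
full_tps, nvfp4_tps = 43.37, 51.75

full_per_gpu = full_tps / FULL_GPUS     # ~10.84 tokens/s per GPU
nvfp4_per_gpu = nvfp4_tps / NVFP4_GPUS  # ~25.88 tokens/s per GPU
print(f"{nvfp4_per_gpu / full_per_gpu:.2f}x per-GPU throughput")  # 2.39x per-GPU throughput
```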