Update README.md (PR #1, opened by mbatuhanunverdi)
# Qwen3-Coder-480B-A35B-Instruct Model Comparison Analysis
## Test Configuration
- Full Precision Model: 4× B300 GPU
- NVFP4 Quantized Model: 2× B300 GPU
- Tested Concurrency Levels: 1, 2, 4, 8, 16, 32
- Engine: TensorRT-LLM
## Use it with TensorRT-LLM

```shell
docker run \
  --gpus all \
  --rm \
  --ipc=host \
  --ulimit memlock=-1:-1 \
  --ulimit stack=67108864 \
  --shm-size=64G \
  -p 8050:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -w /app/tensorrt_llm \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
  trtllm-serve \
    OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 128 \
    --tp_size 2 \
    --kv_cache_free_gpu_memory_fraction 0.9
```
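With the port mapping above, the server is reachable on the host at `http://localhost:8050`. A minimal request sketch in Python; the `/v1/chat/completions` route follows the usual OpenAI-compatible convention of `trtllm-serve` and is an assumption here, not verified against this exact image:

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8050"):
    """Build an OpenAI-style chat completion request for the served model.

    The route and payload shape follow common OpenAI-compatible server
    conventions (assumed, not verified against this container); adjust
    if your deployment differs.
    """
    payload = {
        "model": "OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending the request requires the container above to be running:
# resp = urllib.request.urlopen(build_chat_request("Write FizzBuzz in Python."))
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```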
## Performance Metrics Comparison
### 1) Time to First Token (TTFT) — ms
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Change (%) |
|---|---|---|---|---|
| 1 | 73.46 | 92.56 | +19.10 | +26.0% |
| 2 | 136.82 | 173.48 | +36.66 | +26.8% |
| 4 | 130.01 | 163.84 | +33.83 | +26.0% |
| 8 | 136.87 | 177.42 | +40.55 | +29.6% |
| 16 | 163.07 | 174.25 | +11.18 | +6.9% |
| 32 | 134.69 | 169.11 | +34.42 | +25.6% |
#### TTFT Insights
- Averaged across all six concurrency levels, the NVFP4 model shows ~23.5% higher TTFT (around +26% at most individual levels).
- The largest TTFT regression is at concurrency 8 with +29.6%.
- The smallest TTFT regression is at concurrency 16, at only +6.9%.
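The per-level overheads and their average can be recomputed directly from the table above:

```python
# TTFT (ms) from the table: {concurrency: (full-precision, NVFP4)}.
ttft = {
    1: (73.46, 92.56),
    2: (136.82, 173.48),
    4: (130.01, 163.84),
    8: (136.87, 177.42),
    16: (163.07, 174.25),
    32: (134.69, 169.11),
}

# Relative overhead of NVFP4 vs. full precision, in percent.
overheads = {c: (q - f) / f * 100 for c, (f, q) in ttft.items()}
mean_overhead = sum(overheads.values()) / len(overheads)
print(f"mean TTFT overhead: {mean_overhead:.1f}%")        # prints "mean TTFT overhead: 23.5%"
print(max(overheads, key=overheads.get))                   # 8 -> worst level
print(min(overheads, key=overheads.get))                   # 16 -> mildest level
```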
### 2) Inter-Token Latency (ITL) — ms
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Change (%) |
|---|---|---|---|---|
| 1 | 8.31 | 8.99 | +0.68 | +8.2% |
| 2 | 9.92 | 10.01 | +0.09 | +0.9% |
| 4 | 12.11 | 11.52 | -0.59 | -4.9% |
| 8 | 14.99 | 13.66 | -1.33 | -8.9% |
| 16 | 18.42 | 15.68 | -2.74 | -14.9% |
| 32 | 22.12 | 18.03 | -4.09 | -18.5% |
#### ITL Insights
- At low concurrency (1–2), NVFP4 has slightly higher token-to-token latency.
- At medium-to-high concurrency (8–32), NVFP4 performs better.
- At concurrency 32, NVFP4 achieves 18.5% lower ITL.
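The crossover point where NVFP4 becomes faster per token can be read off programmatically from the ITL table:

```python
# ITL (ms) from the table: {concurrency: (full-precision, NVFP4)}.
itl = {1: (8.31, 8.99), 2: (9.92, 10.01), 4: (12.11, 11.52),
       8: (14.99, 13.66), 16: (18.42, 15.68), 32: (22.12, 18.03)}

# Relative change of NVFP4 vs. full precision; negative means NVFP4 is faster.
change = {c: (q - f) / f * 100 for c, (f, q) in itl.items()}
crossover = min(c for c, pct in change.items() if pct < 0)
print(crossover)             # prints 4 -> first level where NVFP4 wins
print(f"{change[32]:.1f}%")  # prints "-18.5%" at concurrency 32
```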
### 3) Tokens Per Second (TPS) — tokens/s
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Change (%) |
|---|---|---|---|---|
| 1 | 112.61 | 103.54 | -9.07 | -8.1% |
| 2 | 91.60 | 88.53 | -3.07 | -3.3% |
| 4 | 76.61 | 78.11 | +1.50 | +2.0% |
| 8 | 62.58 | 66.77 | +4.19 | +6.7% |
| 16 | 51.03 | 58.03 | +7.00 | +13.7% |
| 32 | 43.37 | 51.75 | +8.38 | +19.3% |
#### TPS Insights
- At low concurrency (1–2), the full-precision model delivers higher per-stream TPS.
- At medium-to-high concurrency (4–32), NVFP4 achieves higher TPS.
- At concurrency 32, NVFP4 delivers 19.3% higher TPS.
### 4) Total Latency — seconds
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Change (%) |
|---|---|---|---|---|
| 1 | 1.12 | 1.23 | +0.11 | +9.8% |
| 2 | 1.40 | 1.45 | +0.05 | +3.6% |
| 4 | 1.66 | 1.61 | -0.05 | -3.0% |
| 8 | 2.03 | 1.90 | -0.13 | -6.4% |
| 16 | 2.49 | 2.15 | -0.34 | -13.7% |
| 32 | 2.94 | 2.43 | -0.51 | -17.3% |
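The total-latency figures are consistent with a simple decomposition, total ≈ TTFT + (N − 1) × ITL, assuming roughly N ≈ 127 output tokens per request. That output length is inferred from the tables, not a stated test parameter; the sketch below checks the decomposition against the full-precision rows:

```python
# Full-precision rows from the tables: {concurrency: (TTFT ms, ITL ms, total s)}.
full = {1: (73.46, 8.31, 1.12), 2: (136.82, 9.92, 1.40), 4: (130.01, 12.11, 1.66),
        8: (136.87, 14.99, 2.03), 16: (163.07, 18.42, 2.49), 32: (134.69, 22.12, 2.94)}

N = 127  # assumed output length in tokens (inferred, not stated in the benchmark)
for c, (ttft, itl, total) in full.items():
    predicted = ttft / 1000 + (N - 1) * itl / 1000  # seconds
    # Every level matches the reported total to within ~0.03 s.
    assert abs(predicted - total) < 0.03, (c, predicted, total)
```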
#### Latency Insights
- At low concurrency, the full-precision model completes requests faster.
- At medium-to-high concurrency, NVFP4 provides lower total latency.
### 5) Throughput (RPS) — requests/s
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Change (%) |
|---|---|---|---|---|
| 1 | 0.90 | 0.81 | -0.09 | -10.0% |
| 2 | 0.72 | 0.69 | -0.03 | -4.2% |
| 4 | 0.60 | 0.62 | +0.02 | +3.3% |
| 8 | 0.49 | 0.53 | +0.04 | +8.2% |
| 16 | 0.40 | 0.46 | +0.06 | +15.0% |
| 32 | 0.34 | 0.41 | +0.07 | +20.6% |
#### Throughput Insights
- At low concurrency, the full-precision model has higher throughput.
- At medium-to-high concurrency, NVFP4 achieves better throughput overall.
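The RPS figures appear to be per-client (the reciprocal of total latency) rather than aggregate server throughput; under that reading, system-wide throughput at a given concurrency is RPS × concurrency. This is an interpretation of the tables, not something the benchmark states. A quick consistency check:

```python
# {concurrency: (full-precision, NVFP4)} from the latency and RPS tables.
total_latency = {1: (1.12, 1.23), 2: (1.40, 1.45), 4: (1.66, 1.61),
                 8: (2.03, 1.90), 16: (2.49, 2.15), 32: (2.94, 2.43)}
rps = {1: (0.90, 0.81), 2: (0.72, 0.69), 4: (0.60, 0.62),
       8: (0.49, 0.53), 16: (0.40, 0.46), 32: (0.34, 0.41)}

# Reported RPS matches 1 / total_latency at every level, for both models.
for c, (lat_full, lat_q) in total_latency.items():
    assert abs(1 / lat_full - rps[c][0]) < 0.01
    assert abs(1 / lat_q - rps[c][1]) < 0.01

# Implied aggregate throughput at concurrency 32 under this interpretation:
print(f"full: {rps[32][0] * 32:.1f} req/s, NVFP4: {rps[32][1] * 32:.1f} req/s")
# prints "full: 10.9 req/s, NVFP4: 13.1 req/s"
```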