# Qwen3-Coder-480B-A35B-Instruct Model Comparison Analysis
## Test Configuration
- **Full-Precision Model:** 4× B300 GPUs
- **NVFP4 Quantized Model:** 2× B300 GPUs
- **Tested Concurrency Levels:** 1, 2, 4, 8, 16, 32
- **Engine:** TensorRT-LLM
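As a rough sanity check on the GPU counts, a back-of-envelope weight-memory estimate can be sketched. The assumptions here are not from the benchmark itself: the full-precision run stores weights in BF16 (2 bytes/parameter), NVFP4 uses roughly 0.5 bytes/parameter before block scale factors, and each B300 offers 288 GB of HBM.

```shell
# Back-of-envelope weight memory for a 480B-parameter model.
# Assumptions (not from the source): BF16 = 2 bytes/param for the full run,
# NVFP4 ~ 0.5 bytes/param (excluding scale factors), B300 = 288 GB HBM each.
awk 'BEGIN {
  params = 480e9
  printf "BF16 weights:  ~%.0f GB total, ~%.0f GB per GPU on 4x B300\n", params * 2   / 1e9, params * 2   / 1e9 / 4
  printf "NVFP4 weights: ~%.0f GB total, ~%.0f GB per GPU on 2x B300\n", params * 0.5 / 1e9, params * 0.5 / 1e9 / 2
}'
```

Under these assumptions both layouts leave headroom per GPU for the KV cache, which is what the serving command below hands over with `--kv_cache_free_gpu_memory_fraction 0.9`.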
---
## Use it with TensorRT-LLM
```bash
docker run \
  --gpus all \
  --rm \
  --ipc=host \
  --ulimit memlock=-1:-1 \
  --ulimit stack=67108864 \
  --shm-size=64G \
  -p 8050:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -w /app/tensorrt_llm \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
  trtllm-serve \
    OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 128 \
    --tp_size 2 \
    --kv_cache_free_gpu_memory_fraction 0.9
```
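Once the container is up, `trtllm-serve` exposes an OpenAI-compatible HTTP API; a minimal smoke test against the host-mapped port 8050 might look like the sketch below. The prompt and `max_tokens` value are arbitrary examples, not part of the benchmark setup.

```shell
# Build an OpenAI-style chat-completions payload for the served model.
PAYLOAD='{
  "model": "OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4",
  "messages": [{"role": "user", "content": "Write a quicksort in Python."}],
  "max_tokens": 256
}'
echo "$PAYLOAD"
# Send it once the server reports ready (8050 is the host port mapped above):
# curl -s http://localhost:8050/v1/chat/completions \
#   -H 'Content-Type: application/json' \
#   -d "$PAYLOAD"
```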
## Performance Metrics Comparison
### 1) Time to First Token (TTFT) — ms
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|---:|---:|---:|---:|---:|
| 1 | 73.46 | 92.56 | +19.10 | +26.0% |
| 2 | 136.82 | 173.48 | +36.66 | +26.8% |
| 4 | 130.01 | 163.84 | +33.83 | +26.0% |
| 8 | 136.87 | 177.42 | +40.55 | +29.6% |
| 16 | 163.07 | 174.25 | +11.18 | +6.9% |
| 32 | 134.69 | 169.11 | +34.42 | +25.6% |
**TTFT Insights**
- The NVFP4 model shows **~23.5% higher TTFT on average** (mean of the per-level regressions) across the tested concurrency levels.
- The largest TTFT regression is at **concurrency 8** with **+29.6%**.
- The best relative TTFT result is at **concurrency 16** with **+6.9%**.
---
### 2) Inter-Token Latency (ITL) — ms
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|---:|---:|---:|---:|---:|
| 1 | 8.31 | 8.99 | +0.68 | +8.2% |
| 2 | 9.92 | 10.01 | +0.09 | +0.9% |
| 4 | 12.11 | 11.52 | -0.59 | -4.9% |
| 8 | 14.99 | 13.66 | -1.33 | -8.9% |
| 16 | 18.42 | 15.68 | -2.74 | -14.9% |
| 32 | 22.12 | 18.03 | -4.09 | -18.5% |
**ITL Insights**
- At low concurrency (1–2), NVFP4 has **slightly higher** token-to-token latency.
- At medium-to-high concurrency (8–32), NVFP4 performs **better**.
- At **concurrency 32**, NVFP4 achieves **18.5% lower ITL**.
---
### 3) Tokens Per Second (TPS) — tokens/s
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|---:|---:|---:|---:|---:|
| 1 | 112.61 | 103.54 | -9.07 | -8.1% |
| 2 | 91.60 | 88.53 | -3.07 | -3.3% |
| 4 | 76.61 | 78.11 | +1.50 | +2.0% |
| 8 | 62.58 | 66.77 | +4.19 | +6.7% |
| 16 | 51.03 | 58.03 | +7.00 | +13.7% |
| 32 | 43.37 | 51.75 | +8.38 | +19.3% |
**TPS Insights**
- At low concurrency (1–2), the full-precision model is better.
- At medium-to-high concurrency (4–32), NVFP4 achieves **higher TPS**.
- At **concurrency 32**, NVFP4 delivers **19.3% higher TPS**.
---
### 4) Total Latency — seconds
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|---:|---:|---:|---:|---:|
| 1 | 1.12 | 1.23 | +0.11 | +9.8% |
| 2 | 1.40 | 1.45 | +0.05 | +3.6% |
| 4 | 1.66 | 1.61 | -0.05 | -3.0% |
| 8 | 2.03 | 1.90 | -0.13 | -6.4% |
| 16 | 2.49 | 2.15 | -0.34 | -13.7% |
| 32 | 2.94 | 2.43 | -0.51 | -17.3% |
**Latency Insights**
- At low concurrency, the full-precision model is better.
- At medium-to-high concurrency, NVFP4 provides **lower total latency**.
---
### 5) Throughput (RPS) — requests/s
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|---:|---:|---:|---:|---:|
| 1 | 0.90 | 0.81 | -0.09 | -10.0% |
| 2 | 0.72 | 0.69 | -0.03 | -4.2% |
| 4 | 0.60 | 0.62 | +0.02 | +3.3% |
| 8 | 0.49 | 0.53 | +0.04 | +8.2% |
| 16 | 0.40 | 0.46 | +0.06 | +15.0% |
| 32 | 0.34 | 0.41 | +0.07 | +20.6% |
**Throughput Insights**
- At low concurrency, the full-precision model has higher throughput.
- At medium-to-high concurrency, NVFP4 achieves better throughput overall.
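The percentage columns in all five tables are simply `(NVFP4 - Full) / Full × 100`; a quick recomputation of three representative cells:

```shell
# Recompute the relative-change column from the raw table values:
# (NVFP4 - Full) / Full * 100.
awk 'BEGIN {
  printf "TTFT, concurrency 8:  %+.1f%%\n", (177.42 - 136.87) / 136.87 * 100
  printf "ITL,  concurrency 32: %+.1f%%\n", (18.03  - 22.12)  / 22.12  * 100
  printf "TPS,  concurrency 32: %+.1f%%\n", (51.75  - 43.37)  / 43.37  * 100
}'
```

These reproduce the +29.6%, -18.5%, and +19.3% figures reported above.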
---
## Model Card Metadata
```yaml
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- nvfp4
base_model: Qwen/Qwen3-Coder-480B-A35B-Instruct
---
```