---
Qwen3-Coder-480B-A35B-Instruct Model NVFP4 Quantized

**Qwen3‑Coder‑480B‑A35B‑Instruct Model Comparison: Full vs NVFP4**

------

## Test Configuration

| Parameter                     | Setting                             |
| ----------------------------- | ----------------------------------- |
| **Full‑Precision Model**      | DGX-B300 / 4 GPU                    |
| **NVFP4 Quantized Model**     | DGX-B300 / 4 GPU                    |
| **Inference Engine**          | TRT‑LLM (TensorRT‑LLM)              |
| **Tested Concurrency Levels** | 1, 2, 4, 8, 16, 32                  |
| **Prompt Length**             | ≈ 128 tokens (64 different prompts) |
| **Maximum Response Length**   | 128 tokens                          |
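
The report does not state how each metric was computed. As a rough guide to the quantities compared in this README, here is a minimal sketch of how TTFT, ITL, total latency, and per‑request TPS are commonly derived from per‑token arrival timestamps. The `RequestTrace` type and `summarize` function are hypothetical names introduced for illustration; they are not part of TensorRT‑LLM or of the benchmark tool used here, and the definitions are assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Timestamps (seconds) for one streamed request. Hypothetical helper type."""
    sent_at: float            # when the request was issued
    token_times: list[float]  # arrival time of each generated token, ascending


def summarize(trace: RequestTrace) -> dict[str, float]:
    """Derive per-request metrics under common (assumed) definitions."""
    if not trace.token_times:
        raise ValueError("trace contains no generated tokens")
    first, last = trace.token_times[0], trace.token_times[-1]
    n_tokens = len(trace.token_times)
    # Gaps between consecutive tokens; empty when only one token was produced.
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    total_s = last - trace.sent_at
    return {
        "ttft_ms": (first - trace.sent_at) * 1000.0,         # time to first token
        "itl_ms": mean(gaps) * 1000.0 if gaps else 0.0,      # mean inter-token latency
        "latency_s": total_s,                                # request sent -> last token
        "tps": n_tokens / total_s,                           # tokens per second, per request
    }


# Synthetic example (not measured data): first token after 80 ms,
# then one token every 10 ms until 128 tokens have arrived.
example = RequestTrace(sent_at=0.0, token_times=[0.08 + 0.01 * i for i in range(128)])
metrics = summarize(example)
print(metrics)
```

Averaging these per‑request values over all requests issued at a given concurrency level would give one row of each table.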

## Performance Metrics Comparison

### 1. Time to First Token (TTFT) – milliseconds

| Full Model                                                   | NVFP4 Model                                                  |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| <img src="Coder480-full/Qwen3-Coder-480B-A35B-Instruct-TTFT.png" style="zoom:50%;"> | <img src="Coder480-nvfp4/Qwen3-Coder-480B-A35B-Instruct-TTFT.png" style="zoom:50%;"> |

| Concurrency | Full Model | NVFP4 Model | Δ (ms) | Performance Loss |
| ----------- | ---------- | ----------- | ------ | ---------------- |
| 1           | 73.46      | 92.56       | +19.10 | +26.0 %          |
| 2           | 136.82     | 173.48      | +36.66 | +26.8 %          |
| 4           | 130.01     | 163.84      | +33.83 | +26.0 %          |
| 8           | 136.87     | 177.42      | +40.55 | +29.6 %          |
| 16          | 163.07     | 174.25      | +11.18 | +6.9 %           |
| 32          | 134.69     | 169.11      | +34.42 | +25.6 %          |

**TTFT Analysis**

- Averaged over the six concurrency levels, the NVFP4 model's TTFT is **+23.5 %** higher than the full‑precision model's; excluding the concurrency‑16 outlier, the mean is **+26.8 %**.
- The greatest performance degradation occurs at concurrency 8 (**+29.6 %**).
- The smallest degradation is at concurrency 16 (**+6.9 %**), mainly because the full‑precision TTFT at that level (163.07 ms) is unusually high.
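
The percentage column, and any average over it, can be recomputed directly from the tabulated values. A minimal sketch, assuming the intended average is the unweighted mean of the per‑level percentages (the report does not say which average was used):

```python
from statistics import mean

# TTFT in milliseconds, copied from the table above: concurrency -> (full, nvfp4).
ttft_ms = {
    1: (73.46, 92.56),
    2: (136.82, 173.48),
    4: (130.01, 163.84),
    8: (136.87, 177.42),
    16: (163.07, 174.25),
    32: (134.69, 169.11),
}


def relative_change_pct(full: float, quantized: float) -> float:
    """Percentage change of the quantized value relative to the full-precision value."""
    return (quantized - full) / full * 100.0


changes = {c: relative_change_pct(f, q) for c, (f, q) in ttft_ms.items()}
for concurrency, pct in changes.items():
    print(f"concurrency {concurrency:>2}: {pct:+.1f} %")

# Unweighted mean over all six levels (assumed averaging method).
mean_change = mean(changes.values())
print(f"mean over all levels: {mean_change:+.1f} %")
```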

------

### 2. Inter‑Token Latency (ITL) – milliseconds

| Full Model                                                   | NVFP4 Model                                                  |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| <img src="Coder480-full/Qwen3-Coder-480B-A35B-Instruct-ITL.png" style="zoom:50%;"> | <img src="Coder480-nvfp4/Qwen3-Coder-480B-A35B-Instruct-ITL.png" style="zoom:50%;"> |

| Concurrency | Full Model | NVFP4 Model | Δ (ms) | Performance Change |
| ----------- | ---------- | ----------- | ------ | ------------------ |
| 1           | 8.31       | 8.99        | +0.68  | +8.2 %             |
| 2           | 9.92       | 10.01       | +0.09  | +0.9 %             |
| 4           | 12.11      | 11.52       | –0.59  | –4.9 %             |
| 8           | 14.99      | 13.66       | –1.33  | –8.9 %             |
| 16          | 18.42      | 15.68       | –2.74  | –14.9 %            |
| 32          | 22.12      | 18.03       | –4.09  | –18.5 %            |

**ITL Analysis**

- At low concurrency (1‑2) the NVFP4 model is slightly slower (+8.2 % and +0.9 %).
- From concurrency 4 upward the NVFP4 model **outperforms** the full‑precision model, reaching **18.5 %** lower latency at concurrency 32.
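
The crossover point can be read off the ITL table programmatically. A minimal sketch; the `crossover` helper is a hypothetical name, and "faster" here simply means a lower tabulated ITL:

```python
from typing import Optional

# ITL in milliseconds, copied from the table above: concurrency -> (full, nvfp4).
itl_ms = {
    1: (8.31, 8.99),
    2: (9.92, 10.01),
    4: (12.11, 11.52),
    8: (14.99, 13.66),
    16: (18.42, 15.68),
    32: (22.12, 18.03),
}


def crossover(table: dict[int, tuple[float, float]]) -> Optional[int]:
    """Return the lowest concurrency at which the quantized model has lower latency."""
    for concurrency in sorted(table):
        full, quantized = table[concurrency]
        if quantized < full:
            return concurrency
    return None  # the quantized model never wins in this table


first_win = crossover(itl_ms)
print(f"NVFP4 first has lower ITL at concurrency {first_win}")
```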

------

### 3. Tokens Per Second (TPS) – tokens / s

| Full Model                                                   | NVFP4 Model                                                  |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| <img src="Coder480-full/Qwen3-Coder-480B-A35B-Instruct-TPS.png" style="zoom:50%;"> | <img src="Coder480-nvfp4/Qwen3-Coder-480B-A35B-Instruct-TPS.png" style="zoom:50%;"> |

| Concurrency | Full Model | NVFP4 Model | Δ (tokens/s) | Performance Change |
| ----------- | ---------- | ----------- | ------------ | ------------------ |
| 1           | 112.61     | 103.54      | –9.07        | –8.1 %             |
| 2           | 91.60      | 88.53       | –3.07        | –3.3 %             |
| 4           | 76.61      | 78.11       | +1.50        | +2.0 %             |
| 8           | 62.58      | 66.77       | +4.19        | +6.7 %             |
| 16          | 51.03      | 58.03       | +7.00        | +13.7 %            |
| 32          | 43.37      | 51.75       | +8.38        | +19.3 %            |

**TPS Analysis**

- The full‑precision model is faster at low concurrency (1‑2), by 8.1 % and 3.3 % respectively.
- From concurrency 4 upward, the NVFP4 model yields higher throughput, reaching **+19.3 %** at concurrency 32.

------

### 4. Total Latency – seconds

| Full Model                                                   | NVFP4 Model                                                  |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| <img src="Coder480-full/Qwen3-Coder-480B-A35B-Instruct-Latency.png" style="zoom:50%;"> | <img src="Coder480-nvfp4/Qwen3-Coder-480B-A35B-Instruct-Latency.png" style="zoom:50%;"> |

| Concurrency | Full Model | NVFP4 Model | Δ (s) | Performance Change |
| ----------- | ---------- | ----------- | ----- | ------------------ |
| 1           | 1.12       | 1.23        | +0.11 | +9.8 %             |
| 2           | 1.40       | 1.45        | +0.05 | +3.6 %             |
| 4           | 1.66       | 1.61        | –0.05 | –3.0 %             |
| 8           | 2.03       | 1.90        | –0.13 | –6.4 %             |
| 16          | 2.49       | 2.15        | –0.34 | –13.7 %            |
| 32          | 2.94       | 2.43        | –0.51 | –17.3 %            |

**Latency Analysis**

- The full‑precision model has lower total latency at concurrency 1‑2 (NVFP4 is 9.8 % and 3.6 % slower).
- From concurrency 4 upward the NVFP4 model is faster, with the gap widening to 17.3 % at concurrency 32.
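
Why the ranking flips can be seen by splitting total latency into its prefill and decode parts. The sketch below assumes, as an approximation not stated in the report, that every response reaches the 128‑token limit and that total latency ≈ TTFT + (N − 1) × ITL. Under that assumption the estimates land within about 0.03 s of the measured values. At concurrency 1 the NVFP4 model pays both a higher TTFT and a higher ITL. At concurrency 32 its roughly 4 ms per‑token saving over 127 decode steps (about 0.5 s) far outweighs the roughly 34 ms TTFT penalty.

```python
MAX_NEW_TOKENS = 128  # maximum response length from the test configuration

# concurrency -> (TTFT ms, ITL ms, measured total latency s), copied from the tables above.
FULL = {1: (73.46, 8.31, 1.12), 32: (134.69, 22.12, 2.94)}
NVFP4 = {1: (92.56, 8.99, 1.23), 32: (169.11, 18.03, 2.43)}


def estimated_latency_s(ttft_ms: float, itl_ms: float, n_tokens: int = MAX_NEW_TOKENS) -> float:
    """Assumed model: one TTFT, then (n_tokens - 1) inter-token gaps."""
    return (ttft_ms + (n_tokens - 1) * itl_ms) / 1000.0


estimates: dict[tuple[str, int], float] = {}
measured_values: dict[tuple[str, int], float] = {}
for name, rows in (("full", FULL), ("nvfp4", NVFP4)):
    for concurrency, (ttft, itl, measured) in rows.items():
        est = estimated_latency_s(ttft, itl)
        estimates[(name, concurrency)] = est
        measured_values[(name, concurrency)] = measured
        print(f"{name:>5} @ {concurrency:>2}: estimated {est:.2f} s, measured {measured:.2f} s")
```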

------

### 5. Throughput (RPS) – requests / s

| Full Model                                                   | NVFP4 Model                                                  |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| <img src="Coder480-full/Qwen3-Coder-480B-A35B-Instruct-Throughput.png" style="zoom:50%;"> | <img src="Coder480-nvfp4/Qwen3-Coder-480B-A35B-Instruct-Throughput.png" style="zoom:50%;"> |

| Concurrency | Full Model | NVFP4 Model | Δ (RPS) | Performance Change |
| ----------- | ---------- | ----------- | ------- | ------------------ |
| 1           | 0.90       | 0.81        | –0.09   | –10.0 %            |
| 2           | 0.72       | 0.69        | –0.03   | –4.2 %             |
| 4           | 0.60       | 0.62        | +0.02   | +3.3 %             |
| 8           | 0.49       | 0.53        | +0.04   | +8.2 %             |
| 16          | 0.40       | 0.46        | +0.06   | +15.0 %            |
| 32          | 0.34       | 0.41        | +0.07   | +20.6 %            |

**Throughput Analysis**

- The full‑precision model wins at concurrency 1‑2 (NVFP4 is 10.0 % and 4.2 % lower).
- The NVFP4 model surpasses it from concurrency 4 onward, achieving **+20.6 %** at concurrency 32.
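
The reported RPS falls as concurrency rises and matches the reciprocal of total latency to within rounding, which suggests (but does not confirm) that it is a per‑stream rate rather than a system‑wide one. The sketch below checks that relationship against the tabulated values. The `aggregate_rps` helper is hypothetical and rests on the stated assumption; its output is an estimate, not a measured result.

```python
# concurrency -> (total latency s, reported RPS), copied from the tables above.
FULL = {1: (1.12, 0.90), 2: (1.40, 0.72), 4: (1.66, 0.60),
        8: (2.03, 0.49), 16: (2.49, 0.40), 32: (2.94, 0.34)}
NVFP4 = {1: (1.23, 0.81), 2: (1.45, 0.69), 4: (1.61, 0.62),
         8: (1.90, 0.53), 16: (2.15, 0.46), 32: (2.43, 0.41)}


def max_gap(rows: dict[int, tuple[float, float]]) -> float:
    """Largest absolute difference between 1 / latency and the reported RPS."""
    return max(abs(1.0 / latency - rps) for latency, rps in rows.values())


def aggregate_rps(rows: dict[int, tuple[float, float]]) -> dict[int, float]:
    """ASSUMPTION: reported RPS is per stream, so system-wide RPS = concurrency * RPS."""
    return {c: c * rps for c, (_, rps) in rows.items()}


print(f"max |1/latency - RPS|, full:  {max_gap(FULL):.4f}")
print(f"max |1/latency - RPS|, nvfp4: {max_gap(NVFP4):.4f}")
print("estimated aggregate RPS (nvfp4):", aggregate_rps(NVFP4))
```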