Commit 93bcef9 by birolkuyumcu · verified · 1 parent: 696f933

Update README.md

Files changed (1)
  1. README.md +125 -0
README.md CHANGED
@@ -5,4 +5,129 @@ base_model:
  ---
  Qwen3-Coder-480B-A35B-Instruct Model NVFP4 Quantized
 
+ **Qwen3-Coder-480B-A35B-Instruct Model Comparison: Full Precision vs NVFP4**
 
+ ------
+
+ ## Test Configuration
+
+ | Parameter                     | Setting                             |
+ | ----------------------------- | ----------------------------------- |
+ | **Full-Precision Model**      | DGX-B300 / 4 GPUs                   |
+ | **NVFP4 Quantized Model**     | DGX-B300 / 4 GPUs                   |
+ | **Inference Engine**          | TRT-LLM (TensorRT-LLM)              |
+ | **Tested Concurrency Levels** | 1, 2, 4, 8, 16, 32                  |
+ | **Prompt Length**             | ≈ 128 tokens (64 different prompts) |
+ | **Maximum Response Length**   | 128 tokens                          |
+
+
+ ## Performance Metrics Comparison
+
+ ### 1. Time to First Token (TTFT) – milliseconds
+
+ | Full Model | NVFP4 Model |
+ | ------------------------------------------------------------ | ------------------------------------------------------------ |
+ | <img src="Coder480-full/Qwen3-Coder-480B-A35B-Instruct-TTFT.png" style="zoom:50%;"> | <img src="Coder480-nvfp4/Qwen3-Coder-480B-A35B-Instruct-TTFT.png" style="zoom:50%;"> |
+
+ | Concurrency | Full Model | NVFP4 Model | Δ (ms) | Relative Change |
+ | ----------- | ---------- | ----------- | ------ | --------------- |
+ | 1           | 73.46      | 92.56       | +19.10 | +26.0 %         |
+ | 2           | 136.82     | 173.48      | +36.66 | +26.8 %         |
+ | 4           | 130.01     | 163.84      | +33.83 | +26.0 %         |
+ | 8           | 136.87     | 177.42      | +40.55 | +29.6 %         |
+ | 16          | 163.07     | 174.25      | +11.18 | +6.9 %          |
+ | 32          | 134.69     | 169.11      | +34.42 | +25.6 %         |
+
+ **TTFT Analysis**
+
+ - The NVFP4 model has a higher TTFT at every concurrency level, by about **+23.5 %** on average (mean of the six per-level differences).
+ - The greatest degradation occurs at concurrency 8 (**+29.6 %**).
+ - The smallest degradation is at concurrency 16 (**+6.9 %**); the other five levels fall between +25.6 % and +29.6 %.
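
The percentages in the TTFT table can be reproduced from the two measured columns. The sketch below is only an illustration and not part of the benchmark tooling: the helper `pct_change` and the variable names are hypothetical, and the inputs are copied from the table above.

```python
# TTFT in milliseconds, copied from the table above.
concurrency = [1, 2, 4, 8, 16, 32]
full_ttft = [73.46, 136.82, 130.01, 136.87, 163.07, 134.69]
nvfp4_ttft = [92.56, 173.48, 163.84, 177.42, 174.25, 169.11]


def pct_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# One percentage per concurrency level, then their unweighted mean.
changes = [pct_change(f, n) for f, n in zip(full_ttft, nvfp4_ttft)]
mean_change = sum(changes) / len(changes)

for level, pct in zip(concurrency, changes):
    print(f"concurrency {level:>2}: {pct:+.1f} %")
print(f"mean of per-level changes: {mean_change:+.1f} %")
```

The unweighted mean of the six per-level changes comes out near +23.5 %. Leaving out the concurrency-16 outlier raises it to about +26.8 %.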
+
+ ------
+
+ ### 2. Inter-Token Latency (ITL) – milliseconds
+
+ | Full Model | NVFP4 Model |
+ | ------------------------------------------------------------ | ------------------------------------------------------------ |
+ | <img src="Coder480-full/Qwen3-Coder-480B-A35B-Instruct-ITL.png" style="zoom:50%;"> | <img src="Coder480-nvfp4/Qwen3-Coder-480B-A35B-Instruct-ITL.png" style="zoom:50%;"> |
+
+ | Concurrency | Full Model | NVFP4 Model | Δ (ms) | Relative Change |
+ | ----------- | ---------- | ----------- | ------ | --------------- |
+ | 1           | 8.31       | 8.99        | +0.68  | +8.2 %          |
+ | 2           | 9.92       | 10.01       | +0.09  | +0.9 %          |
+ | 4           | 12.11      | 11.52       | –0.59  | –4.9 %          |
+ | 8           | 14.99      | 13.66       | –1.33  | –8.9 %          |
+ | 16          | 18.42      | 15.68       | –2.74  | –14.9 %         |
+ | 32          | 22.12      | 18.03       | –4.09  | –18.5 %         |
+
+ **ITL Analysis**
+
+ - At low concurrency (1-2) the NVFP4 model is slightly slower (+8.2 % and +0.9 %).
+ - From concurrency 4 upward the NVFP4 model **outperforms** the full-precision model, with **18.5 %** lower latency at concurrency 32.
+
+ ------
+
+ ### 3. Tokens Per Second (TPS) – tokens / s
+
+ | Full Model | NVFP4 Model |
+ | ------------------------------------------------------------ | ------------------------------------------------------------ |
+ | <img src="Coder480-full/Qwen3-Coder-480B-A35B-Instruct-TPS.png" style="zoom:50%;"> | <img src="Coder480-nvfp4/Qwen3-Coder-480B-A35B-Instruct-TPS.png" style="zoom:50%;"> |
+
+ | Concurrency | Full Model | NVFP4 Model | Δ (tokens/s) | Relative Change |
+ | ----------- | ---------- | ----------- | ------------ | --------------- |
+ | 1           | 112.61     | 103.54      | –9.07        | –8.1 %          |
+ | 2           | 91.60      | 88.53       | –3.07        | –3.3 %          |
+ | 4           | 76.61      | 78.11       | +1.50        | +2.0 %          |
+ | 8           | 62.58      | 66.77       | +4.19        | +6.7 %          |
+ | 16          | 51.03      | 58.03       | +7.00        | +13.7 %         |
+ | 32          | 43.37      | 51.75       | +8.38        | +19.3 %         |
+
+ **TPS Analysis**
+
+ - The full-precision model is faster at low concurrency (1-2).
+ - From concurrency 4 upward, the NVFP4 model yields higher throughput, reaching **+19.3 %** at concurrency 32.
+
+ ------
+
+ ### 4. Total Latency – seconds
+
+ | Full Model | NVFP4 Model |
+ | ------------------------------------------------------------ | ------------------------------------------------------------ |
+ | <img src="Coder480-full/Qwen3-Coder-480B-A35B-Instruct-Latency.png" style="zoom:50%;"> | <img src="Coder480-nvfp4/Qwen3-Coder-480B-A35B-Instruct-Latency.png" style="zoom:50%;"> |
+
+ | Concurrency | Full Model | NVFP4 Model | Δ (s) | Relative Change |
+ | ----------- | ---------- | ----------- | ----- | --------------- |
+ | 1           | 1.12       | 1.23        | +0.11 | +9.8 %          |
+ | 2           | 1.40       | 1.45        | +0.05 | +3.6 %          |
+ | 4           | 1.66       | 1.61        | –0.05 | –3.0 %          |
+ | 8           | 2.03       | 1.90        | –0.13 | –6.4 %          |
+ | 16          | 2.49       | 2.15        | –0.34 | –13.7 %         |
+ | 32          | 2.94       | 2.43        | –0.51 | –17.3 %         |
+
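+ **Latency Analysis**
+
+ - At low concurrency (1-2) the full-precision model is faster; the NVFP4 model takes 9.8 % and 3.6 % longer.
+ - From concurrency 4 upward the NVFP4 model is faster, and its advantage grows with concurrency, reaching 17.3 % lower total latency at concurrency 32.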
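
As a consistency check, the total latencies are close to what the TTFT and ITL tables predict if every response uses the full 128-token limit. That is an assumption, since the configuration only states a maximum response length, and the function and variable names below are illustrative.

```python
MAX_TOKENS = 128  # assumed: every response reaches the 128-token limit

# concurrency -> (TTFT ms, ITL ms, reported total latency s), from the tables above.
full = {
    1: (73.46, 8.31, 1.12), 2: (136.82, 9.92, 1.40), 4: (130.01, 12.11, 1.66),
    8: (136.87, 14.99, 2.03), 16: (163.07, 18.42, 2.49), 32: (134.69, 22.12, 2.94),
}
nvfp4 = {
    1: (92.56, 8.99, 1.23), 2: (173.48, 10.01, 1.45), 4: (163.84, 11.52, 1.61),
    8: (177.42, 13.66, 1.90), 16: (174.25, 15.68, 2.15), 32: (169.11, 18.03, 2.43),
}


def estimated_latency_s(ttft_ms, itl_ms, n_tokens=MAX_TOKENS):
    """First token arrives after TTFT, then (n_tokens - 1) gaps of ITL each."""
    return (ttft_ms + (n_tokens - 1) * itl_ms) / 1000.0


def max_gap(rows):
    """Largest absolute difference between estimated and reported latency, in seconds."""
    return max(abs(estimated_latency_s(t, i) - lat) for t, i, lat in rows.values())


print(f"full-precision max gap: {max_gap(full):.3f} s")
print(f"NVFP4 max gap:          {max_gap(nvfp4):.3f} s")
```

With these inputs the estimate stays within about 0.03 s of every reported latency, which suggests that responses typically ran to the token limit and that the ITL gain outweighs the TTFT penalty once concurrency reaches 4.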
+
+ ------
+
+ ### 5. Throughput (RPS) – requests / s
+
+ | Full Model | NVFP4 Model |
+ | ------------------------------------------------------------ | ------------------------------------------------------------ |
+ | <img src="Coder480-full/Qwen3-Coder-480B-A35B-Instruct-Throughput.png" style="zoom:50%;"> | <img src="Coder480-nvfp4/Qwen3-Coder-480B-A35B-Instruct-Throughput.png" style="zoom:50%;"> |
+
+ | Concurrency | Full Model | NVFP4 Model | Δ (RPS) | Relative Change |
+ | ----------- | ---------- | ----------- | ------- | --------------- |
+ | 1           | 0.90       | 0.81        | –0.09   | –10.0 %         |
+ | 2           | 0.72       | 0.69        | –0.03   | –4.2 %          |
+ | 4           | 0.60       | 0.62        | +0.02   | +3.3 %          |
+ | 8           | 0.49       | 0.53        | +0.04   | +8.2 %          |
+ | 16          | 0.40       | 0.46        | +0.06   | +15.0 %         |
+ | 32          | 0.34       | 0.41        | +0.07   | +20.6 %         |
+
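+ **Throughput Analysis**
+
+ - The full-precision model wins at low concurrency (1-2), where the NVFP4 model trails by 10.0 % and 4.2 %.
+ - The NVFP4 model surpasses it from concurrency 4 onward, achieving **+20.6 %** at concurrency 32.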
+ - NVFP4 model surpasses it from concurrency 4 onward, achieving **+20.6 %** at concurrency 32.