Update README.md
- **Quantization Method:** AWQ-INT4

Calibrated with 10 samples of `mmlu_abstract_algebra`: eval accuracy is 42, while gemma-3-12b-it-INT4 scores 41 and the bfloat16 baseline scores 43.

# Inference with vLLM
Install vllm nightly and torchao nightly to get some recent changes:
```shell
# Exact index URLs may vary; see the vLLM and torchao install docs.
pip install --pre vllm --extra-index-url https://wheels.vllm.ai/nightly
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
```
# Model Quality
We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model. Here we only run mmlu as a sanity check.

| Benchmark             | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
|-----------------------|-----------------------|-----------------------------|---------------------------------|
| mmlu_abstract_algebra | 43                    | 41                          | 42                              |

<details>
## Results

| Benchmark        | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
|------------------|-----------------------|-----------------------------|---------------------------------|
| Peak Memory (GB) | 24.50                 | 8.68 (65% reduction)        | TODO                            |

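The percentage in the peak-memory table is just the ratio of the two measured figures; a quick sanity check in plain Python:

```python
# Peak memory figures from the table above.
baseline_gb = 24.50   # google/gemma-3-12b-it (bfloat16)
int4_gb = 8.68        # pytorch/gemma-3-12b-it-INT4

reduction_pct = (1 - int4_gb / baseline_gb) * 100
print(f"{reduction_pct:.0f}% reduction")  # prints "65% reduction"
```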
<details>
# Model Performance

## Results (H100 machine)

| Benchmark (Latency)      | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
|--------------------------|-----------------------|-----------------------------|---------------------------------|
| latency (batch_size=1)   | 3.73s                 | TODO (TODO% reduction)      | TODO                            |
| latency (batch_size=256) | TODO                  | TODO (TODO% reduction)      | TODO                            |

<details>
<summary> Reproduce Model Performance Results </summary>