jerryzh168 committed · verified
Commit 5f05229 · Parent(s): 0aa9f36

Update README.md

Files changed (1): README.md (+20 -15)
README.md CHANGED
@@ -17,6 +17,10 @@ language:
  - **Quantization Method:** AWQ-INT4
 
 
+
+ Calibrated with 10 samples of `mmlu_abstract_algebra`; the AWQ-INT4 model reaches an eval accuracy of 42, while gemma-3-12b-it-INT4 scores 41 and the bfloat16 baseline scores 43.
+
+
  # Inference with vLLM
  Install vllm nightly and torchao nightly to get some recent changes:
  ```
@@ -212,10 +216,10 @@ and use a token with write access, from https://huggingface.co/settings/tokens
  # Model Quality
  We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model. Here we only run mmlu as a sanity check.
 
- | Benchmark | | |
- |----------------------------------|----------------|---------------------------|
- | | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-AWQ-INT4 |
- | mmlu | To be filled | To be filled |
+ | Benchmark | | | |
+ |----------------------------------|------------------------|-----------------------------|---------------------------------|
+ | | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+ | mmlu_abstract_algebra | 43 | 41 | 42 |
 
 
  <details>
@@ -243,11 +247,10 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 
  ## Results
 
- | Benchmark | | |
- |------------------|----------------|--------------------------------|
- | | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-AWQ-INT4 |
- | Peak Memory (GB) | To be filled | To be filled (?% reduction) |
-
+ | Benchmark | | | |
+ |----------------------------------|------------------------|-----------------------------|---------------------------------|
+ | | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+ | Peak Memory (GB) | 24.50 | 8.68 (65% reduction) | TODO |
 
 
  <details>
@@ -302,12 +305,14 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 
  # Model Performance
 
- ## Results (A100 machine)
- | Benchmark (Latency) | | |
- |----------------------------------|----------------|--------------------------|
- | | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-AWQ-INT4 |
- | latency (batch_size=1) | ?s | ?s (?x speedup) |
- | latency (batch_size=256) | ?s | ?s (?x speedup) |
+ ## Results (H100 machine)
+
+
+ | Benchmark (Latency) | | | |
+ |----------------------------------|------------------------|-----------------------------|---------------------------------|
+ | | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+ | latency (batch_size=1) | 3.73s | TODO (TODO% reduction) | TODO |
+ | latency (batch_size=256) | TODO | TODO (TODO% reduction) | TODO |
 
  <details>
  <summary> Reproduce Model Performance Results </summary>
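The "% reduction" entries in the Peak Memory table are plain ratios against the bfloat16 baseline. A minimal sketch of that arithmetic, useful for filling in the remaining TODO cells once numbers are measured (the helper name is illustrative, not part of this repo):

```python
def percent_reduction(baseline: float, quantized: float) -> int:
    """Reduction from baseline to quantized, as a rounded percent."""
    return round((baseline - quantized) / baseline * 100)

# Figures from the Peak Memory table in this commit:
baseline_gb = 24.50  # google/gemma-3-12b-it (bfloat16)
int4_gb = 8.68       # pytorch/gemma-3-12b-it-INT4

print(percent_reduction(baseline_gb, int4_gb))  # -> 65, matching the table
```

The same helper applies to the latency table's "TODO% reduction" cells, with seconds in place of gigabytes.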