jerryzh168 committed · verified
Commit 5f05229 · Parent(s): 0aa9f36

Update README.md

Files changed (1): README.md (+20 -15)
README.md CHANGED
@@ -17,6 +17,10 @@ language:
  - **Quantization Method:** AWQ-INT4
 
 
+
+ Calibrated with 10 samples of `mmlu_abstract_algebra`; the AWQ-INT4 model reaches an eval accuracy of 42, while gemma-3-12b-it-INT4 scores 41 and the bfloat16 baseline scores 43.
+
+
  # Inference with vLLM
  Install vllm nightly and torchao nightly to get some recent changes:
  ```
@@ -212,10 +216,10 @@ and use a token with write access, from https://huggingface.co/settings/tokens
  # Model Quality
  We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model. Here we only run mmlu as a sanity check.
 
- | Benchmark | | |
- |----------------------------------|----------------|---------------------------|
- | | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-AWQ-INT4 |
- | mmlu | To be filled | To be filled |
+ | Benchmark | | | |
+ |----------------------------------|------------------------|-----------------------------|---------------------------------|
+ | | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+ | mmlu_abstract_algebra | 43 | 41 | 42 |
 
 
  <details>
@@ -243,11 +247,10 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 
  ## Results
 
- | Benchmark | | |
- |------------------|----------------|--------------------------------|
- | | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-AWQ-INT4 |
- | Peak Memory (GB) | To be filled | To be filled (?% reduction) |
-
+ | Benchmark | | | |
+ |----------------------------------|------------------------|-----------------------------|---------------------------------|
+ | | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+ | Peak Memory (GB) | 24.50 | 8.68 (65% reduction) | TODO |
 
 
  <details>
@@ -302,12 +305,14 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 
  # Model Performance
 
- ## Results (A100 machine)
- | Benchmark (Latency) | | |
- |----------------------------------|----------------|--------------------------|
- | | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-AWQ-INT4 |
- | latency (batch_size=1) | ?s | ?s (?x speedup) |
- | latency (batch_size=256) | ?s | ?s (?x speedup) |
+ ## Results (H100 machine)
+
+
+ | Benchmark (Latency) | | | |
+ |----------------------------------|------------------------|-----------------------------|---------------------------------|
+ | | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 | pytorch/gemma-3-12b-it-AWQ-INT4 |
+ | latency (batch_size=1) | 3.73s | TODO (TODO% reduction) | TODO |
+ | latency (batch_size=256) | TODO | TODO (TODO% reduction) | TODO |
 
  <details>
  <summary> Reproduce Model Performance Results </summary>
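The "% reduction" entries in the Peak Memory table are plain ratios against the bfloat16 baseline. A minimal sketch of that arithmetic, useful for filling in the remaining TODO cells once numbers are measured (the helper name is illustrative, not part of this repo):

```python
def percent_reduction(baseline: float, quantized: float) -> int:
    """Reduction from baseline to quantized, as a rounded percent."""
    return round((baseline - quantized) / baseline * 100)

# Figures from the Peak Memory table in this commit:
baseline_gb = 24.50  # google/gemma-3-12b-it (bfloat16)
int4_gb = 8.68       # pytorch/gemma-3-12b-it-INT4

print(percent_reduction(baseline_gb, int4_gb))  # -> 65, matching the table
```

The same helper applies to the latency table's "TODO% reduction" cells, with seconds in place of gigabytes.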