Update README.md

README.md (changed):
@@ -174,6 +174,42 @@ metrics to cover different aspects of text generation. Evaluation results marked
 with **IT** are for instruction-tuned models. Evaluation results marked with
 **PT** are for pre-trained models.
 
+#### Gemma 3 270M
+
+| **Benchmark**             | **n-shot** | **Gemma 3 PT 270M** |
+| :------------------------ | :--------: | ------------------: |
+| [HellaSwag][hellaswag]    |  10-shot   |                40.9 |
+| [BoolQ][boolq]            |   0-shot   |                61.4 |
+| [PIQA][piqa]              |   0-shot   |                67.7 |
+| [TriviaQA][triviaqa]      |   5-shot   |                15.4 |
+| [ARC-c][arc]              |  25-shot   |                29.0 |
+| [ARC-e][arc]              |   0-shot   |                57.7 |
+| [WinoGrande][winogrande]  |   5-shot   |                52.0 |
+
+[hellaswag]: https://arxiv.org/abs/1905.07830
+[boolq]: https://arxiv.org/abs/1905.10044
+[piqa]: https://arxiv.org/abs/1911.11641
+[triviaqa]: https://arxiv.org/abs/1705.03551
+[arc]: https://arxiv.org/abs/1911.01547
+[winogrande]: https://arxiv.org/abs/1907.10641
+
+| **Benchmark**             | **n-shot** | **Gemma 3 IT 270m** |
+| :------------------------ | :--------: | ------------------: |
+| [HellaSwag][hellaswag]    |   0-shot   |                37.7 |
+| [PIQA][piqa]              |   0-shot   |                66.2 |
+| [ARC-c][arc]              |   0-shot   |                28.2 |
+| [WinoGrande][winogrande]  |   0-shot   |                52.3 |
+| [BIG-Bench Hard][bbh]     |  few-shot  |                26.7 |
+| [IF Eval][ifeval]         |   0-shot   |                51.2 |
+
+[hellaswag]: https://arxiv.org/abs/1905.07830
+[piqa]: https://arxiv.org/abs/1911.11641
+[arc]: https://arxiv.org/abs/1911.01547
+[winogrande]: https://arxiv.org/abs/1907.10641
+[bbh]: https://paperswithcode.com/dataset/bbh
+[bbh]: https://paperswithcode.com/dataset/bbh
+[ifeval]: https://arxiv.org/abs/2311.07911
+
 #### Gemma 3 1B, 4B, 12B & 27B
 
 ##### Reasoning and factuality
@@ -327,42 +363,6 @@ with **IT** are for instruction-tuned models. Evaluation results marked with
 [countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/
 [mathvista]: https://arxiv.org/abs/2310.02255
 
-#### Gemma 3 270M
-
-| **Benchmark**             | **n-shot** | **Gemma 3 PT 270M** |
-| :------------------------ | :--------: | ------------------: |
-| [HellaSwag][hellaswag]    |  10-shot   |                40.9 |
-| [BoolQ][boolq]            |   0-shot   |                61.4 |
-| [PIQA][piqa]              |   0-shot   |                67.7 |
-| [TriviaQA][triviaqa]      |   5-shot   |                15.4 |
-| [ARC-c][arc]              |  25-shot   |                29.0 |
-| [ARC-e][arc]              |   0-shot   |                57.7 |
-| [WinoGrande][winogrande]  |   5-shot   |                52.0 |
-
-[hellaswag]: https://arxiv.org/abs/1905.07830
-[boolq]: https://arxiv.org/abs/1905.10044
-[piqa]: https://arxiv.org/abs/1911.11641
-[triviaqa]: https://arxiv.org/abs/1705.03551
-[arc]: https://arxiv.org/abs/1911.01547
-[winogrande]: https://arxiv.org/abs/1907.10641
-
-| **Benchmark**             | **n-shot** | **Gemma 3 IT 270m** |
-| :------------------------ | :--------: | ------------------: |
-| [HellaSwag][hellaswag]    |   0-shot   |                37.7 |
-| [PIQA][piqa]              |   0-shot   |                66.2 |
-| [ARC-c][arc]              |   0-shot   |                28.2 |
-| [WinoGrande][winogrande]  |   0-shot   |                52.3 |
-| [BIG-Bench Hard][bbh]     |  few-shot  |                26.7 |
-| [IF Eval][ifeval]         |   0-shot   |                51.2 |
-
-[hellaswag]: https://arxiv.org/abs/1905.07830
-[piqa]: https://arxiv.org/abs/1911.11641
-[arc]: https://arxiv.org/abs/1911.01547
-[winogrande]: https://arxiv.org/abs/1907.10641
-[bbh]: https://paperswithcode.com/dataset/bbh
-[bbh]: https://paperswithcode.com/dataset/bbh
-[ifeval]: https://arxiv.org/abs/2311.07911
-
 ## Ethics and Safety
 
 Ethics and safety evaluation approach and results.
@@ -528,5 +528,3 @@ alternatives.
 [ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
 [sustainability]: https://sustainability.google/operating-sustainably/
 [gemini-2-paper]: https://arxiv.org/abs/2312.11805
-
-
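Note that the relocated link-definition block defines `[bbh]` twice; both the removed and the added hunks carry the duplicate. CommonMark resolves a reference label to its first matching definition, so the second line is dead weight rather than an error. Duplicates like this are easy to catch mechanically; a minimal sketch (the helper name is hypothetical, not part of this repo):

```python
import re

def duplicate_link_defs(markdown: str) -> list[str]:
    """Return Markdown link-reference labels defined more than once.

    CommonMark matches labels case-insensitively, so normalize to
    lowercase before counting.
    """
    labels = re.findall(r"^\[([^\]]+)\]:", markdown, flags=re.MULTILINE)
    seen, dupes = set(), []
    for label in (l.lower() for l in labels):
        if label in seen and label not in dupes:
            dupes.append(label)
        seen.add(label)
    return dupes

# The tail of the moved block, as committed:
section = """\
[bbh]: https://paperswithcode.com/dataset/bbh
[bbh]: https://paperswithcode.com/dataset/bbh
[ifeval]: https://arxiv.org/abs/2311.07911
"""
print(duplicate_link_defs(section))  # ['bbh']
```

A follow-up commit could simply drop one of the two `[bbh]` lines; renderers already ignore it.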
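Since this commit is meant to be a pure relocation of the 270M section (old lines 330–365 to new lines 177–212), one quick sanity check when reviewing is that the diff's added lines exactly equal its removed lines. A minimal sketch, assuming the unified diff is available as plain text; blank lines are ignored so the trailing blank-line cleanup in the last hunk does not count as a content change:

```python
def is_pure_move(diff_text: str) -> bool:
    """True if every non-blank line added by a unified diff equals,
    in order, every non-blank line removed (content only relocated)."""
    added, removed = [], []
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---", "@@")):
            continue  # file and hunk headers are not content
        if line.startswith("+") and line[1:].strip():
            added.append(line[1:])
        elif line.startswith("-") and line[1:].strip():
            removed.append(line[1:])
    return added == removed

# Toy diff mirroring this commit's shape: a block deleted in one
# hunk and re-added verbatim in another.
diff = """\
@@ -330,2 +330,0 @@
-#### Gemma 3 270M
-| [PIQA][piqa] | 0-shot | 67.7 |
@@ -176,0 +177,2 @@
+#### Gemma 3 270M
+| [PIQA][piqa] | 0-shot | 67.7 |
"""
print(is_pure_move(diff))  # True
```

The check is order-sensitive, so it would also flag a move that silently reordered or edited a benchmark row.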