Commit b3ce9ad (parent: 9b99d9d): Update README.md

README.md (changed):

@@ -16,6 +16,21 @@ GodziLLa-30B is an experimental combination of various proprietary Maya LoRAs wi
## Open LLM Leaderboard Metrics

| Metric               | Value |
|----------------------|-------|
| MMLU (5-shot)        | 55.1  |
| ARC (25-shot)        | 54.2  |
| HellaSwag (10-shot)  | 79.7  |
| TruthfulQA (0-shot)  | 53.3  |
| Average              | 60.6  |
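The "Average" row is simply the unweighted mean of the four benchmark scores. A minimal sketch to verify the reported value:

```python
# Verify the leaderboard "Average" as the unweighted mean of the
# four per-benchmark scores reported for GodziLLa-30B.
scores = {
    "MMLU (5-shot)": 55.1,
    "ARC (25-shot)": 54.2,
    "HellaSwag (10-shot)": 79.7,
    "TruthfulQA (0-shot)": 53.3,
}

average = sum(scores.values()) / len(scores)
print(f"Average: {average:.2f}")  # agrees with the 60.6 in the table after rounding
```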
According to the leaderboard description, here are the benchmarks used for the evaluation:

- [MMLU](https://arxiv.org/abs/2009.03300) (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- [AI2 Reasoning Challenge (ARC)](https://arxiv.org/abs/1803.05457) (25-shot) - a set of grade-school science questions.
- [HellaSwag](https://arxiv.org/abs/1905.07830) (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- [TruthfulQA](https://arxiv.org/abs/2109.07958) (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.
## Recommended Prompt Format

Alpaca's instruction format is the recommended prompt format, but Vicuna's instruction format may also work.
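As a sketch of what this looks like in practice, here is the widely used Alpaca instruction template. Note the exact template is an assumption on our part: this README names the format but does not spell it out, so the standard Stanford Alpaca wording is shown.

```python
# Assumed standard Alpaca template; the README recommends the Alpaca
# format but does not print the template itself.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the Alpaca prompt template."""
    return ALPACA_TEMPLATE.format(instruction=instruction)

prompt = build_prompt("List three uses of a paperclip.")
print(prompt)
```

The model's completion is then generated after the trailing `### Response:` marker.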