jaspercatapang committed on
Commit b3ce9ad · 1 Parent(s): 9b99d9d

Update README.md

Files changed (1)
  1. README.md +15 -0
README.md CHANGED
@@ -16,6 +16,21 @@ GodziLLa-30B is an experimental combination of various proprietary Maya LoRAs wi
  ![Godzilla Let Them Fight Meme GIF](https://media.tenor.com/AZkmVImwd5YAAAAC/godzilla-let-them-fight.gif)

+ ## Open LLM Leaderboard Metrics
+ | Metric              | Value |
+ |---------------------|-------|
+ | MMLU (5-shot)       | 55.1  |
+ | ARC (25-shot)       | 54.2  |
+ | HellaSwag (10-shot) | 79.7  |
+ | TruthfulQA (0-shot) | 53.3  |
+ | Average             | 60.6  |
+
+ According to the leaderboard description, the following benchmarks are used for the evaluation:
+ - [MMLU](https://arxiv.org/abs/2009.03300) (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
+ - [AI2 Reasoning Challenge](https://arxiv.org/abs/1803.05457) (ARC) (25-shot) - a set of grade-school science questions.
+ - [HellaSwag](https://arxiv.org/abs/1905.07830) (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
+ - [TruthfulQA](https://arxiv.org/abs/2109.07958) (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.
+
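The Average row in the metrics table appears to be the unweighted mean of the four benchmark scores. The sketch below checks that arithmetic; the variable and function names are illustrative and not part of the README or the leaderboard code.

```python
# Scores copied from the Open LLM Leaderboard Metrics table.
# The dictionary and helper names are hypothetical, chosen for illustration.
scores = {
    "MMLU (5-shot)": 55.1,
    "ARC (25-shot)": 54.2,
    "HellaSwag (10-shot)": 79.7,
    "TruthfulQA (0-shot)": 53.3,
}


def mean_score(values):
    """Return the unweighted arithmetic mean of an iterable of scores."""
    values = list(values)
    return sum(values) / len(values)


average = mean_score(scores.values())
print(f"Average: {average:.1f}")  # prints "Average: 60.6"
```

The unrounded mean is 60.575, which rounds to the 60.6 reported in the table.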
  ## Recommended Prompt Format
  Alpaca's instruction is the recommended prompt format, but Vicuna's instruction format may also work.
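The README does not spell out the exact template, so the following is only a sketch that assumes the widely used Stanford Alpaca template for instructions without an additional input field. The helper name `build_alpaca_prompt` and the sample instruction are hypothetical.

```python
# Assumed template: the standard Stanford Alpaca "no input" prompt.
# This README does not confirm the exact wording, so treat it as an assumption.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_alpaca_prompt(instruction: str) -> str:
    """Wrap a plain instruction in the assumed Alpaca-style template."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())


prompt = build_alpaca_prompt("Name three kaiju.")
print(prompt)
```

The resulting string would be passed to the model as-is, with generation expected to continue after the `### Response:` marker.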