Commit 6696617 · Parent(s): c091ae8
Update README.md

File changed: README.md
Released August 11, 2023

## Model Description
GodziLLa 2 70B is an experimental combination of various proprietary LoRAs from Maya Philippines and the [Guanaco LLaMA 2 1K dataset](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k) with LLaMA 2 70B. The model's primary purpose is to stress test the limits of composite, instruction-following LLMs and to observe its performance relative to other LLMs on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). It debuted on the leaderboard at rank #4 (August 17, 2023), placed at rank #2 in the Fall 2023 update (November 10, 2023), and operates under the Llama 2 license.
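Conceptually, combining LoRAs with a base model comes down to merging each adapter's low-rank weight deltas into the base weights. A minimal sketch with Hugging Face PEFT, assuming a single hypothetical adapter (the actual Maya Philippines LoRAs are proprietary and the exact merge recipe is not published):

```python
# Hypothetical sketch of folding one LoRA adapter into a base model with PEFT.
# "maya-ph/example-lora" is a placeholder, not a real adapter repo; the true
# adapters and merge recipe behind GodziLLa 2 70B are not public.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # LLaMA 2 70B base weights (gated repo)
    device_map="auto",            # shard layers across available GPUs
)
model = PeftModel.from_pretrained(base, "maya-ph/example-lora")  # placeholder
merged = model.merge_and_unload()  # bake the adapter deltas into the weights
merged.save_pretrained("godzilla2-70b-merged")
```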

## Open LLM Leaderboard Metrics (Fall 2023 update)

| Metric              | Value |
|---------------------|-------|
| MMLU (5-shot)       | 69.88 |
| ARC (25-shot)       | 71.42 |
| HellaSwag (10-shot) | 87.53 |
| TruthfulQA (0-shot) | 61.54 |
| Winogrande (5-shot) | 83.19 |
| GSM8K (5-shot)      | 43.21 |
| DROP (3-shot)       | 52.31 |
| Average             | 67.01 |

According to the leaderboard description, here are the benchmarks used for the evaluation:
- [MMLU](https://arxiv.org/abs/2009.03300) (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- [AI2 Reasoning Challenge (ARC)](https://arxiv.org/abs/1803.05457) (25-shot) - a set of grade-school science questions.
- [HellaSwag](https://arxiv.org/abs/1905.07830) (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- [TruthfulQA](https://arxiv.org/abs/2109.07958) (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.
- [Winogrande](https://arxiv.org/abs/1907.10641) (5-shot) - an adversarial and difficult Winograd benchmark at scale, testing commonsense reasoning.
- [GSM8K](https://arxiv.org/abs/2110.14168) (5-shot) - diverse grade-school math word problems measuring a model’s ability to solve multi-step mathematical reasoning problems.
- [DROP](https://arxiv.org/abs/1903.00161) (3-shot) - an English reading-comprehension benchmark requiring Discrete Reasoning Over the content of Paragraphs.
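The leaderboard computes these scores with EleutherAI's lm-evaluation-harness. A sketch of reproducing a single benchmark locally, assuming a v0.4-style harness API (the leaderboard pins its own harness version and task configuration, so local numbers may differ):

```python
# Sketch: one leaderboard benchmark via EleutherAI's lm-evaluation-harness.
# Task name, few-shot count, and API style are assumptions; running a 70B
# evaluation also requires substantial GPU resources.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MayaPH/GodziLLa2-70B,dtype=auto",
    tasks=["hellaswag"],  # evaluated 10-shot on the leaderboard
    num_fewshot=10,
)
print(results["results"]["hellaswag"])
```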
A detailed breakdown of the evaluation can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_MayaPH__GodziLLa2-70B). Huge thanks to [@thomwolf](https://huggingface.co/thomwolf).

## Open LLM Leaderboard Metrics (before Fall 2023 update)
| Metric              | Value |
|---------------------|-------|
| MMLU (5-shot)       | 69.88 |
| ARC (25-shot)       | 71.42 |
| HellaSwag (10-shot) | 87.53 |
| TruthfulQA (0-shot) | 61.54 |
| Average             | 72.59 |
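Each Average row is the unweighted mean of the scores above it; a quick arithmetic check:

```python
# Check that each Average row is the plain mean of its table's scores.
fall_2023 = [69.88, 71.42, 87.53, 61.54, 83.19, 43.21, 52.31]
before_update = [69.88, 71.42, 87.53, 61.54]

print(round(sum(fall_2023) / len(fall_2023), 2))          # 67.01
print(round(sum(before_update) / len(before_update), 2))  # 72.59
```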
## Leaderboard Highlights (Fall 2023 update, November 10, 2023)
- GodziLLa 2 70B debuts at 2nd place worldwide on the newly updated Open LLM Leaderboard.
- GodziLLa 2 70B beats GPT-3.5 (ChatGPT) on average performance and on the HellaSwag benchmark (87.53 vs. 85.5).
- GodziLLa 2 70B outperforms both GPT-3.5 (ChatGPT) and GPT-4 on the TruthfulQA benchmark (61.54 for GodziLLa 2 70B vs. 47 for GPT-3.5 and 59 for GPT-4).
- GodziLLa 2 70B is on par with GPT-3.5 (ChatGPT) on the MMLU benchmark (within 0.12%).

*Based on a [leaderboard clone](https://huggingface.co/spaces/gsaivinay/open_llm_leaderboard) with GPT-3.5 and GPT-4 included.