Llama 2 7B HF

This is a placeholder model for davidsmts/Llama-2-7b-hf.

Paper

This model is based on research presented in "Llama 2: Open Foundation and Fine-Tuned Chat Models" (Touvron et al., 2023):

View on arXiv | View on Hugging Face

As the evaluation tables below show, Llama 2 models outperform their Llama 1 counterparts of comparable size across the grouped academic benchmarks.
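Usage

A minimal usage sketch, assuming this checkpoint follows the stock Llama 2 format and loads through the standard transformers API (untested against this specific repo):

```python
# Minimal usage sketch; assumes this checkpoint follows the standard
# Llama 2 format. The repo id below is the placeholder this card describes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "davidsmts/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```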
Evaluation Results
In this section, we report the results for the Llama 1 and Llama 2 models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library.
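Throughout, an "n-shot" evaluation prepends n solved examples to each prompt. The internal library is not public; the following is only a generic sketch of n-shot prompt assembly, with a hypothetical Q/A template:

```python
# Generic n-shot prompt assembly, as used for benchmarks such as MMLU
# (5-shot) or CommonsenseQA (7-shot). This is an illustration with a
# hypothetical Q/A template, not the internal evaluation library.
def build_prompt(shots: list[tuple[str, str]], question: str) -> str:
    blocks = [f"Q: {q}\nA: {a}" for q, a in shots]  # solved in-context examples
    blocks.append(f"Q: {question}\nA:")             # query the model must complete
    return "\n\n".join(blocks)

# 0-shot is simply the degenerate case with no in-context examples:
print(build_prompt([], "Which gas makes up most of Earth's atmosphere?"))
```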
| Model | Size | Code | Commonsense Reasoning | World Knowledge | Reading Comprehension | Math | MMLU | BBH | AGI Eval |
|---|---|---|---|---|---|---|---|---|---|
| Llama 1 | 7B | 14.1 | 60.8 | 46.2 | 58.5 | 6.95 | 35.1 | 30.3 | 23.9 |
| Llama 1 | 13B | 18.9 | 66.1 | 52.6 | 62.3 | 10.9 | 46.9 | 37.0 | 33.9 |
| Llama 1 | 33B | 26.0 | 70.0 | 58.4 | 67.6 | 21.4 | 57.8 | 39.8 | 41.7 |
| Llama 1 | 65B | 30.7 | 70.7 | 60.5 | 68.6 | 30.8 | 63.4 | 43.5 | 47.6 |
| Llama 2 | 7B | 16.8 | 63.9 | 48.9 | 61.3 | 14.6 | 45.3 | 32.6 | 29.3 |
| Llama 2 | 13B | 24.5 | 66.9 | 55.4 | 65.8 | 28.7 | 54.8 | 39.4 | 39.1 |
| Llama 2 | 70B | 37.5 | 71.9 | 63.6 | 69.4 | 35.2 | 68.9 | 51.2 | 54.2 |
Overall performance on grouped academic benchmarks:

- Code: we report the average pass@1 scores of our models on HumanEval and MBPP (a sketch of the pass@1 estimator follows this list).
- Commonsense Reasoning: we report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA. We report 7-shot results for CommonsenseQA and 0-shot results for all other benchmarks.
- World Knowledge: we evaluate the 5-shot performance on NaturalQuestions and TriviaQA and report the average.
- Reading Comprehension: we report the 0-shot average on SQuAD, QuAC, and BoolQ.
- Math: we report the average of the GSM8K (8-shot) and MATH (4-shot) benchmarks at top 1.
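For reference, the standard unbiased pass@k estimator (Chen et al., 2021) reduces at k = 1 to the fraction of samples that pass the unit tests. A short sketch; this is an illustration, not the internal evaluation code:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021);
# pass@1 is the k = 1 special case, which reduces to the fraction of
# samples that pass the unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples drawn per problem, c: samples that passed, k: budget."""
    if n - c < k:
        return 1.0  # any k-subset is guaranteed to contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(4, 1, 1) == 0.25  # 1 of 4 samples passed -> pass@1 = 25%
```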
| Model | Size | TruthfulQA | ToxiGen |
|---|---|---|---|
| Llama 1 | 7B | 27.42 | 23.00 |
| Llama 1 | 13B | 41.74 | 23.08 |
| Llama 1 | 33B | 44.19 | 22.57 |
| Llama 1 | 65B | 48.71 | 21.77 |
| Llama 2 | 7B | 33.29 | 21.25 |
| Llama 2 | 13B | 41.86 | 26.10 |
| Llama 2 | 70B | 50.18 | 24.60 |
Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).
| Model | Size | TruthfulQA | ToxiGen |
|---|---|---|---|
| Llama-2-Chat | 7B | 57.04 | 0.00 |
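Both safety numbers are simple percentages over per-generation labels. A sketch of how they aggregate; the label fields here are hypothetical stand-ins for the TruthfulQA judge outputs and the ToxiGen classifier flags:

```python
# How the two safety percentages aggregate. The label fields are
# hypothetical stand-ins for the TruthfulQA judge outputs and the
# ToxiGen toxicity-classifier flags.
def truthfulqa_score(judgements: list[dict]) -> float:
    """Percent of generations that are BOTH truthful and informative (higher is better)."""
    hits = sum(1 for j in judgements if j["truthful"] and j["informative"])
    return 100.0 * hits / len(judgements)

def toxigen_score(toxic_flags: list[bool]) -> float:
    """Percent of generations flagged as toxic (lower is better)."""
    return 100.0 * sum(toxic_flags) / len(toxic_flags)

print(truthfulqa_score([
    {"truthful": True, "informative": True},
    {"truthful": True, "informative": False},
]))  # 50.0
```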