Upload README.md with huggingface_hub
README.md CHANGED:

```diff
@@ -66,16 +66,6 @@ The 7B model is trained on 8 A100 GPUs. The learning rate (LR) is controlled by
 | 4 | \\(5e-1\\) | 16,000 | 33.55 |
 | 5 | \\(5e-1\\) | 16,500 | 34.60 |
 
-### Evaluation Benchmarks
-
-- **Code Generation**: We compute the average pass@1 scores on HumanEval (0-shot) and MBPP (3-shot).
-
-- **Commonsense Reasoning**: We report the average 0-shot perplexity (PPL) on PIQA, SIQA, HellaSwag, WinoGrande, and COPA.
-
-- **Reading Comprehension**: We compute the average 0-shot PPL on BoolQ and the 0-shot accuracies on LAMBADA and TyDi QA.
-
-- **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), and Big Bench Hard (BBH) (3-shot), as well as the average PPL on AGI-Eval (0-shot).
-
 ### Evaluation Results
 
 The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method achieving both high sparsity and performance comparable to the original Swish-activated LLaMA2. Note that models under all settings are trained with the same number of tokens on the same mixed dataset. Our evaluation is based on the framework [UltraEval](https://github.com/OpenBMB/UltraEval). The evaluation details are listed as follows:
@@ -86,7 +76,7 @@ The evaluation results on the above benchmarks demonstrate the advantage of ProS
 
 - **Reading Comprehension**: We compute the average 0-shot accuracies on BoolQ, LAMBADA, and TyDi QA.
 
-- **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and AGI-Eval (0-shot).
+- **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and AGI-Eval (0-shot).
 
 **Notes**: For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we obtain the predicted answers based on maximized perplexity. For GSM8K, MMLU, and BBH, the predicted answers are directly generated.
 
```
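For reference, pass@1 scores like those mentioned in the evaluation details are conventionally estimated with the unbiased pass@k estimator of Chen et al. (2021). The README does not state which estimator UltraEval uses, so the following Python sketch is an illustrative assumption rather than the project's actual code:

```python
# Hypothetical sketch of the unbiased pass@k estimator (Chen et al., 2021).
# The README does not specify the estimator, so this is an assumption.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k given n generated samples, c of which pass the tests.

    Computes 1 - C(n - c, k) / C(n, k): the probability that at least one
    of k samples drawn without replacement is correct. For a single sample
    per problem (n = 1, k = 1) this reduces to c / n.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of them passing; estimate pass@1.
print(pass_at_k(n=200, c=37, k=1))  # 0.185
```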
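The perplexity-based answer selection mentioned in the notes can be sketched as follows: each candidate answer to a multiple-choice question is scored by the language model, and the most likely completion (conventionally, the one with the lowest perplexity) is returned. The model name, prompt, and candidates below are placeholder assumptions for illustration; UltraEval's actual implementation may differ:

```python
# Illustrative sketch of perplexity-based answer selection; model name,
# prompt, and candidates are placeholders, not UltraEval's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    # Mean token-level cross-entropy over the sequence, exponentiated.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return torch.exp(loss).item()

question = "The capital of France is"
candidates = [" Paris.", " Berlin."]
# Choose the candidate the model finds most likely, i.e. the completion
# with the lowest perplexity.
best = min(candidates, key=lambda c: perplexity(question + c))
print(best)
```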