Commit 67b67aa
Parent: fc689fe
Update README.md
README.md CHANGED
@@ -26,7 +26,7 @@ tags:
 - [Released Weights](#released-weights)
 - [Benchmark Results](#benchmark-results)
 - [General Domain](#general-domain)
-- [
+- [Model Results](#model-results)
 - [Inference and Deployment](#inference-and-deployment)
 - [Dependency Installation](#dependency-installation)
 - [Notice](#notice)
@@ -86,7 +86,7 @@ In the general domain, we conducted 5-shot tests on the following datasets:
 - [CMMLU](https://github.com/haonan-li/CMMLU) is a comprehensive Chinese evaluation benchmark covering 67 topics, specifically designed to assess language models' knowledge and reasoning capabilities in a Chinese context. We adopted its [official](https://github.com/haonan-li/CMMLU) evaluation approach.
 
 
-###
+### Model Results
 **Performance Comparison on Commonsense Reasoning and Aggregated Benchmarks.** For a fair comparison, we report competing methods' results reproduced by us using their released models. PS: parameter size (billion). T: tokens (trillion). HS: HellaSwag. WG: WinoGrande.
 
 | Model | PS | T | BoolQ | PIQA | HS | WG | ARC-e | ARC-c | OBQA | MMLU | CMMLU | C-Eval |