Gregor Betz committed: Link to notebook (src/display/about.py, +2 −1)
src/display/about.py CHANGED

@@ -46,6 +46,7 @@ To assess the reasoning skill of a given `model`, we carry out the following ste
 
 Each `regime` yields a different _accuracy gain Δ_, and the leaderboard reports (for every `model`/`task`) the best Δ achieved by any regime. All models are evaluated against the same set of regimes.
 
+A notebook with detailed result exploration and visualization is available [here](https://github.com/logikon-ai/cot-eval/blob/main/notebooks/CoT_Leaderboard_Results_Exploration.ipynb).
 
 ## How is it different from other leaderboards?
 
@@ -68,7 +69,7 @@ Unlike these leaderboards, the `/\/` Open CoT Leaderboard assesses a model's abi
 
 ## Test dataset selection (`tasks`)
 
-The test dataset
+The test dataset problems in the CoT Leaderboard can be solved through clear thinking alone; no specific knowledge is required to do so. They are subsets of the [AGIEval benchmark](https://github.com/ruixiangcui/AGIEval) and re-published as [`logikon-bench`](https://huggingface.co/datasets/logikon/logikon-bench). The `logiqa` dataset has been newly translated from Chinese to English.
 
 
 ## Reproducibility
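The "best Δ achieved by any regime" rule described in the changed docs can be sketched in Python. This is a minimal illustration only: the model name, task, regime labels, and accuracy numbers below are invented for the example, and the leaderboard's actual evaluation pipeline lives in the `cot-eval` repository.

```python
# Sketch of the leaderboard's Δ selection rule (hypothetical data):
# each regime yields an accuracy gain Δ = cot_accuracy - baseline_accuracy,
# and the leaderboard reports the best Δ per model/task across all regimes.

accuracies = {
    # (model, task, regime): (baseline_accuracy, cot_accuracy) -- made-up numbers
    ("model-a", "logiqa", "regime-1"): (0.50, 0.58),
    ("model-a", "logiqa", "regime-2"): (0.50, 0.61),
    ("model-a", "logiqa", "regime-3"): (0.50, 0.55),
}

def best_delta(accuracies, model, task):
    """Return the largest accuracy gain Δ over all regimes for a model/task pair."""
    deltas = [
        cot - base
        for (m, t, _regime), (base, cot) in accuracies.items()
        if m == model and t == task
    ]
    return max(deltas)

print(round(best_delta(accuracies, "model-a", "logiqa"), 2))  # best Δ across the three regimes
```

Because every model is scored against the same set of regimes, taking the maximum Δ keeps the comparison fair across models.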