Gregor Betz committed: Link to notebook (src/display/about.py, +2 −1)
src/display/about.py CHANGED

@@ -46,6 +46,7 @@ To assess the reasoning skill of a given `model`, we carry out the following ste
 
 Each `regime` yields a different _accuracy gain Δ_, and the leaderboard reports (for every `model`/`task`) the best Δ achieved by any regime. All models are evaluated against the same set of regimes.
 
+A notebook with detailed result exploration and visualization is available [here](https://github.com/logikon-ai/cot-eval/blob/main/notebooks/CoT_Leaderboard_Results_Exploration.ipynb).
 
 ## How is it different from other leaderboards?
 
@@ -68,7 +69,7 @@ Unlike these leaderboards, the `/\/` Open CoT Leaderboard assesses a model's abi
 
 ## Test dataset selection (`tasks`)
 
-The test dataset
+The test dataset problems in the CoT Leaderboard can be solved through clear thinking alone; no specific knowledge is required to do so. They are subsets of the [AGIEval benchmark](https://github.com/ruixiangcui/AGIEval) and re-published as [`logikon-bench`](https://huggingface.co/datasets/logikon/logikon-bench). The `logiqa` dataset has been newly translated from Chinese to English.
 
 
 ## Reproducibility
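The "best Δ achieved by any regime" rule described in the changed docs can be sketched in Python. This is a minimal illustration only: the model name, task, regime labels, and accuracy numbers below are invented for the example, and the leaderboard's actual evaluation pipeline lives in the `cot-eval` repository.

```python
# Sketch of the leaderboard's Δ selection rule (hypothetical data):
# each regime yields an accuracy gain Δ = cot_accuracy - baseline_accuracy,
# and the leaderboard reports the best Δ per model/task across all regimes.

accuracies = {
    # (model, task, regime): (baseline_accuracy, cot_accuracy) -- made-up numbers
    ("model-a", "logiqa", "regime-1"): (0.50, 0.58),
    ("model-a", "logiqa", "regime-2"): (0.50, 0.61),
    ("model-a", "logiqa", "regime-3"): (0.50, 0.55),
}

def best_delta(accuracies, model, task):
    """Return the largest accuracy gain Δ over all regimes for a model/task pair."""
    deltas = [
        cot - base
        for (m, t, _regime), (base, cot) in accuracies.items()
        if m == model and t == task
    ]
    return max(deltas)

print(round(best_delta(accuracies, "model-a", "logiqa"), 2))  # best Δ across the three regimes
```

Because every model is scored against the same set of regimes, taking the maximum Δ keeps the comparison fair across models.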