Gregor Betz committed: "table"

src/display/about.py CHANGED (+26 -7)
@@ -49,16 +49,35 @@ Each `regime` has a different _accuracy gain Δ_, and the leaderboard reports (f
 
 ## How is it different from other leaderboards?
 
-Performance leaderboards like the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) or [YALL](https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard) do a great job in ranking models according task performance.
+Performance leaderboards like the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) or [YALL](https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard) do a great job in ranking models according to task performance.
 
 Unlike these leaderboards, the `/\/` Open CoT Leaderboard assesses a model's ability to effectively reason about a `task`:
 
-
-
-
-
-
-
+
+<table>
+  <tr style="text-align:center;">
+    <td>🤗 Open LLM Leaderboard</td>
+    <td>`/\/` Open CoT Leaderboard</td>
+  </tr>
+  <tr>
+    <td>Can `model` solve `task`?</td>
+    <td>Can `model` do CoT to improve in `task`?</td>
+  </tr>
+  <tr>
+    <td>Measures `task` performance.</td>
+    <td>Measures ability to reason (about `task`).</td>
+  </tr>
+  <tr>
+    <td>Metric: absolute accuracy.</td>
+    <td>Metric: relative accuracy gain.</td>
+  </tr>
+  <tr>
+    <td>Covers broad spectrum of `tasks`.</td>
+    <td>Focuses on critical thinking `tasks`.</td>
+  </tr>
+</table>
+
+
 
 
 ## Test dataset selection (`tasks`)
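
The substantive contrast the new table draws is the metric: absolute accuracy versus the accuracy gain Δ that chain-of-thought prompting yields. The hunk header's definition of Δ is cut off in this diff, so the sketch below only illustrates the plain-difference reading; every name in it is hypothetical and not taken from src/display/about.py or the evaluation code.

```python
# Hypothetical sketch of an accuracy gain Δ, as contrasted in the table
# above: accuracy with chain-of-thought minus accuracy without it.
# None of these names come from the repository.

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions matching the gold answers."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def accuracy_gain(baseline_preds: list[str],
                  cot_preds: list[str],
                  gold: list[str]) -> float:
    """Δ = accuracy with CoT minus baseline accuracy.

    A positive Δ means chain-of-thought reasoning helped on this task;
    the Open CoT Leaderboard ranks by this gain, not by absolute accuracy.
    """
    return accuracy(cot_preds, gold) - accuracy(baseline_preds, gold)

# Toy example: CoT lifts accuracy from 0.50 to 0.75, so Δ = 0.25.
gold = ["A", "C", "B", "D"]
base = ["A", "B", "B", "A"]  # 2/4 correct without CoT
cot  = ["A", "C", "B", "A"]  # 3/4 correct with CoT
print(accuracy_gain(base, cot, gold))  # 0.25
```

Under this reading, a weak and a strong model can post the same Δ at very different absolute accuracies, which is the sense in which the leaderboard measures reasoning ability rather than raw task performance.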