Gregor Betz committed on
description
src/display/about.py +1 -1
@@ -44,7 +44,7 @@ To assess the reasoning skill of a given `model`, we carry out the following ste
3. `model` answers the test dataset problems _with the reasoning traces appended_ to the prompt, we record the resulting _CoT accuracy_.
4. We compute the _accuracy gain Δ_ = _CoT accuracy_ — _baseline accuracy_ for the given `model`, `task`, and `regime`.

-Each `regime`
+Each `regime` yields a different _accuracy gain Δ_, and the leaderboard reports (for every `model`/`task`) the best Δ achieved by any regime. All models are evaluated against the same set of regimes.


## How is it different from other leaderboards?
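
The aggregation described by the added line (compute Δ per `regime`, then report the best Δ for every `model`/`task`) can be sketched roughly as follows. This is only an illustration, not the leaderboard's actual code: the record layout, the function names `accuracy_gain` and `best_gain_per_model_task`, and the sample values are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def accuracy_gain(cot_accuracy: float, baseline_accuracy: float) -> float:
    """Accuracy gain Δ = CoT accuracy - baseline accuracy (step 4 above)."""
    return cot_accuracy - baseline_accuracy


def best_gain_per_model_task(
    records: Iterable[dict],
) -> Dict[Tuple[str, str], Tuple[str, float]]:
    """For every (model, task) pair, keep the regime with the largest Δ.

    Each record is assumed (hypothetically) to be a dict with the keys
    "model", "task", "regime", "baseline_accuracy" and "cot_accuracy".
    """
    best: Dict[Tuple[str, str], Tuple[str, float]] = {}
    for record in records:
        key = (record["model"], record["task"])
        delta = accuracy_gain(record["cot_accuracy"], record["baseline_accuracy"])
        # Keep this regime only if it beats the best Δ seen so far for the pair.
        if key not in best or delta > best[key][1]:
            best[key] = (record["regime"], delta)
    return best


# Made-up example values, chosen only to illustrate the aggregation.
records = [
    {"model": "model-a", "task": "task-x", "regime": "regime-1",
     "baseline_accuracy": 0.5, "cot_accuracy": 0.625},
    {"model": "model-a", "task": "task-x", "regime": "regime-2",
     "baseline_accuracy": 0.5, "cot_accuracy": 0.75},
    {"model": "model-a", "task": "task-y", "regime": "regime-1",
     "baseline_accuracy": 0.5, "cot_accuracy": 0.25},
]
best = best_gain_per_model_task(records)
```

Note that under this reading the best Δ can still be negative, as for `task-y` in the sample data, when no regime improves on the baseline.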