Commit ·
c65bc42
1
Parent(s): 28f48c2
updates
- AGENTS.md +0 -153
- LICENSE.txt +36 -0
- README.md +14 -0
- app.py +73 -57
- config.yaml +106 -8
AGENTS.md
DELETED
|
@@ -1,153 +0,0 @@
|
|
| 1 |
-
# AGENTS.md
|
| 2 |
-
|
| 3 |
-
This repository is a small Hugging Face Space-style app that renders a model
|
| 4 |
-
leaderboard from local CSV files under `results/` (`results/all_results_*.csv`).
|
| 5 |
-
|
| 6 |
-
## Quickstart
|
| 7 |
-
|
| 8 |
-
- Python: use `python3` (3.10+ recommended)
|
| 9 |
-
- App entrypoint: `app.py`
|
| 10 |
-
- Data: CSVs under `results/` by default
|
| 11 |
-
|
| 12 |
-
Create a local venv:
|
| 13 |
-
|
| 14 |
-
```bash
|
| 15 |
-
python -m venv .venv
|
| 16 |
-
source .venv/bin/activate
|
| 17 |
-
python -m pip install -U pip
|
| 18 |
-
python -m pip install gradio pandas
|
| 19 |
-
```
|
| 20 |
-
|
| 21 |
-
Run the app locally:
|
| 22 |
-
|
| 23 |
-
```bash
|
| 24 |
-
python app.py
|
| 25 |
-
# open http://127.0.0.1:7860
|
| 26 |
-
```
|
| 27 |
-
|
| 28 |
-
Configure CSV location/pattern:
|
| 29 |
-
|
| 30 |
-
```bash
|
| 31 |
-
LEADERBOARD_CSV_GLOB='results/*.csv' python app.py
|
| 32 |
-
```
|
| 33 |
-
|
| 34 |
-
Default pattern is `results/all_results_*.csv`.
|
| 35 |
-
|
| 36 |
-
Common environment variables:
|
| 37 |
-
|
| 38 |
-
- `LEADERBOARD_CSV_GLOB`: glob passed to `glob.glob()` (default `results/all_results_*.csv`)
|
| 39 |
-
- `PORT`: Gradio server port (default `7860`)
|
| 40 |
-
- `SERVER_NAME`: bind host (default `127.0.0.1`)
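The variables listed above could be read roughly as in the following sketch. The function name `read_settings` and the dictionary keys are hypothetical and are not taken from `app.py`; only the variable names and their defaults come from the list.

```python
import os


def read_settings(env=None):
    """Collect the documented environment variables with their defaults.

    `env` is any mapping; it defaults to the process environment.
    Hypothetical helper for illustration only, not part of app.py.
    """
    env = os.environ if env is None else env
    return {
        # Glob pattern that selects the result CSVs.
        "csv_glob": env.get("LEADERBOARD_CSV_GLOB", "results/all_results_*.csv"),
        # Environment values are strings, so the port is converted explicitly.
        "port": int(env.get("PORT", "7860")),
        # Host the server binds to.
        "server_name": env.get("SERVER_NAME", "127.0.0.1"),
    }
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.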
|
| 41 |
-
|
| 42 |
-
## Build / Lint / Test
|
| 43 |
-
|
| 44 |
-
There is currently no formal build system or test suite configured.
|
| 45 |
-
|
| 46 |
-
Recommended commands for agents (safe, minimal):
|
| 47 |
-
|
| 48 |
-
- Type-check (optional): set up `pyright` locally if desired
|
| 49 |
-
- `python -m pip install pyright`
|
| 50 |
-
- `pyright app.py`
|
| 51 |
-
|
| 52 |
-
- Format (optional): use `black` for consistent formatting
|
| 53 |
-
- `python -m pip install black`
|
| 54 |
-
- `black app.py`
|
| 55 |
-
|
| 56 |
-
- Lint (optional): use `ruff`
|
| 57 |
-
- `python -m pip install ruff`
|
| 58 |
-
- `ruff check app.py`
|
| 59 |
-
|
| 60 |
-
If you add tests (preferred: `pytest`):
|
| 61 |
-
|
| 62 |
-
- Run all tests:
|
| 63 |
-
- `pytest`
|
| 64 |
-
- Run a single test file:
|
| 65 |
-
- `pytest tests/test_app.py`
|
| 66 |
-
- Run a single test case:
|
| 67 |
-
- `pytest tests/test_app.py -k test_pivot_results`
|
| 68 |
-
|
| 69 |
-
When adding tooling, keep it lightweight (single-file project) and avoid
|
| 70 |
-
introducing heavy configuration unless requested.
|
| 71 |
-
|
| 72 |
-
## CSV Input Contract
|
| 73 |
-
|
| 74 |
-
The app expects “long-form” rows that look like:
|
| 75 |
-
|
| 76 |
-
- `model`: model name
|
| 77 |
-
- `task`: task identifier
|
| 78 |
-
- `score`: numeric score
|
| 79 |
-
|
| 80 |
-
Optional columns used for filtering/processing:
|
| 81 |
-
|
| 82 |
-
- `date`: parsed with `pandas.to_datetime`
|
| 83 |
-
- `type_score`, `task_specifics`, `score_std`, `source_file`
|
| 84 |
-
|
| 85 |
-
Notes:
|
| 86 |
-
|
| 87 |
-
- CSVs may contain `Unnamed: 0` (pandas index column); the app drops `Unnamed:*`.
|
| 88 |
-
- If `model` is missing, it is inferred from the filename prefix `all_results_`.
|
| 89 |
-
- Pivot behavior: one row per model, tasks become columns.
|
| 90 |
-
- If multiple rows exist for the same `(model, task)`, the app uses `mean`.
|
| 91 |
-
- If “Latest only” is enabled and `date` exists, the latest row per
|
| 92 |
-
`(model, task)` is used.
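The pivot rules in these notes can be illustrated with a dependency-free sketch. It is an assumption-laden analogue, not the app's `pivot_results`: the function name `pivot_rows`, the list-of-dicts input, and the use of ISO date strings (which sort chronologically as text) are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pivot_rows(rows, latest_only=False):
    """Turn long-form rows into {model: {task: score}}.

    Each row is a dict with `model`, `task`, `score` and optionally `date`
    (assumed here to be an ISO string such as "2025-02-01").
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["model"], row["task"])].append(row)

    wide = defaultdict(dict)
    for (model, task), group in groups.items():
        if latest_only and all("date" in r for r in group):
            # Keep only the most recent row for this (model, task) pair.
            value = max(group, key=lambda r: r["date"])["score"]
        else:
            # Duplicate rows for the same pair are averaged.
            value = mean(r["score"] for r in group)
        wide[model][task] = value
    return {model: dict(tasks) for model, tasks in wide.items()}
```

With two rows for the same model and task, the default call averages them, while `latest_only=True` returns the score of the newer row.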
|
| 93 |
-
|
| 94 |
-
## Code Style Guidelines
|
| 95 |
-
|
| 96 |
-
### General
|
| 97 |
-
|
| 98 |
-
- Keep changes minimal; this is a small, single-file app.
|
| 99 |
-
- Prefer clarity over cleverness; prioritize maintainability.
|
| 100 |
-
- Avoid adding side effects at import time (keep I/O in functions).
|
| 101 |
-
|
| 102 |
-
### Formatting
|
| 103 |
-
|
| 104 |
-
- Follow PEP 8.
|
| 105 |
-
- Prefer 88-char lines (Black default) if you introduce formatting.
|
| 106 |
-
- Use double quotes for strings (current style in `app.py`).
|
| 107 |
-
|
| 108 |
-
### Imports
|
| 109 |
-
|
| 110 |
-
- Standard library imports first, then third party (`gradio`, `pandas`).
|
| 111 |
-
- Group imports and keep them sorted.
|
| 112 |
-
- Avoid importing unused modules; remove dead imports.
|
| 113 |
-
|
| 114 |
-
### Types
|
| 115 |
-
|
| 116 |
-
- Type annotate public helpers where it adds clarity, but don’t over-annotate
|
| 117 |
-
pandas-heavy code (pandas typing is noisy).
|
| 118 |
-
- Prefer built-in generics (`list[str]`, `dict[str, str]`) on Python 3.9+.
|
| 119 |
-
|
| 120 |
-
### Naming
|
| 121 |
-
|
| 122 |
-
- Functions: `snake_case`.
|
| 123 |
-
- Constants: `UPPER_SNAKE_CASE`.
|
| 124 |
-
- Local variables: descriptive (`filtered`, `combined`, `task_cols`).
|
| 125 |
-
|
| 126 |
-
### Error Handling
|
| 127 |
-
|
| 128 |
-
- Fail fast with actionable messages.
|
| 129 |
-
- Example: if no CSV files match, raise `FileNotFoundError` with hints.
|
| 130 |
-
- When parsing/transforming data, be tolerant:
|
| 131 |
-
- Use `errors="coerce"` for datetime/numeric conversions.
|
| 132 |
-
- Handle empty frames gracefully (return empty tables).
|
| 133 |
-
|
| 134 |
-
### Data Processing Conventions
|
| 135 |
-
|
| 136 |
-
- Keep data normalization in one place (`load_results`).
|
| 137 |
-
- Keep pivot/aggregation logic in dedicated helpers (`pivot_results`,
|
| 138 |
-
`summarize_overall`).
|
| 139 |
-
- When changing pivot semantics (e.g., incorporate `type_score`), do it
|
| 140 |
-
explicitly and keep UI in sync.
|
| 141 |
-
|
| 142 |
-
### UI Conventions (Gradio)
|
| 143 |
-
|
| 144 |
-
- Keep filters at the top; leaderboard table below.
|
| 145 |
-
- Avoid expensive recomputation inside callbacks; reuse loaded data and only
|
| 146 |
-
derive filtered/pivoted views.
|
| 147 |
-
|
| 148 |
-
## Cursor / Copilot Rules
|
| 149 |
-
|
| 150 |
-
- No Cursor rules found (`.cursor/rules/` or `.cursorrules`).
|
| 151 |
-
- No GitHub Copilot instructions found (`.github/copilot-instructions.md`).
|
| 152 |
-
|
| 153 |
-
If you add those in the future, update this file to reflect them.
|
LICENSE.txt
ADDED
|
@@ -0,0 +1,36 @@
|
|
| 1 |
+
------------------------------------------------------------------------------
|
| 2 |
+
Copyright Notice Fraunhofer leaderboard
|
| 3 |
+
------------------------------------------------------------------------------
|
| 4 |
+
|
| 5 |
+
/****************************************************************************\
|
| 6 |
+
* (C) Copyright Fraunhofer IIS and its partners (2025)
|
| 7 |
+
* All Rights Reserved
|
| 8 |
+
*
|
| 9 |
+
* Please be advised that this software and/or program delivery is
|
| 10 |
+
* Confidential Information of Fraunhofer and subject to and covered by the
|
| 11 |
+
*
|
| 12 |
+
* "Konsortialvereinbarung über die Zusammenarbeit im BMWE - Verbundprojekt »EU-SAI - Souveräne KI für Europa« (Soofi)"
|
| 13 |
+
* and its Amendments (if any) between you/your company or institution and Fraunhofer IIS
|
| 14 |
+
*
|
| 15 |
+
* You may use this software and/or program only under the terms and
|
| 16 |
+
* conditions described in the above mentioned Konsortialvereinbarung über die Zusammenarbeit im BMWE - Verbundprojekt
|
| 17 |
+
* »EU-SAI - Souveräne KI für Europa« (Soofi)«. Any other and/or further use requires a separate
|
| 18 |
+
* agreement.
|
| 19 |
+
*
|
| 20 |
+
* This software and/or program is protected by copyright law and
|
| 21 |
+
* international treaties. Any reproduction or distribution of this software
|
| 22 |
+
* and/or program, or any portion of it, may result in severe civil and
|
| 23 |
+
* criminal penalties, and will be prosecuted to the maximum extent possible
|
| 24 |
+
* under law.
|
| 25 |
+
\***************************************************************************/
|
| 26 |
+
|
| 27 |
+
------------------------------------------------------------------------------
|
| 28 |
+
Third Party Legal Notices
|
| 29 |
+
------------------------------------------------------------------------------
|
| 30 |
+
Parts of this package may include 3rd party software licensed under terms that
|
| 31 |
+
require Fraunhofer to display certain legal notices. Legal notices are
|
| 32 |
+
found in the folder ThirdPartyLegalNotices/
|
| 33 |
+
|
| 34 |
+
------------------------------------------------------------------------------
|
| 35 |
+
(C) copyright 2025 by Fraunhofer IIS
|
| 36 |
+
------------------------------------------------------------------------------
|
README.md
CHANGED
|
@@ -16,6 +16,8 @@ created by the `soofi-eval` framework.
|
|
| 16 |
|
| 17 |
## Install
|
| 18 |
|
|
|
|
|
|
|
| 19 |
```bash
|
| 20 |
python3 -m venv .venv
|
| 21 |
source .venv/bin/activate
|
|
@@ -23,6 +25,18 @@ python -m pip install -U pip
|
|
| 23 |
python -m pip install -e .
|
| 24 |
```
|
| 25 |
|
| 26 |
## Run
|
| 27 |
|
| 28 |
By default, the app expects the CSV files to live under the `results/` directory.
|
|
|
|
| 16 |
|
| 17 |
## Install
|
| 18 |
|
| 19 |
+
### Local install
|
| 20 |
+
|
| 21 |
```bash
|
| 22 |
python3 -m venv .venv
|
| 23 |
source .venv/bin/activate
|
|
|
|
| 25 |
python -m pip install -e .
|
| 26 |
```
|
| 27 |
|
| 28 |
+
### Hugging Face Spaces
|
| 29 |
+
|
| 30 |
+
This repo is compatible with Hugging Face Spaces (Gradio) and uses the root
|
| 31 |
+
`app.py` as the Space entrypoint.
|
| 32 |
+
|
| 33 |
+
On Spaces, dependencies are installed from `requirements.txt` automatically.
|
| 34 |
+
If you're running a similar containerized setup, you can also install with:
|
| 35 |
+
|
| 36 |
+
```bash
|
| 37 |
+
python -m pip install -r requirements.txt
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
## Run
|
| 41 |
|
| 42 |
By default, the app expects the CSV files to live under the `results/` directory.
|
app.py
CHANGED
|
@@ -17,6 +17,7 @@ class LeaderboardConfig:
|
|
| 17 |
server_port: int = 7860
|
| 18 |
results_dir: str = "results"
|
| 19 |
csv_pattern: str = "*.csv"
|
|
|
|
| 20 |
title: str = "Leaderboard"
|
| 21 |
description_md: str | None = None
|
| 22 |
task_groups: list[dict] | None = None
|
|
@@ -296,7 +297,8 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
|
|
| 296 |
def apply_filters(
|
| 297 |
selected_tasks: list[str],
|
| 298 |
selected_models: list[str],
|
| 299 |
-
|
|
|
|
| 300 |
) -> tuple[pd.DataFrame, str]:
|
| 301 |
# Use the global df loaded at startup instead of reloading
|
| 302 |
if df.empty:
|
|
@@ -337,20 +339,17 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
|
|
| 337 |
if selected_models and "model" in filtered.columns:
|
| 338 |
filtered = filtered[filtered["model"].astype(str).isin(selected_models)]
|
| 339 |
|
| 340 |
-
|
| 341 |
-
|
| 342 |
-
|
| 343 |
-
|
| 344 |
-
|
| 345 |
-
|
| 346 |
-
|
| 347 |
-
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
needle, na=False
|
| 352 |
-
)
|
| 353 |
-
filtered = filtered[mask]
|
| 354 |
|
| 355 |
try:
|
| 356 |
return pivot_results(filtered), ""
|
|
@@ -362,7 +361,7 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
|
|
| 362 |
)
|
| 363 |
return pd.DataFrame(), msg
|
| 364 |
|
| 365 |
-
wide0, err0 = apply_filters([], [], "")
|
| 366 |
if init_error:
|
| 367 |
err0 = init_error
|
| 368 |
wide0 = pd.DataFrame()
|
|
@@ -374,26 +373,31 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
|
|
| 374 |
|
| 375 |
with gr.Column():
|
| 376 |
with gr.Row():
|
| 377 |
-
|
| 378 |
-
|
| 379 |
-
|
| 380 |
-
|
| 381 |
-
|
| 382 |
-
|
| 383 |
-
|
| 384 |
-
|
| 385 |
-
|
| 386 |
-
|
| 387 |
-
|
| 388 |
-
|
| 389 |
-
|
| 390 |
-
|
| 391 |
-
|
|
|
|
|
|
|
| 392 |
|
| 393 |
with gr.Row():
|
| 394 |
apply_btn = gr.Button("✅ Apply")
|
| 395 |
reset_btn = gr.Button("🧹 Reset Filters")
|
| 396 |
-
|
|
|
|
|
|
|
|
|
|
| 397 |
|
| 398 |
presets = load_button_presets(config)
|
| 399 |
preset_buttons: list[gr.Button] = []
|
|
@@ -416,61 +420,73 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
|
|
| 416 |
wrap=False,
|
| 417 |
max_height=760,
|
| 418 |
visible=len(df) > 0,
|
|
|
|
| 419 |
)
|
| 420 |
|
| 421 |
def _apply_with_tasks(
|
| 422 |
-
preset: list[str],
|
|
|
|
|
|
|
|
|
|
| 423 |
) -> tuple[list[str], pd.DataFrame, dict]:
|
| 424 |
new_tasks = _set_tasks(preset)
|
| 425 |
-
table, err = apply_filters(
|
| 426 |
-
|
|
|
|
|
|
|
| 427 |
|
| 428 |
-
def _apply_reset(
|
| 429 |
-
|
| 430 |
-
|
| 431 |
-
new_tasks: list[str] = []
|
| 432 |
-
table, err = apply_filters(new_tasks, selected_models, q)
|
| 433 |
-
return new_tasks, table, gr.update(value=err, visible=bool(err))
|
| 434 |
|
| 435 |
for preset, btn in zip(presets, preset_buttons):
|
| 436 |
preset_tasks = preset.get("tasks", [])
|
| 437 |
|
| 438 |
btn.click(
|
| 439 |
-
fn=lambda selected_models,
|
| 440 |
preset_tasks,
|
| 441 |
selected_models,
|
| 442 |
-
|
|
|
|
| 443 |
),
|
| 444 |
-
inputs=[models,
|
| 445 |
-
outputs=[tasks, wide_df, error_md],
|
| 446 |
)
|
| 447 |
|
| 448 |
-
def _apply_from_controls(
|
| 449 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 450 |
return table, gr.update(value=err, visible=bool(err))
|
| 451 |
|
| 452 |
apply_btn.click(
|
| 453 |
fn=_apply_from_controls,
|
| 454 |
-
inputs=[tasks, models,
|
| 455 |
outputs=[wide_df, error_md],
|
| 456 |
)
|
| 457 |
reset_btn.click(
|
| 458 |
fn=_apply_reset,
|
| 459 |
-
|
| 460 |
-
outputs=[tasks, wide_df, error_md],
|
| 461 |
)
|
| 462 |
|
| 463 |
-
def _reload_data(
|
|
|
|
|
|
|
| 464 |
nonlocal df
|
| 465 |
df = load_results(config)
|
| 466 |
-
table, err = apply_filters(
|
|
|
|
|
|
|
| 467 |
return table, gr.update(value=err, visible=bool(err))
|
| 468 |
|
| 469 |
-
reload_btn
|
| 470 |
-
|
| 471 |
-
|
| 472 |
-
|
| 473 |
-
|
|
|
|
| 474 |
|
| 475 |
return demo
|
| 476 |
|
|
|
|
| 17 |
server_port: int = 7860
|
| 18 |
results_dir: str = "results"
|
| 19 |
csv_pattern: str = "*.csv"
|
| 20 |
+
enable_data_reload: bool = True
|
| 21 |
title: str = "Leaderboard"
|
| 22 |
description_md: str | None = None
|
| 23 |
task_groups: list[dict] | None = None
|
|
|
|
| 297 |
def apply_filters(
|
| 298 |
selected_tasks: list[str],
|
| 299 |
selected_models: list[str],
|
| 300 |
+
search_task: str,
|
| 301 |
+
search_model: str,
|
| 302 |
) -> tuple[pd.DataFrame, str]:
|
| 303 |
# Use the global df loaded at startup instead of reloading
|
| 304 |
if df.empty:
|
|
|
|
| 339 |
if selected_models and "model" in filtered.columns:
|
| 340 |
filtered = filtered[filtered["model"].astype(str).isin(selected_models)]
|
| 341 |
|
| 342 |
+
for search, col in [(search_task, "task"), (search_model, "model")]:
|
| 343 |
+
if search.strip():
|
| 344 |
+
needle = search.strip().lower()
|
| 345 |
+
hay_cols = [col] if col in filtered.columns else []
|
| 346 |
+
if hay_cols:
|
| 347 |
+
mask = False
|
| 348 |
+
for c in hay_cols:
|
| 349 |
+
mask = mask | filtered[c].astype(str).str.lower().str.contains(
|
| 350 |
+
needle, na=False
|
| 351 |
+
)
|
| 352 |
+
filtered = filtered[mask]
|
|
|
|
|
|
|
|
|
|
| 353 |
|
| 354 |
try:
|
| 355 |
return pivot_results(filtered), ""
|
|
|
|
| 361 |
)
|
| 362 |
return pd.DataFrame(), msg
|
| 363 |
|
| 364 |
+
wide0, err0 = apply_filters([], [], "", "")
|
| 365 |
if init_error:
|
| 366 |
err0 = init_error
|
| 367 |
wide0 = pd.DataFrame()
|
|
|
|
| 373 |
|
| 374 |
with gr.Column():
|
| 375 |
with gr.Row():
|
| 376 |
+
with gr.Column():
|
| 377 |
+
tasks = gr.Dropdown(
|
| 378 |
+
choices=task_choices,
|
| 379 |
+
value=[],
|
| 380 |
+
multiselect=True,
|
| 381 |
+
label="Tasks",
|
| 382 |
+
)
|
| 383 |
+
search_task = gr.Textbox(label="Search task", placeholder="task regex")
|
| 384 |
+
|
| 385 |
+
with gr.Column():
|
| 386 |
+
models = gr.Dropdown(
|
| 387 |
+
choices=model_choices,
|
| 388 |
+
value=[],
|
| 389 |
+
multiselect=True,
|
| 390 |
+
label="Models",
|
| 391 |
+
)
|
| 392 |
+
search_model = gr.Textbox(label="Search model", placeholder="model regex")
|
| 393 |
|
| 394 |
with gr.Row():
|
| 395 |
apply_btn = gr.Button("✅ Apply")
|
| 396 |
reset_btn = gr.Button("🧹 Reset Filters")
|
| 397 |
+
if config.enable_data_reload:
|
| 398 |
+
reload_btn = gr.Button("🔄 Reload Data")
|
| 399 |
+
else:
|
| 400 |
+
reload_btn = None
|
| 401 |
|
| 402 |
presets = load_button_presets(config)
|
| 403 |
preset_buttons: list[gr.Button] = []
|
|
|
|
| 420 |
wrap=False,
|
| 421 |
max_height=760,
|
| 422 |
visible=len(df) > 0,
|
| 423 |
+
pinned_columns=2,
|
| 424 |
)
|
| 425 |
|
| 426 |
def _apply_with_tasks(
|
| 427 |
+
preset: list[str],
|
| 428 |
+
selected_models: list[str],
|
| 429 |
+
searched_tasks: str,
|
| 430 |
+
searched_models: str,
|
| 431 |
) -> tuple[list[str], str, pd.DataFrame, dict]:
|
| 432 |
new_tasks = _set_tasks(preset)
|
| 433 |
+
table, err = apply_filters(
|
| 434 |
+
new_tasks, selected_models, searched_tasks, searched_models
|
| 435 |
+
)
|
| 436 |
+
return new_tasks, "", table, gr.update(value=err, visible=bool(err))
|
| 437 |
|
| 438 |
+
def _apply_reset() -> tuple[list[str], list[str], str, str, pd.DataFrame, dict]:
|
| 439 |
+
table, err = apply_filters([], [], "", "")
|
| 440 |
+
return [], [], "", "", table, gr.update(value=err, visible=bool(err))
|
|
|
|
|
|
|
|
|
|
| 441 |
|
| 442 |
for preset, btn in zip(presets, preset_buttons):
|
| 443 |
preset_tasks = preset.get("tasks", [])
|
| 444 |
|
| 445 |
btn.click(
|
| 446 |
+
fn=lambda selected_models, sm, preset_tasks=preset_tasks: _apply_with_tasks(
|
| 447 |
preset_tasks,
|
| 448 |
selected_models,
|
| 449 |
+
"",
|
| 450 |
+
sm,
|
| 451 |
),
|
| 452 |
+
inputs=[models, search_model],
|
| 453 |
+
outputs=[tasks, search_task, wide_df, error_md],
|
| 454 |
)
|
| 455 |
|
| 456 |
+
def _apply_from_controls(
|
| 457 |
+
selected_tasks, selected_models, searched_tasks: str, searched_models: str
|
| 458 |
+
):
|
| 459 |
+
table, err = apply_filters(
|
| 460 |
+
selected_tasks, selected_models, searched_tasks, searched_models
|
| 461 |
+
)
|
| 462 |
return table, gr.update(value=err, visible=bool(err))
|
| 463 |
|
| 464 |
apply_btn.click(
|
| 465 |
fn=_apply_from_controls,
|
| 466 |
+
inputs=[tasks, models, search_task, search_model],
|
| 467 |
outputs=[wide_df, error_md],
|
| 468 |
)
|
| 469 |
reset_btn.click(
|
| 470 |
fn=_apply_reset,
|
| 471 |
+
outputs=[tasks, models, search_task, search_model, wide_df, error_md],
|
|
|
|
| 472 |
)
|
| 473 |
|
| 474 |
+
def _reload_data(
|
| 475 |
+
selected_tasks, selected_models, searched_tasks: str, searched_models: str
|
| 476 |
+
):
|
| 477 |
nonlocal df
|
| 478 |
df = load_results(config)
|
| 479 |
+
table, err = apply_filters(
|
| 480 |
+
selected_tasks, selected_models, searched_tasks, searched_models
|
| 481 |
+
)
|
| 482 |
return table, gr.update(value=err, visible=bool(err))
|
| 483 |
|
| 484 |
+
if reload_btn:
|
| 485 |
+
reload_btn.click(
|
| 486 |
+
fn=_reload_data,
|
| 487 |
+
inputs=[tasks, models, search_task, search_model],
|
| 488 |
+
outputs=[wide_df, error_md],
|
| 489 |
+
)
|
| 490 |
|
| 491 |
return demo
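Review note on the new search boxes: the filter in this diff lowercases both the column and the query and then calls pandas `str.contains`, which interprets the query as a regular expression by default. Lowercasing a regex can change its meaning (for example `\D` becomes `\d`). One possible alternative, sketched here without pandas, is to leave the pattern untouched and ask the regex engine for case-insensitive matching. The name `search_rows` and the list-of-dicts input are hypothetical, and the sketch assumes the column exists in every row.

```python
import re


def search_rows(rows, column, query):
    """Return rows whose `column` value matches `query` as a regex.

    Matching is case-insensitive; a blank query keeps every row.
    An invalid pattern raises `re.error`, which a caller may want to
    catch and report to the user.
    """
    query = query.strip()
    if not query:
        return list(rows)
    pattern = re.compile(query, re.IGNORECASE)
    return [row for row in rows if pattern.search(str(row[column]))]
```

For example, the query `gsm` would match both `gsm8k` and `GSM_Plus`, and a pattern such as `\d{2}$` would keep only task names that end in two digits.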
|
| 492 |
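Review note on `preset_tasks=preset_tasks` in the preset-button lambda: Python closures look up loop variables when they are called, not when they are created, so binding the value as a default argument is what gives each button its own task list. The following standalone sketch shows the difference; both helper names are hypothetical and Gradio is not involved.

```python
def make_callbacks_late(presets):
    """Callbacks that read the loop variable when they are called."""
    callbacks = []
    for preset in presets:
        # `preset` is resolved at call time, after the loop has finished.
        callbacks.append(lambda: preset["tasks"])
    return callbacks


def make_callbacks_bound(presets):
    """Callbacks that capture the current value as a default argument."""
    callbacks = []
    for preset in presets:
        # The default value is evaluated now, once per iteration.
        callbacks.append(lambda tasks=preset["tasks"]: tasks)
    return callbacks
```

With two presets, every late-bound callback returns the tasks of the last preset, while the bound callbacks return their own lists.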
|
config.yaml
CHANGED
|
@@ -1,25 +1,123 @@
|
|
| 1 |
results_dir: results
|
| 2 |
csv_pattern: "*.csv"
|
|
|
|
| 3 |
|
| 4 |
-
title: "
|
| 5 |
# Optional markdown shown under the title.
|
| 6 |
description_md: |
|
| 7 |
- Use the task group buttons for quick presets.
|
| 8 |
-
- Use 🔄 Reload Data after dropping in new CSVs.
|
| 9 |
|
| 10 |
# Optional task groups that results in buttons filtering tasks on the UI
|
| 11 |
task_groups:
|
| 12 |
-
- id: base
|
| 13 |
-
label: "🧱 Base"
|
| 14 |
-
tasks: []
|
| 15 |
- id: instruct
|
| 16 |
label: "💬 Instruct"
|
| 17 |
-
tasks:
|
|
| 18 |
- id: reasoning
|
| 19 |
label: "🧠 Reasoning"
|
| 20 |
tasks:
|
| 21 |
-
- aime24
|
| 22 |
-
- aime25
|
| 23 |
|
| 24 |
server_name: "0.0.0.0"
|
| 25 |
server_port: 7860
|
|
|
|
| 1 |
results_dir: results
|
| 2 |
csv_pattern: "*.csv"
|
| 3 |
+
enable_data_reload: false
|
| 4 |
|
| 5 |
+
title: "Leaderboard"
|
| 6 |
# Optional markdown shown under the title.
|
| 7 |
description_md: |
|
| 8 |
- Use the task group buttons for quick presets.
|
|
|
|
| 9 |
|
| 10 |
# Optional task groups that results in buttons filtering tasks on the UI
|
| 11 |
task_groups:
|
|
|
|
|
|
|
|
|
|
| 12 |
- id: instruct
|
| 13 |
label: "💬 Instruct"
|
| 14 |
+
tasks:
|
| 15 |
+
- hellaswag
|
| 16 |
+
- arc_challenge
|
| 17 |
+
- truthfulqa_mc2
|
| 18 |
+
- mmlu
|
| 19 |
+
- global_piqa_completions_eng_latn
|
| 20 |
+
- winogrande
|
| 21 |
+
- openbookqa
|
| 22 |
+
- leaderboard_bbh
|
| 23 |
+
- ogx_hellaswagx_de
|
| 24 |
+
- ogx_hellaswagx_fr
|
| 25 |
+
- ogx_hellaswagx_es
|
| 26 |
+
- ogx_hellaswagx_it
|
| 27 |
+
- ogx_arcx_challenge_de
|
| 28 |
+
- ogx_arcx_challenge_fr
|
| 29 |
+
- ogx_arcx_challenge_es
|
| 30 |
+
- ogx_arcx_challenge_it
|
| 31 |
+
- ogx_truthfulqax_mc2_de
|
| 32 |
+
- ogx_truthfulqax_mc2_fr
|
| 33 |
+
- ogx_truthfulqax_mc2_es
|
| 34 |
+
- ogx_truthfulqax_mc2_it
|
| 35 |
+
- global_mmlu_de
|
| 36 |
+
- global_mmlu_fr
|
| 37 |
+
- global_mmlu_es
|
| 38 |
+
- global_mmlu_it
|
| 39 |
+
- global_piqa_completions_deu_latn
|
| 40 |
+
- global_piqa_completions_fra_latn_fran
|
| 41 |
+
- global_piqa_completions_ita_latn
|
| 42 |
+
- global_piqa_completions_spa_latn_spai
|
| 43 |
+
- leaderboard_ifeval
|
| 44 |
+
- ifbench
|
| 45 |
+
- ifeval_DE
|
| 46 |
+
- ifbench_DE
|
| 47 |
+
- ifbench_FR
|
| 48 |
+
- ifbench_IT
|
| 49 |
+
- ifbench_ES
|
| 50 |
+
- gsm_plus
|
| 51 |
+
- gsm8k
|
| 52 |
+
- ogx_gsm8kx_de
|
| 53 |
+
- ogx_gsm8kx_fr
|
| 54 |
+
- ogx_gsm8kx_es
|
| 55 |
+
- ogx_gsm8kx_it
|
| 56 |
- id: reasoning
|
| 57 |
label: "🧠 Reasoning"
|
| 58 |
tasks:
|
| 59 |
+
- aime24
|
| 60 |
+
- aime25
|
| 61 |
+
- gpqa_diamond
|
| 62 |
+
- id: EN
|
| 63 |
+
label: "🇬🇧"
|
| 64 |
+
tasks:
|
| 65 |
+
- hellaswag
|
| 66 |
+
- arc_challenge
|
| 67 |
+
- truthfulqa_mc2
|
| 68 |
+
- mmlu
|
| 69 |
+
- global_piqa_completions_eng_latn
|
| 70 |
+
- winogrande
|
| 71 |
+
- openbookqa
|
| 72 |
+
- leaderboard_bbh
|
| 73 |
+
- leaderboard_ifeval
|
| 74 |
+
- ifbench
|
| 75 |
+
- gsm_plus
|
| 76 |
+
- gsm8k
|
| 77 |
+
- aime24
|
| 78 |
+
- aime25
|
| 79 |
+
- gpqa_diamond
|
| 80 |
+
- id: DE
|
| 81 |
+
label: "🇩🇪"
|
| 82 |
+
tasks:
|
| 83 |
+
- ogx_hellaswagx_de
|
| 84 |
+
- ogx_arcx_challenge_de
|
| 85 |
+
- ogx_truthfulqax_mc2_de
|
| 86 |
+
- ogx_gsm8kx_de
|
| 87 |
+
- global_mmlu_de
|
| 88 |
+
- global_piqa_completions_deu_latn
|
| 89 |
+
- ifeval_DE
|
| 90 |
+
- ifbench_DE
|
| 91 |
+
- id: FR
|
| 92 |
+
label: "🇫🇷"
|
| 93 |
+
tasks:
|
| 94 |
+
- ogx_hellaswagx_fr
|
| 95 |
+
- ogx_arcx_challenge_fr
|
| 96 |
+
- ogx_truthfulqax_mc2_fr
|
| 97 |
+
- ogx_gsm8kx_fr
|
| 98 |
+
- global_mmlu_fr
|
| 99 |
+
- global_piqa_completions_fra_latn_fran
|
| 100 |
+
- ifbench_FR
|
| 101 |
+
- id: IT
|
| 102 |
+
label: "🇮🇹"
|
| 103 |
+
tasks:
|
| 104 |
+
- ogx_hellaswagx_it
|
| 105 |
+
- ogx_arcx_challenge_it
|
| 106 |
+
- ogx_truthfulqax_mc2_it
|
| 107 |
+
- ogx_gsm8kx_it
|
| 108 |
+
- global_mmlu_it
|
| 109 |
+
- global_piqa_completions_ita_latn
|
| 110 |
+
- ifbench_IT
|
| 111 |
+
- id: ES
|
| 112 |
+
label: "🇪🇸"
|
| 113 |
+
tasks:
|
| 114 |
+
- ogx_hellaswagx_es
|
| 115 |
+
- ogx_arcx_challenge_es
|
| 116 |
+
- ogx_truthfulqax_mc2_es
|
| 117 |
+
- ogx_gsm8kx_es
|
| 118 |
+
- global_mmlu_es
|
| 119 |
+
- global_piqa_completions_spa_latn_spai
|
| 120 |
+
- ifbench_ES
|
| 121 |
|
| 122 |
server_name: "0.0.0.0"
|
| 123 |
server_port: 7860
|