Space: viktorhangya/test_leaderboard


viktorhangya committed on Feb 12

Commit c65bc42 (1 parent: 28f48c2)

updates

Files changed (5)

AGENTS.md +0 -153
LICENSE.txt +36 -0
README.md +14 -0
app.py +73 -57
config.yaml +106 -8

AGENTS.md DELETED

@@ -1,153 +0,0 @@
-# AGENTS.md
-This repository is a small Hugging Face Space-style app that renders a model
-leaderboard from local CSV files under `results/` (`results/all_results_*.csv`).
-## Quickstart
-- Python: use `python3` (3.10+ recommended)
-- App entrypoint: `app.py`
-- Data: CSVs under `results/` by default
-Create a local venv:
-```bash
-python -m venv .venv
-source .venv/bin/activate
-python -m pip install -U pip
-python -m pip install gradio pandas
-```
-Run the app locally:
-```bash
-python app.py
-# open http://127.0.0.1:7860
-```
-Configure CSV location/pattern:
-```bash
-LEADERBOARD_CSV_GLOB='results/*.csv' python app.py
-```
-Default pattern is `results/all_results_*.csv`.
-Common environment variables:
-- `LEADERBOARD_CSV_GLOB`: glob passed to `glob.glob()` (default `results/all_results_*.csv`)
-- `PORT`: Gradio server port (default `7860`)
-- `SERVER_NAME`: bind host (default `127.0.0.1`)
-## Build / Lint / Test
-There is currently no formal build system or test suite configured.
-Recommended commands for agents (safe, minimal):
-- Type-check (optional): set up `pyright` locally if desired
-  - `python -m pip install pyright`
-  - `pyright app.py`
-- Format (optional): use `black` for consistent formatting
-  - `python -m pip install black`
-  - `black app.py`
-- Lint (optional): use `ruff`
-  - `python -m pip install ruff`
-  - `ruff check app.py`
-If you add tests (preferred: `pytest`):
-- Run all tests:
-  - `pytest`
-- Run a single test file:
-  - `pytest tests/test_app.py`
-- Run a single test case:
-  - `pytest tests/test_app.py -k test_pivot_results`
-When adding tooling, keep it lightweight (single-file project) and avoid
-introducing heavy configuration unless requested.
-## CSV Input Contract
-The app expects “long-form” rows that look like:
-- `model`: model name
-- `task`: task identifier
-- `score`: numeric score
-Optional columns used for filtering/processing:
-- `date`: parsed with `pandas.to_datetime`
-- `type_score`, `task_specifics`, `score_std`, `source_file`
-Notes:
-- CSVs may contain `Unnamed: 0` (pandas index column); the app drops `Unnamed:*`.
-- If `model` is missing, it is inferred from the filename prefix `all_results_`.
-- Pivot behavior: one row per model, tasks become columns.
-  - If multiple rows exist for the same `(model, task)`, the app uses `mean`.
-  - If “Latest only” is enabled and `date` exists, the latest row per
-    `(model, task)` is used.
-## Code Style Guidelines
-### General
-- Keep changes minimal; this is a small, single-file app.
-- Prefer clarity over cleverness; prioritize maintainability.
-- Avoid adding side effects at import time (keep I/O in functions).
-### Formatting
-- Follow PEP 8.
-- Prefer 88-char lines (Black default) if you introduce formatting.
-- Use double quotes for strings (current style in `app.py`).
-### Imports
-- Standard library imports first, then third party (`gradio`, `pandas`).
-- Group imports and keep them sorted.
-- Avoid importing unused modules; remove dead imports.
-### Types
-- Type annotate public helpers where it adds clarity, but don’t over-annotate
-  pandas-heavy code (pandas typing is noisy).
-- Prefer built-in generics (`list[str]`, `dict[str, str]`) on Python 3.9+.
-### Naming
-- Functions: `snake_case`.
-- Constants: `UPPER_SNAKE_CASE`.
-- Local variables: descriptive (`filtered`, `combined`, `task_cols`).
-### Error Handling
-- Fail fast with actionable messages.
-  - Example: if no CSV files match, raise `FileNotFoundError` with hints.
-- When parsing/transforming data, be tolerant:
-  - Use `errors="coerce"` for datetime/numeric conversions.
-  - Handle empty frames gracefully (return empty tables).
-### Data Processing Conventions
-- Keep data normalization in one place (`load_results`).
-- Keep pivot/aggregation logic in dedicated helpers (`pivot_results`,
-  `summarize_overall`).
-- When changing pivot semantics (e.g., incorporate `type_score`), do it
-  explicitly and keep UI in sync.
-### UI Conventions (Gradio)
-- Keep filters at the top; leaderboard table below.
-- Avoid expensive recomputation inside callbacks; reuse loaded data and only
-  derive filtered/pivoted views.
-## Cursor / Copilot Rules
-- No Cursor rules found (`.cursor/rules/` or `.cursorrules`).
-- No GitHub Copilot instructions found (`.github/copilot-instructions.md`).
-If you add those in the future, update this file to reflect them.
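
The pivot rules listed in the deleted "CSV Input Contract" section (mean over duplicate `(model, task)` rows, or the latest row when `date` exists and "Latest only" is enabled) can be sketched in plain Python. This only illustrates the semantics described there; it is not the implementation in `app.py`, and the helper name `pivot_rows` and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pivot_rows(rows, latest_only=False):
    """Return {model: {task: score}} from long-form rows (hypothetical helper)."""
    # Collect every row that shares the same (model, task) key.
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["model"], row["task"])].append(row)

    wide = defaultdict(dict)
    for (model, task), group in grouped.items():
        if latest_only and all("date" in r for r in group):
            # ISO-8601 date strings sort chronologically as plain strings.
            chosen = max(group, key=lambda r: r["date"])
            wide[model][task] = chosen["score"]
        else:
            # Default behaviour described in the contract: average duplicates.
            wide[model][task] = mean(r["score"] for r in group)
    return {model: dict(tasks) for model, tasks in wide.items()}


rows = [
    {"model": "m1", "task": "gsm8k", "score": 0.5, "date": "2025-01-01"},
    {"model": "m1", "task": "gsm8k", "score": 0.75, "date": "2025-02-01"},
    {"model": "m2", "task": "gsm8k", "score": 0.4, "date": "2025-01-15"},
]
print(pivot_rows(rows))
print(pivot_rows(rows, latest_only=True))
```

With the sample rows, the default call averages the two `m1` scores, while `latest_only=True` keeps only the row dated `2025-02-01`.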

LICENSE.txt ADDED

@@ -0,0 +1,36 @@

+------------------------------------------------------------------------------
+Copyright Notice Fraunhofer leaderboard
+------------------------------------------------------------------------------
+/****************************************************************************\
+*  (C) Copyright Fraunhofer IIS and its partners (2025)
+*                           All Rights Reserved
+*
+* Please be advised that this software and/or program delivery is
+* Confidential Information of Fraunhofer and subject to and covered by the
+*
+* "Konsortialvereinbarung über die Zusammenarbeit im BMWE - Verbundprojekt »EU-SAI - Souveräne KI für Europa« (Soofi)"
+* and its Amendments (if any) between you/your company or institution and Fraunhofer IIS
+*
+* You may use this software and/or program only under the terms and
+* conditions described in the above mentioned Konsortialvereinbarung über die Zusammenarbeit im BMWE - Verbundprojekt
+* »EU-SAI - Souveräne KI für Europa« (Soofi)«. Any other and/or further use requires a separate
+* agreement.
+*
+* This software and/or program is protected by copyright law and
+* international treaties. Any reproduction or distribution of this software
+* and/or program, or any portion of it, may result in severe civil and
+* criminal penalties, and will be prosecuted to the maximum extent possible
+* under law.
+\***************************************************************************/
+------------------------------------------------------------------------------
+Third Party Legal Notices
+------------------------------------------------------------------------------
+Parts of this package may include 3rd party software licensed under terms that
+require Fraunhofer to display certain legal notices. Legal notices are
+found in the folder ThirdPartyLegalNotices/
+------------------------------------------------------------------------------
+(C) copyright 2025 by Fraunhofer IIS
+------------------------------------------------------------------------------

README.md CHANGED

@@ -16,6 +16,8 @@ created by the `soofi-eval` framework.
 ## Install
 ```bash
 python3 -m venv .venv
 source .venv/bin/activate
@@ -23,6 +25,18 @@ python -m pip install -U pip
 python -m pip install -e .
 ```
 ## Run
 By default, the app expects the CSV files to live under the `results/` directory.

 ## Install
+### Local install
 ```bash
 python3 -m venv .venv
 source .venv/bin/activate
 python -m pip install -e .
 ```
+### Hugging Face Spaces
+This repo is compatible with Hugging Face Spaces (Gradio) and uses the root
+`app.py` as the Space entrypoint.
+On Spaces, dependencies are installed from `requirements.txt` automatically.
+If you're running a similar containerized setup, you can also install with:
+```bash
+python -m pip install -r requirements.txt
+```
 ## Run
 By default, the app expects the CSV files to live under the `results/` directory.

app.py CHANGED

@@ -17,6 +17,7 @@ class LeaderboardConfig:
     server_port: int = 7860
     results_dir: str = "results"
     csv_pattern: str = "*.csv"
     title: str = "Leaderboard"
     description_md: str | None = None
     task_groups: list[dict] | None = None
@@ -296,7 +297,8 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
     def apply_filters(
         selected_tasks: list[str],
         selected_models: list[str],
-        search: str,
     ) -> tuple[pd.DataFrame, str]:
         # Use the global df loaded at startup instead of reloading
         if df.empty:
@@ -337,20 +339,17 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
         if selected_models and "model" in filtered.columns:
             filtered = filtered[filtered["model"].astype(str).isin(selected_models)]
-        if search.strip():
-            needle = search.strip().lower()
-            hay_cols = [
-                c
-                for c in ["model", "task", "type_score", "source_file"]
-                if c in filtered.columns
-            ]
-            if hay_cols:
-                mask = False
-                for c in hay_cols:
-                    mask = mask | filtered[c].astype(str).str.lower().str.contains(
-                        needle, na=False
-                    )
-                filtered = filtered[mask]
         try:
             return pivot_results(filtered), ""
@@ -362,7 +361,7 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
             )
             return pd.DataFrame(), msg
-    wide0, err0 = apply_filters([], [], "")
     if init_error:
         err0 = init_error
         wide0 = pd.DataFrame()
@@ -374,26 +373,31 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
         with gr.Column():
             with gr.Row():
-                tasks = gr.Dropdown(
-                    choices=task_choices,
-                    value=[],
-                    multiselect=True,
-                    label="Tasks (columns)",
-                )
-                models = gr.Dropdown(
-                    choices=model_choices,
-                    value=[],
-                    multiselect=True,
-                    label="Models",
-                )
-            with gr.Row():
-                search = gr.Textbox(label="Search", placeholder="model/task/type…")
             with gr.Row():
                 apply_btn = gr.Button("✅ Apply")
                 reset_btn = gr.Button("🧹 Reset Filters")
-                reload_btn = gr.Button("📊 Reload Data")
             presets = load_button_presets(config)
             preset_buttons: list[gr.Button] = []
@@ -416,61 +420,73 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
                 wrap=False,
                 max_height=760,
                 visible=len(df) > 0,
             )
             def _apply_with_tasks(
-                preset: list[str], selected_models, q: str
             ) -> tuple[list[str], pd.DataFrame, dict]:
                 new_tasks = _set_tasks(preset)
-                table, err = apply_filters(new_tasks, selected_models, q)
-                return new_tasks, table, gr.update(value=err, visible=bool(err))
-            def _apply_reset(
-                selected_models, q: str
-            ) -> tuple[list[str], pd.DataFrame, dict]:
-                new_tasks: list[str] = []
-                table, err = apply_filters(new_tasks, selected_models, q)
-                return new_tasks, table, gr.update(value=err, visible=bool(err))
             for preset, btn in zip(presets, preset_buttons):
                 preset_tasks = preset.get("tasks", [])
                 btn.click(
-                    fn=lambda selected_models, q, preset_tasks=preset_tasks: _apply_with_tasks(
                         preset_tasks,
                         selected_models,
-                        q,
                     ),
-                    inputs=[models, search],
-                    outputs=[tasks, wide_df, error_md],
                 )
-        def _apply_from_controls(selected_tasks, selected_models, q: str):
-            table, err = apply_filters(selected_tasks, selected_models, q)
             return table, gr.update(value=err, visible=bool(err))
         apply_btn.click(
             fn=_apply_from_controls,
-            inputs=[tasks, models, search],
             outputs=[wide_df, error_md],
         )
         reset_btn.click(
             fn=_apply_reset,
-            inputs=[models, search],
-            outputs=[tasks, wide_df, error_md],
         )
-        def _reload_data(selected_tasks, selected_models, q: str):
             nonlocal df
             df = load_results(config)
-            table, err = apply_filters(selected_tasks, selected_models, q)
             return table, gr.update(value=err, visible=bool(err))
-        reload_btn.click(
-            fn=_reload_data,
-            inputs=[tasks, models, search],
-            outputs=[wide_df, error_md],
-        )
     return demo

     server_port: int = 7860
     results_dir: str = "results"
     csv_pattern: str = "*.csv"
+    enable_data_reload: bool = True
     title: str = "Leaderboard"
     description_md: str | None = None
     task_groups: list[dict] | None = None
     def apply_filters(
         selected_tasks: list[str],
         selected_models: list[str],
+        search_task: str,
+        search_model: str,
     ) -> tuple[pd.DataFrame, str]:
         # Use the global df loaded at startup instead of reloading
         if df.empty:
         if selected_models and "model" in filtered.columns:
             filtered = filtered[filtered["model"].astype(str).isin(selected_models)]
+        for search, col in [(search_task, "task"), (search_model, "model")]:
+            if search.strip():
+                needle = search.strip().lower()
+                hay_cols = [col] if col in filtered.columns else []
+                if hay_cols:
+                    mask = False
+                    for c in hay_cols:
+                        mask = mask | filtered[c].astype(str).str.lower().str.contains(
+                            needle, na=False
+                        )
+                    filtered = filtered[mask]
         try:
             return pivot_results(filtered), ""
             )
             return pd.DataFrame(), msg
+    wide0, err0 = apply_filters([], [], "", "")
     if init_error:
         err0 = init_error
         wide0 = pd.DataFrame()
         with gr.Column():
             with gr.Row():
+                with gr.Column():
+                    tasks = gr.Dropdown(
+                        choices=task_choices,
+                        value=[],
+                        multiselect=True,
+                        label="Tasks",
+                    )
+                    search_task = gr.Textbox(label="Search task", placeholder="task regex")
+                with gr.Column():
+                    models = gr.Dropdown(
+                        choices=model_choices,
+                        value=[],
+                        multiselect=True,
+                        label="Models",
+                    )
+                    search_model = gr.Textbox(label="Search model", placeholder="model regex")
             with gr.Row():
                 apply_btn = gr.Button("✅ Apply")
                 reset_btn = gr.Button("🧹 Reset Filters")
+                if config.enable_data_reload:
+                    reload_btn = gr.Button("📊 Reload Data")
+                else:
+                    reload_btn = None
             presets = load_button_presets(config)
             preset_buttons: list[gr.Button] = []
                 wrap=False,
                 max_height=760,
                 visible=len(df) > 0,
+                pinned_columns=2,
             )
             def _apply_with_tasks(
+                preset: list[str],
+                selected_models: list[str],
+                searched_tasks: str,
+                searched_models: str,
             ) -> tuple[list[str], pd.DataFrame, dict]:
                 new_tasks = _set_tasks(preset)
+                table, err = apply_filters(
+                    new_tasks, selected_models, searched_tasks, searched_models
+                )
+                return new_tasks, "", table, gr.update(value=err, visible=bool(err))
+            def _apply_reset() -> tuple[list[str], list[str], str, str, pd.DataFrame, dict]:
+                table, err = apply_filters([], [], "", "")
+                return [], [], "", "", table, gr.update(value=err, visible=bool(err))
             for preset, btn in zip(presets, preset_buttons):
                 preset_tasks = preset.get("tasks", [])
                 btn.click(
+                    fn=lambda selected_models, sm, preset_tasks=preset_tasks: _apply_with_tasks(
                         preset_tasks,
                         selected_models,
+                        "",
+                        sm,
                     ),
+                    inputs=[models, search_model],
+                    outputs=[tasks, search_task, wide_df, error_md],
                 )
+        def _apply_from_controls(
+            selected_tasks, selected_models, searched_tasks: str, searched_models: str
+        ):
+            table, err = apply_filters(
+                selected_tasks, selected_models, searched_tasks, searched_models
+            )
             return table, gr.update(value=err, visible=bool(err))
         apply_btn.click(
             fn=_apply_from_controls,
+            inputs=[tasks, models, search_task, search_model],
             outputs=[wide_df, error_md],
         )
         reset_btn.click(
             fn=_apply_reset,
+            outputs=[tasks, models, search_task, search_model, wide_df, error_md],
         )
+        def _reload_data(
+            selected_tasks, selected_models, searched_tasks: str, searched_models: str
+        ):
             nonlocal df
             df = load_results(config)
+            table, err = apply_filters(
+                selected_tasks, selected_models, searched_tasks, searched_models
+            )
             return table, gr.update(value=err, visible=bool(err))
+        if reload_btn:
+            reload_btn.click(
+                fn=_reload_data,
+                inputs=[tasks, models, search_task, search_model],
+                outputs=[wide_df, error_md],
+            )
     return demo
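
The new `search_task` and `search_model` boxes use "task regex" / "model regex" placeholders, while the filter lowercases the needle before passing it to `str.contains`. Lowercasing a pattern can change its meaning (for example `\D` becomes `\d`). A minimal sketch of one alternative, assuming case-insensitive regex matching with a literal fallback for invalid patterns; `matches` and `filter_rows` are hypothetical plain-Python helpers, not code from `app.py`, and unlike the commit they treat a missing column as a non-match rather than skipping the filter.

```python
import re


def matches(needle: str, value: str) -> bool:
    """Case-insensitive regex search; invalid patterns fall back to literal text."""
    needle = needle.strip()
    if not needle:
        return True  # an empty search box filters nothing
    try:
        pattern = re.compile(needle, re.IGNORECASE)
    except re.error:
        # Treat an unparsable pattern as a plain substring instead of failing.
        pattern = re.compile(re.escape(needle), re.IGNORECASE)
    return pattern.search(value) is not None


def filter_rows(rows, search_task: str = "", search_model: str = ""):
    """Keep rows whose task AND model match their respective search strings."""
    return [
        r
        for r in rows
        if matches(search_task, str(r.get("task", "")))
        and matches(search_model, str(r.get("model", "")))
    ]


rows = [
    {"model": "Llama-3", "task": "gsm8k"},
    {"model": "Llama-3", "task": "ogx_gsm8kx_de"},
    {"model": "Qwen", "task": "mmlu"},
]
print(filter_rows(rows, search_task="gsm8k", search_model="llama"))
print(filter_rows(rows, search_task="_de$"))
```

Using `re.IGNORECASE` keeps the pattern itself untouched, so character classes such as `\D` retain their meaning.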

config.yaml CHANGED

@@ -1,25 +1,123 @@
 results_dir: results
 csv_pattern: "*.csv"
-title: "SOOFI Leaderboard"
 # Optional markdown shown under the title.
 description_md: |
   - Use the task group buttons for quick presets.
-  - Use 📊 Reload Data after dropping in new CSVs.
 # Optional task groups that results in buttons filtering tasks on the UI
 task_groups:
-  - id: base
-    label: "🧱 Base"
-    tasks: []
   - id: instruct
     label: "💬 Instruct"
-    tasks: []
   - id: reasoning
     label: "🧠 Reasoning"
     tasks:
-      - aime24:chat_template
-      - aime25:chat_template
 server_name: "0.0.0.0"
 server_port: 7860

 results_dir: results
 csv_pattern: "*.csv"
+enable_data_reload: false
+title: "Leaderboard"
 # Optional markdown shown under the title.
 description_md: |
   - Use the task group buttons for quick presets.
 # Optional task groups that results in buttons filtering tasks on the UI
 task_groups:
   - id: instruct
     label: "💬 Instruct"
+    tasks:
+      - hellaswag
+      - arc_challenge
+      - truthfulqa_mc2
+      - mmlu
+      - global_piqa_completions_eng_latn
+      - winogrande
+      - openbookqa
+      - leaderboard_bbh
+      - ogx_hellaswagx_de
+      - ogx_hellaswagx_fr
+      - ogx_hellaswagx_es
+      - ogx_hellaswagx_it
+      - ogx_arcx_challenge_de
+      - ogx_arcx_challenge_fr
+      - ogx_arcx_challenge_es
+      - ogx_arcx_challenge_it
+      - ogx_truthfulqax_mc2_de
+      - ogx_truthfulqax_mc2_fr
+      - ogx_truthfulqax_mc2_es
+      - ogx_truthfulqax_mc2_it
+      - global_mmlu_de
+      - global_mmlu_fr
+      - global_mmlu_es
+      - global_mmlu_it
+      - global_piqa_completions_deu_latn
+      - global_piqa_completions_fra_latn_fran
+      - global_piqa_completions_ita_latn
+      - global_piqa_completions_spa_latn_spai
+      - leaderboard_ifeval
+      - ifbench
+      - ifeval_DE
+      - ifbench_DE
+      - ifbench_FR
+      - ifbench_IT
+      - ifbench_ES
+      - gsm_plus
+      - gsm8k
+      - ogx_gsm8kx_de
+      - ogx_gsm8kx_fr
+      - ogx_gsm8kx_es
+      - ogx_gsm8kx_it
   - id: reasoning
     label: "🧠 Reasoning"
     tasks:
+      - aime24
+      - aime25
+      - gpqa_diamond
+  - id: EN
+    label: "🇬🇧"
+    tasks:
+      - hellaswag
+      - arc_challenge
+      - truthfulqa_mc2
+      - mmlu
+      - global_piqa_completions_eng_latn
+      - winogrande
+      - openbookqa
+      - leaderboard_bbh
+      - leaderboard_ifeval
+      - ifbench
+      - gsm_plus
+      - gsm8k
+      - aime24
+      - aime25
+      - gpqa_diamond
+  - id: DE
+    label: "🇩🇪"
+    tasks:
+      - ogx_hellaswagx_de
+      - ogx_arcx_challenge_de
+      - ogx_truthfulqax_mc2_de
+      - ogx_gsm8kx_de
+      - global_mmlu_de
+      - global_piqa_completions_deu_latn
+      - ifeval_DE
+      - ifbench_DE
+  - id: FR
+    label: "🇫🇷"
+    tasks:
+      - ogx_hellaswagx_fr
+      - ogx_arcx_challenge_fr
+      - ogx_truthfulqax_mc2_fr
+      - ogx_gsm8kx_fr
+      - global_mmlu_fr
+      - global_piqa_completions_fra_latn_fran
+      - ifbench_FR
+  - id: IT
+    label: "🇮🇹"
+    tasks:
+      - ogx_hellaswagx_it
+      - ogx_arcx_challenge_it
+      - ogx_truthfulqax_mc2_it
+      - ogx_gsm8kx_it
+      - global_mmlu_it
+      - global_piqa_completions_ita_latn
+      - ifbench_IT
+  - id: ES
+    label: "🇪🇸"
+    tasks:
+      - ogx_hellaswagx_es
+      - ogx_arcx_challenge_es
+      - ogx_truthfulqax_mc2_es
+      - ogx_gsm8kx_es
+      - global_mmlu_es
+      - global_piqa_completions_spa_latn_spai
+      - ifbench_ES
 server_name: "0.0.0.0"
 server_port: 7860
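
The `enable_data_reload: false` key lines up with the `enable_data_reload` field added to `LeaderboardConfig` in `app.py`. How the app maps the YAML onto the dataclass is not shown in this commit, so the sketch below is an assumption: it takes the parsed YAML as a plain dict and uses a trimmed, illustrative copy of the dataclass together with a hypothetical `config_from_dict` helper.

```python
from dataclasses import dataclass, fields


@dataclass
class LeaderboardConfig:
    # Trimmed, illustrative copy; defaults follow the fields visible in the diff.
    server_port: int = 7860
    results_dir: str = "results"
    csv_pattern: str = "*.csv"
    enable_data_reload: bool = True
    title: str = "Leaderboard"


def config_from_dict(raw: dict) -> LeaderboardConfig:
    """Build a config from parsed YAML, ignoring keys the dataclass lacks."""
    known = {f.name for f in fields(LeaderboardConfig)}
    return LeaderboardConfig(**{k: v for k, v in raw.items() if k in known})


# "server_name" is dropped here only because the trimmed copy omits that field.
cfg = config_from_dict({"enable_data_reload": False, "server_name": "0.0.0.0"})
print(cfg.enable_data_reload)
print(cfg.title)
```

Because the field defaults to `True`, a config file that omits the key keeps the reload button; only an explicit `false`, as in this commit, hides it.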