viktorhangya committed on
Commit
c65bc42
·
1 Parent(s): 28f48c2
Files changed (5)
  1. AGENTS.md +0 -153
  2. LICENSE.txt +36 -0
  3. README.md +14 -0
  4. app.py +73 -57
  5. config.yaml +106 -8
AGENTS.md DELETED
@@ -1,153 +0,0 @@
1
- # AGENTS.md
2
-
3
- This repository is a small Hugging Face Space-style app that renders a model
4
- leaderboard from local CSV files under `results/` (`results/all_results_*.csv`).
5
-
6
- ## Quickstart
7
-
8
- - Python: use `python3` (3.10+ recommended)
9
- - App entrypoint: `app.py`
10
- - Data: CSVs under `results/` by default
11
-
12
- Create a local venv:
13
-
14
- ```bash
15
- python -m venv .venv
16
- source .venv/bin/activate
17
- python -m pip install -U pip
18
- python -m pip install gradio pandas
19
- ```
20
-
21
- Run the app locally:
22
-
23
- ```bash
24
- python app.py
25
- # open http://127.0.0.1:7860
26
- ```
27
-
28
- Configure CSV location/pattern:
29
-
30
- ```bash
31
- LEADERBOARD_CSV_GLOB='results/*.csv' python app.py
32
- ```
33
-
34
- Default pattern is `results/all_results_*.csv`.
35
-
36
- Common environment variables:
37
-
38
- - `LEADERBOARD_CSV_GLOB`: glob passed to `glob.glob()` (default `results/all_results_*.csv`)
39
- - `PORT`: Gradio server port (default `7860`)
40
- - `SERVER_NAME`: bind host (default `127.0.0.1`)
41
-
42
- ## Build / Lint / Test
43
-
44
- There is currently no formal build system or test suite configured.
45
-
46
- Recommended commands for agents (safe, minimal):
47
-
48
- - Type-check (optional): set up `pyright` locally if desired
49
- - `python -m pip install pyright`
50
- - `pyright app.py`
51
-
52
- - Format (optional): use `black` for consistent formatting
53
- - `python -m pip install black`
54
- - `black app.py`
55
-
56
- - Lint (optional): use `ruff`
57
- - `python -m pip install ruff`
58
- - `ruff check app.py`
59
-
60
- If you add tests (preferred: `pytest`):
61
-
62
- - Run all tests:
63
- - `pytest`
64
- - Run a single test file:
65
- - `pytest tests/test_app.py`
66
- - Run a single test case:
67
- - `pytest tests/test_app.py -k test_pivot_results`
68
-
69
- When adding tooling, keep it lightweight (single-file project) and avoid
70
- introducing heavy configuration unless requested.
71
-
72
- ## CSV Input Contract
73
-
74
- The app expects “long-form” rows that look like:
75
-
76
- - `model`: model name
77
- - `task`: task identifier
78
- - `score`: numeric score
79
-
80
- Optional columns used for filtering/processing:
81
-
82
- - `date`: parsed with `pandas.to_datetime`
83
- - `type_score`, `task_specifics`, `score_std`, `source_file`
84
-
85
- Notes:
86
-
87
- - CSVs may contain `Unnamed: 0` (pandas index column); the app drops `Unnamed:*`.
88
- - If `model` is missing, it is inferred from the filename prefix `all_results_`.
89
- - Pivot behavior: one row per model, tasks become columns.
90
- - If multiple rows exist for the same `(model, task)`, the app uses `mean`.
91
- If “Latest only” is enabled and `date` exists, the latest row per
92
- `(model, task)` is used.
93
-
94
- ## Code Style Guidelines
95
-
96
- ### General
97
-
98
- - Keep changes minimal; this is a small, single-file app.
99
- - Prefer clarity over cleverness; prioritize maintainability.
100
- - Avoid adding side effects at import time (keep I/O in functions).
101
-
102
- ### Formatting
103
-
104
- - Follow PEP 8.
105
- - Prefer 88-char lines (Black default) if you introduce formatting.
106
- - Use double quotes for strings (current style in `app.py`).
107
-
108
- ### Imports
109
-
110
- - Standard library imports first, then third party (`gradio`, `pandas`).
111
- - Group imports and keep them sorted.
112
- - Avoid importing unused modules; remove dead imports.
113
-
114
- ### Types
115
-
116
- Type annotate public helpers where it adds clarity, but don’t over-annotate
117
- pandas-heavy code (pandas typing is noisy).
118
- - Prefer built-in generics (`list[str]`, `dict[str, str]`) on Python 3.9+.
119
-
120
- ### Naming
121
-
122
- - Functions: `snake_case`.
123
- - Constants: `UPPER_SNAKE_CASE`.
124
- - Local variables: descriptive (`filtered`, `combined`, `task_cols`).
125
-
126
- ### Error Handling
127
-
128
- - Fail fast with actionable messages.
129
- - Example: if no CSV files match, raise `FileNotFoundError` with hints.
130
- - When parsing/transforming data, be tolerant:
131
- - Use `errors="coerce"` for datetime/numeric conversions.
132
- - Handle empty frames gracefully (return empty tables).
133
-
134
- ### Data Processing Conventions
135
-
136
- - Keep data normalization in one place (`load_results`).
137
- - Keep pivot/aggregation logic in dedicated helpers (`pivot_results`,
138
- `summarize_overall`).
139
- - When changing pivot semantics (e.g., incorporate `type_score`), do it
140
- explicitly and keep UI in sync.
141
-
142
- ### UI Conventions (Gradio)
143
-
144
- - Keep filters at the top; leaderboard table below.
145
- - Avoid expensive recomputation inside callbacks; reuse loaded data and only
146
- derive filtered/pivoted views.
147
-
148
- ## Cursor / Copilot Rules
149
-
150
- - No Cursor rules found (`.cursor/rules/` or `.cursorrules`).
151
- - No GitHub Copilot instructions found (`.github/copilot-instructions.md`).
152
-
153
- If you add those in the future, update this file to reflect them.
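The CSV input contract in the removed AGENTS.md can be illustrated with a small, dependency-free sketch. It assumes only the pivot rules stated there: one row per model, tasks as columns, the mean over duplicate `(model, task)` rows, or the latest row by `date` when "Latest only" is enabled. The name `pivot_long_rows` and the sample rows are hypothetical; the app itself does this with pandas in `pivot_results`, whose body is not part of this diff.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def pivot_long_rows(rows, latest_only=False):
    """Turn long-form rows into {model: {task: score}} (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["model"], row["task"])].append(row)

    wide = defaultdict(dict)
    for (model, task), group in grouped.items():
        dated = [r for r in group if r.get("date")]
        if latest_only and dated:
            # Keep only the most recent row for this (model, task) pair.
            latest = max(dated, key=lambda r: datetime.fromisoformat(r["date"]))
            wide[model][task] = float(latest["score"])
        else:
            # Default rule from the contract: average duplicate rows.
            wide[model][task] = mean(float(r["score"]) for r in group)
    return dict(wide)


# Made-up sample data in the long-form layout (model, task, score, date).
rows = [
    {"model": "model-a", "task": "gsm8k", "score": 0.50, "date": "2025-01-01"},
    {"model": "model-a", "task": "gsm8k", "score": 0.70, "date": "2025-02-01"},
    {"model": "model-b", "task": "gsm8k", "score": 0.40, "date": "2025-01-15"},
]

averaged = pivot_long_rows(rows)
latest = pivot_long_rows(rows, latest_only=True)
```

With these rows, `averaged` holds the mean of the two `model-a` scores, while `latest` keeps only the February row for `model-a`.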
LICENSE.txt ADDED
@@ -0,0 +1,36 @@
1
+ ------------------------------------------------------------------------------
2
+ Copyright Notice Fraunhofer leaderboard
3
+ ------------------------------------------------------------------------------
4
+
5
+ /****************************************************************************\
6
+ * (C) Copyright Fraunhofer IIS and its partners (2025)
7
+ * All Rights Reserved
8
+ *
9
+ * Please be advised that this software and/or program delivery is
10
+ * Confidential Information of Fraunhofer and subject to and covered by the
11
+ *
12
+ * "Konsortialvereinbarung über die Zusammenarbeit im BMWE - Verbundprojekt »EU-SAI - Souveräne KI für Europa« (Soofi)"
13
+ * and its Amendments (if any) between you/your company or institution and Fraunhofer IIS
14
+ *
15
+ * You may use this software and/or program only under the terms and
16
+ * conditions described in the above mentioned Konsortialvereinbarung über die Zusammenarbeit im BMWE - Verbundprojekt
17
+ * »EU-SAI - Souveräne KI für Europa« (Soofi)«. Any other and/or further use requires a separate
18
+ * agreement.
19
+ *
20
+ * This software and/or program is protected by copyright law and
21
+ * international treaties. Any reproduction or distribution of this software
22
+ * and/or program, or any portion of it, may result in severe civil and
23
+ * criminal penalties, and will be prosecuted to the maximum extent possible
24
+ * under law.
25
+ \***************************************************************************/
26
+
27
+ ------------------------------------------------------------------------------
28
+ Third Party Legal Notices
29
+ ------------------------------------------------------------------------------
30
+ Parts of this package may include 3rd party software licensed under terms that
31
+ require Fraunhofer to display certain legal notices. Legal notices are
32
+ found in the folder ThirdPartyLegalNotices/
33
+
34
+ ------------------------------------------------------------------------------
35
+ (C) copyright 2025 by Fraunhofer IIS
36
+ ------------------------------------------------------------------------------
README.md CHANGED
@@ -16,6 +16,8 @@ created by the `soofi-eval` framework.
16
 
17
  ## Install
18
 
 
 
19
  ```bash
20
  python3 -m venv .venv
21
  source .venv/bin/activate
@@ -23,6 +25,18 @@ python -m pip install -U pip
23
  python -m pip install -e .
24
  ```
25
 
26
  ## Run
27
 
28
  By default, the app expects the CSV files to live under the `results/` directory.
 
16
 
17
  ## Install
18
 
19
+ ### Local install
20
+
21
  ```bash
22
  python3 -m venv .venv
23
  source .venv/bin/activate
 
25
  python -m pip install -e .
26
  ```
27
 
28
+ ### Hugging Face Spaces
29
+
30
+ This repo is compatible with Hugging Face Spaces (Gradio) and uses the root
31
+ `app.py` as the Space entrypoint.
32
+
33
+ On Spaces, dependencies are installed from `requirements.txt` automatically.
34
+ If you're running a similar containerized setup, you can also install with:
35
+
36
+ ```bash
37
+ python -m pip install -r requirements.txt
38
+ ```
39
+
40
  ## Run
41
 
42
  By default, the app expects the CSV files to live under the `results/` directory.
app.py CHANGED
@@ -17,6 +17,7 @@ class LeaderboardConfig:
17
  server_port: int = 7860
18
  results_dir: str = "results"
19
  csv_pattern: str = "*.csv"
 
20
  title: str = "Leaderboard"
21
  description_md: str | None = None
22
  task_groups: list[dict] | None = None
@@ -296,7 +297,8 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
296
  def apply_filters(
297
  selected_tasks: list[str],
298
  selected_models: list[str],
299
- search: str,
 
300
  ) -> tuple[pd.DataFrame, str]:
301
  # Use the global df loaded at startup instead of reloading
302
  if df.empty:
@@ -337,20 +339,17 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
337
  if selected_models and "model" in filtered.columns:
338
  filtered = filtered[filtered["model"].astype(str).isin(selected_models)]
339
 
340
- if search.strip():
341
- needle = search.strip().lower()
342
- hay_cols = [
343
- c
344
- for c in ["model", "task", "type_score", "source_file"]
345
- if c in filtered.columns
346
- ]
347
- if hay_cols:
348
- mask = False
349
- for c in hay_cols:
350
- mask = mask | filtered[c].astype(str).str.lower().str.contains(
351
- needle, na=False
352
- )
353
- filtered = filtered[mask]
354
 
355
  try:
356
  return pivot_results(filtered), ""
@@ -362,7 +361,7 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
362
  )
363
  return pd.DataFrame(), msg
364
 
365
- wide0, err0 = apply_filters([], [], "")
366
  if init_error:
367
  err0 = init_error
368
  wide0 = pd.DataFrame()
@@ -374,26 +373,31 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
374
 
375
  with gr.Column():
376
  with gr.Row():
377
- tasks = gr.Dropdown(
378
- choices=task_choices,
379
- value=[],
380
- multiselect=True,
381
- label="Tasks (columns)",
382
- )
383
- models = gr.Dropdown(
384
- choices=model_choices,
385
- value=[],
386
- multiselect=True,
387
- label="Models",
388
- )
389
-
390
- with gr.Row():
391
- search = gr.Textbox(label="Search", placeholder="model/task/type…")
 
 
392
 
393
  with gr.Row():
394
  apply_btn = gr.Button("✅ Apply")
395
  reset_btn = gr.Button("🧹 Reset Filters")
396
- reload_btn = gr.Button("📊 Reload Data")
 
 
 
397
 
398
  presets = load_button_presets(config)
399
  preset_buttons: list[gr.Button] = []
@@ -416,61 +420,73 @@ def build_ui(config: LeaderboardConfig) -> gr.Blocks:
416
  wrap=False,
417
  max_height=760,
418
  visible=len(df) > 0,
 
419
  )
420
 
421
  def _apply_with_tasks(
422
- preset: list[str], selected_models, q: str
 
 
 
423
  ) -> tuple[list[str], pd.DataFrame, dict]:
424
  new_tasks = _set_tasks(preset)
425
- table, err = apply_filters(new_tasks, selected_models, q)
426
- return new_tasks, table, gr.update(value=err, visible=bool(err))
 
 
427
 
428
- def _apply_reset(
429
- selected_models, q: str
430
- ) -> tuple[list[str], pd.DataFrame, dict]:
431
- new_tasks: list[str] = []
432
- table, err = apply_filters(new_tasks, selected_models, q)
433
- return new_tasks, table, gr.update(value=err, visible=bool(err))
434
 
435
  for preset, btn in zip(presets, preset_buttons):
436
  preset_tasks = preset.get("tasks", [])
437
 
438
  btn.click(
439
- fn=lambda selected_models, q, preset_tasks=preset_tasks: _apply_with_tasks(
440
  preset_tasks,
441
  selected_models,
442
- q,
 
443
  ),
444
- inputs=[models, search],
445
- outputs=[tasks, wide_df, error_md],
446
  )
447
 
448
- def _apply_from_controls(selected_tasks, selected_models, q: str):
449
- table, err = apply_filters(selected_tasks, selected_models, q)
 
 
 
 
450
  return table, gr.update(value=err, visible=bool(err))
451
 
452
  apply_btn.click(
453
  fn=_apply_from_controls,
454
- inputs=[tasks, models, search],
455
  outputs=[wide_df, error_md],
456
  )
457
  reset_btn.click(
458
  fn=_apply_reset,
459
- inputs=[models, search],
460
- outputs=[tasks, wide_df, error_md],
461
  )
462
 
463
- def _reload_data(selected_tasks, selected_models, q: str):
 
 
464
  nonlocal df
465
  df = load_results(config)
466
- table, err = apply_filters(selected_tasks, selected_models, q)
 
 
467
  return table, gr.update(value=err, visible=bool(err))
468
 
469
- reload_btn.click(
470
- fn=_reload_data,
471
- inputs=[tasks, models, search],
472
- outputs=[wide_df, error_md],
473
- )
 
474
 
475
  return demo
476
 
 
17
  server_port: int = 7860
18
  results_dir: str = "results"
19
  csv_pattern: str = "*.csv"
20
+ enable_data_reload: bool = True
21
  title: str = "Leaderboard"
22
  description_md: str | None = None
23
  task_groups: list[dict] | None = None
 
297
  def apply_filters(
298
  selected_tasks: list[str],
299
  selected_models: list[str],
300
+ search_task: str,
301
+ search_model: str,
302
  ) -> tuple[pd.DataFrame, str]:
303
  # Use the global df loaded at startup instead of reloading
304
  if df.empty:
 
339
  if selected_models and "model" in filtered.columns:
340
  filtered = filtered[filtered["model"].astype(str).isin(selected_models)]
341
 
342
+ for search, col in [(search_task, "task"), (search_model, "model")]:
343
+ if search.strip():
344
+ needle = search.strip().lower()
345
+ hay_cols = [col] if col in filtered.columns else []
346
+ if hay_cols:
347
+ mask = False
348
+ for c in hay_cols:
349
+ mask = mask | filtered[c].astype(str).str.lower().str.contains(
350
+ needle, na=False
351
+ )
352
+ filtered = filtered[mask]
 
 
 
353
 
354
  try:
355
  return pivot_results(filtered), ""
 
361
  )
362
  return pd.DataFrame(), msg
363
 
364
+ wide0, err0 = apply_filters([], [], "", "")
365
  if init_error:
366
  err0 = init_error
367
  wide0 = pd.DataFrame()
 
373
 
374
  with gr.Column():
375
  with gr.Row():
376
+ with gr.Column():
377
+ tasks = gr.Dropdown(
378
+ choices=task_choices,
379
+ value=[],
380
+ multiselect=True,
381
+ label="Tasks",
382
+ )
383
+ search_task = gr.Textbox(label="Search task", placeholder="task regex")
384
+
385
+ with gr.Column():
386
+ models = gr.Dropdown(
387
+ choices=model_choices,
388
+ value=[],
389
+ multiselect=True,
390
+ label="Models",
391
+ )
392
+ search_model = gr.Textbox(label="Search model", placeholder="model regex")
393
 
394
  with gr.Row():
395
  apply_btn = gr.Button("✅ Apply")
396
  reset_btn = gr.Button("🧹 Reset Filters")
397
+ if config.enable_data_reload:
398
+ reload_btn = gr.Button("📊 Reload Data")
399
+ else:
400
+ reload_btn = None
401
 
402
  presets = load_button_presets(config)
403
  preset_buttons: list[gr.Button] = []
 
420
  wrap=False,
421
  max_height=760,
422
  visible=len(df) > 0,
423
+ pinned_columns=2,
424
  )
425
 
426
  def _apply_with_tasks(
427
+ preset: list[str],
428
+ selected_models: list[str],
429
+ searched_tasks: str,
430
+ searched_models: str,
431
  ) -> tuple[list[str], pd.DataFrame, dict]:
432
  new_tasks = _set_tasks(preset)
433
+ table, err = apply_filters(
434
+ new_tasks, selected_models, searched_tasks, searched_models
435
+ )
436
+ return new_tasks, "", table, gr.update(value=err, visible=bool(err))
437
 
438
+ def _apply_reset() -> tuple[list[str], list[str], str, pd.DataFrame, dict]:
439
+ table, err = apply_filters([], [], "", "")
440
+ return [], [], "", "", table, gr.update(value=err, visible=bool(err))
 
 
 
441
 
442
  for preset, btn in zip(presets, preset_buttons):
443
  preset_tasks = preset.get("tasks", [])
444
 
445
  btn.click(
446
+ fn=lambda selected_models, sm, preset_tasks=preset_tasks: _apply_with_tasks(
447
  preset_tasks,
448
  selected_models,
449
+ "",
450
+ sm,
451
  ),
452
+ inputs=[models, search_model],
453
+ outputs=[tasks, search_task, wide_df, error_md],
454
  )
455
 
456
+ def _apply_from_controls(
457
+ selected_tasks, selected_models, searched_tasks: str, searched_models: str
458
+ ):
459
+ table, err = apply_filters(
460
+ selected_tasks, selected_models, searched_tasks, searched_models
461
+ )
462
  return table, gr.update(value=err, visible=bool(err))
463
 
464
  apply_btn.click(
465
  fn=_apply_from_controls,
466
+ inputs=[tasks, models, search_task, search_model],
467
  outputs=[wide_df, error_md],
468
  )
469
  reset_btn.click(
470
  fn=_apply_reset,
471
+ outputs=[tasks, models, search_task, search_model, wide_df, error_md],
 
472
  )
473
 
474
+ def _reload_data(
475
+ selected_tasks, selected_models, searched_tasks: str, searched_models: str
476
+ ):
477
  nonlocal df
478
  df = load_results(config)
479
+ table, err = apply_filters(
480
+ selected_tasks, selected_models, searched_tasks, searched_models
481
+ )
482
  return table, gr.update(value=err, visible=bool(err))
483
 
484
+ if reload_btn:
485
+ reload_btn.click(
486
+ fn=_reload_data,
487
+ inputs=[tasks, models, search_task, search_model],
488
+ outputs=[wide_df, error_md],
489
+ )
490
 
491
  return demo
492
 
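The per-column search introduced in `apply_filters` can be read as follows: each non-empty search string is stripped, lowercased, and matched against the lowercased values of exactly one column (`task` or `model`). Below is a minimal sketch without pandas. It assumes regex matching, since pandas `Series.str.contains` treats its pattern as a regular expression by default, and it assumes every row carries both columns. `filter_rows` and the sample rows are hypothetical and not part of `app.py`.

```python
import re


def filter_rows(rows, search_task="", search_model=""):
    """Keep rows whose task and model match the given patterns (hypothetical)."""
    filtered = list(rows)
    for search, col in [(search_task, "task"), (search_model, "model")]:
        needle = search.strip().lower()
        if not needle:
            continue  # An empty search box does not restrict anything.
        pattern = re.compile(needle)
        # Compare against lowercased cell values, as the diff does.
        filtered = [
            row for row in filtered if pattern.search(str(row[col]).lower())
        ]
    return filtered


# Made-up rows; the task names are taken from config.yaml.
rows = [
    {"model": "model-a-7b", "task": "gsm8k"},
    {"model": "model-a-7b", "task": "ogx_gsm8kx_de"},
    {"model": "model-b-70b", "task": "hellaswag"},
]

german_math = filter_rows(rows, search_task="gsm8kx_(de|fr)")
large_models = filter_rows(rows, search_model="70B")
```

One consequence of lowercasing the pattern itself: case-sensitive regex tokens such as `\S` or `\D` become `\s` and `\d`, so patterns that rely on them would match differently than intended.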
config.yaml CHANGED
@@ -1,25 +1,123 @@
1
  results_dir: results
2
  csv_pattern: "*.csv"
 
3
 
4
- title: "SOOFI Leaderboard"
5
  # Optional markdown shown under the title.
6
  description_md: |
7
  - Use the task group buttons for quick presets.
8
- - Use 📊 Reload Data after dropping in new CSVs.
9
 
10
  # Optional task groups that results in buttons filtering tasks on the UI
11
  task_groups:
12
- - id: base
13
- label: "🧱 Base"
14
- tasks: []
15
  - id: instruct
16
  label: "💬 Instruct"
17
- tasks: []
18
  - id: reasoning
19
  label: "🧠 Reasoning"
20
  tasks:
21
- - aime24:chat_template
22
- - aime25:chat_template
23
 
24
  server_name: "0.0.0.0"
25
  server_port: 7860
 
1
  results_dir: results
2
  csv_pattern: "*.csv"
3
+ enable_data_reload: false
4
 
5
+ title: "Leaderboard"
6
  # Optional markdown shown under the title.
7
  description_md: |
8
  - Use the task group buttons for quick presets.
 
9
 
10
  # Optional task groups that results in buttons filtering tasks on the UI
11
  task_groups:
 
 
 
12
  - id: instruct
13
  label: "💬 Instruct"
14
+ tasks:
15
+ - hellaswag
16
+ - arc_challenge
17
+ - truthfulqa_mc2
18
+ - mmlu
19
+ - global_piqa_completions_eng_latn
20
+ - winogrande
21
+ - openbookqa
22
+ - leaderboard_bbh
23
+ - ogx_hellaswagx_de
24
+ - ogx_hellaswagx_fr
25
+ - ogx_hellaswagx_es
26
+ - ogx_hellaswagx_it
27
+ - ogx_arcx_challenge_de
28
+ - ogx_arcx_challenge_fr
29
+ - ogx_arcx_challenge_es
30
+ - ogx_arcx_challenge_it
31
+ - ogx_truthfulqax_mc2_de
32
+ - ogx_truthfulqax_mc2_fr
33
+ - ogx_truthfulqax_mc2_es
34
+ - ogx_truthfulqax_mc2_it
35
+ - global_mmlu_de
36
+ - global_mmlu_fr
37
+ - global_mmlu_es
38
+ - global_mmlu_it
39
+ - global_piqa_completions_deu_latn
40
+ - global_piqa_completions_fra_latn_fran
41
+ - global_piqa_completions_ita_latn
42
+ - global_piqa_completions_spa_latn_spai
43
+ - leaderboard_ifeval
44
+ - ifbench
45
+ - ifeval_DE
46
+ - ifbench_DE
47
+ - ifbench_FR
48
+ - ifbench_IT
49
+ - ifbench_ES
50
+ - gsm_plus
51
+ - gsm8k
52
+ - ogx_gsm8kx_de
53
+ - ogx_gsm8kx_fr
54
+ - ogx_gsm8kx_es
55
+ - ogx_gsm8kx_it
56
  - id: reasoning
57
  label: "🧠 Reasoning"
58
  tasks:
59
+ - aime24
60
+ - aime25
61
+ - gpqa_diamond
62
+ - id: EN
63
+ label: "🇬🇧"
64
+ tasks:
65
+ - hellaswag
66
+ - arc_challenge
67
+ - truthfulqa_mc2
68
+ - mmlu
69
+ - global_piqa_completions_eng_latn
70
+ - winogrande
71
+ - openbookqa
72
+ - leaderboard_bbh
73
+ - leaderboard_ifeval
74
+ - ifbench
75
+ - gsm_plus
76
+ - gsm8k
77
+ - aime24
78
+ - aime25
79
+ - gpqa_diamond
80
+ - id: DE
81
+ label: "🇩🇪"
82
+ tasks:
83
+ - ogx_hellaswagx_de
84
+ - ogx_arcx_challenge_de
85
+ - ogx_truthfulqax_mc2_de
86
+ - ogx_gsm8kx_de
87
+ - global_mmlu_de
88
+ - global_piqa_completions_deu_latn
89
+ - ifeval_DE
90
+ - ifbench_DE
91
+ - id: FR
92
+ label: "🇫🇷"
93
+ tasks:
94
+ - ogx_hellaswagx_fr
95
+ - ogx_arcx_challenge_fr
96
+ - ogx_truthfulqax_mc2_fr
97
+ - ogx_gsm8kx_fr
98
+ - global_mmlu_fr
99
+ - global_piqa_completions_fra_latn_fran
100
+ - ifbench_FR
101
+ - id: IT
102
+ label: "🇮🇹"
103
+ tasks:
104
+ - ogx_hellaswagx_it
105
+ - ogx_arcx_challenge_it
106
+ - ogx_truthfulqax_mc2_it
107
+ - ogx_gsm8kx_it
108
+ - global_mmlu_it
109
+ - global_piqa_completions_ita_latn
110
+ - ifbench_IT
111
+ - id: ES
112
+ label: "🇪🇸"
113
+ tasks:
114
+ - ogx_hellaswagx_es
115
+ - ogx_arcx_challenge_es
116
+ - ogx_truthfulqax_mc2_es
117
+ - ogx_gsm8kx_es
118
+ - global_mmlu_es
119
+ - global_piqa_completions_spa_latn_spai
120
+ - ifbench_ES
121
 
122
  server_name: "0.0.0.0"
123
  server_port: 7860
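To make the shape of the new `task_groups` section concrete, here is a rough sketch of how a preset button could resolve its task list once the YAML has been parsed into a list of dictionaries (consistent with `task_groups: list[dict] | None` in `LeaderboardConfig`). The helper `tasks_for_group` is hypothetical, and the data below is an abbreviated, hand-copied subset of the config; how `load_button_presets` actually works is not shown in this diff.

```python
def tasks_for_group(task_groups, group_id):
    """Return the task list of the group with the given id (hypothetical helper)."""
    for group in task_groups or []:
        if group.get("id") == group_id:
            # Fall back to an empty list when `tasks` is missing or empty.
            return list(group.get("tasks") or [])
    return []


# Abbreviated copy of the parsed `task_groups` section (labels simplified).
task_groups = [
    {
        "id": "reasoning",
        "label": "Reasoning",
        "tasks": ["aime24", "aime25", "gpqa_diamond"],
    },
    {
        "id": "DE",
        "label": "DE",
        "tasks": ["ogx_gsm8kx_de", "global_mmlu_de"],
    },
]

reasoning_tasks = tasks_for_group(task_groups, "reasoning")
removed_group_tasks = tasks_for_group(task_groups, "base")  # group dropped in this commit
```

An unknown id, such as the removed `base` group, yields an empty list in this sketch rather than an error.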