brycemeetkai committed
Commit caf09eb · verified · 1 Parent(s): a540a5c

Mirror evals/ from c1978f83e59e

evals/albanian/README.md CHANGED
@@ -7,24 +7,21 @@ Albanian (Tosk, `als_Latn` / macro `sq`) evaluation suite for the
7
 
8
  ### Custom Tasks (require `--include_path`)
9
 
10
- | # | Task Name | Category | Dataset (HuggingFace) | Metric |
11
- | --- | ---------------------------- | --------------- | --------------------------------------------------------------------------------------- | ------------------ |
12
- | 1 | `albanian_sib200` | Classification | `Davlan/sib200` (`als_Latn`) | f1_macro |
13
- | 2 | `albanian_belebele` | MCQ | `facebook/belebele` (`als_Latn`) | f1_macro |
14
- | 3 | `albanian_global_mmlu` | MCQ | `CohereLabs/Global-MMLU-Lite` (`sq`, v2) | f1_macro |
15
- | 4 | `albanian_massivesumm_short` | Summarization | `MaLA-LM/MassiveSumm_short` (filtered `language=sqi`) | rouge_l |
16
- | 5 | `albanian_massivesumm_long` | Summarization | `MaLA-LM/MassiveSumm_long` (filtered `language=sqi`) | rouge_l |
17
- | 6 | `albanian_aya` | Open generation | `CohereLabs/aya_evaluation_suite` (`dolly_machine_translated`, filtered `language=sqi`) | llm_judge_score |
18
- | 7 | `albanian_polywrite` | Open generation | `MaLA-LM/PolyWrite` (filtered `lang_script=sqi_Latn`) | open_quality_score |
19
 
20
  #### Subgroups
21
 
22
- | Group | Tasks |
23
- | -------------------------- | ----------------------------------- |
24
- | `albanian_classification` | sib200 |
25
- | `albanian_mcq` | belebele, global_mmlu |
26
- | `albanian_summarization` | massivesumm_short, massivesumm_long |
27
- | `albanian_open_generation` | aya, polywrite |
28
 
29
  ## Setup
30
 
@@ -40,7 +37,7 @@ All commands must be run from the `multilingual_bench/` directory:
40
  cd /path/to/functionary_internal/evaluation/multilingual_bench
41
  ```
42
 
43
- ### Run the Entire Albanian Suite (all 7 tasks)
44
 
45
  ```bash
46
  OPENAI_API_KEY="$OPENROUTER_API_KEY" \
@@ -68,7 +65,6 @@ python run_eval.py --models gpt-5-mini --tasks albanian
68
  ```bash
69
  lm_eval --include_path lm_eval_tasks --tasks albanian_classification ...
70
  lm_eval --include_path lm_eval_tasks --tasks albanian_mcq ...
71
- lm_eval --include_path lm_eval_tasks --tasks albanian_summarization ...
72
  lm_eval --include_path lm_eval_tasks --tasks albanian_open_generation ...
73
  ```
74
 
@@ -78,8 +74,6 @@ lm_eval --include_path lm_eval_tasks --tasks albanian_open_generation ...
78
  lm_eval --include_path lm_eval_tasks --tasks albanian_sib200 ...
79
  lm_eval --include_path lm_eval_tasks --tasks albanian_belebele ...
80
  lm_eval --include_path lm_eval_tasks --tasks albanian_global_mmlu ...
81
- lm_eval --include_path lm_eval_tasks --tasks albanian_massivesumm_short ...
82
- lm_eval --include_path lm_eval_tasks --tasks albanian_massivesumm_long ...
83
  lm_eval --include_path lm_eval_tasks --tasks albanian_aya ...
84
  lm_eval --include_path lm_eval_tasks --tasks albanian_polywrite ...
85
  ```
@@ -93,22 +87,19 @@ With `--log_samples`, the output directory contains:
93
 
94
  ## Dataset Sources
95
 
96
- | Dataset | Source | Config | Notes |
97
- | ----------------- | --------------------------------- | -------------------------------------------------- | -------------------------------------------------------------------- |
98
- | SIB-200 | `Davlan/sib200` | `als_Latn` | text + ClassLabel `category` (7 topics) |
99
- | Belebele | `facebook/belebele` | `als_Latn` | flores_passage + question + 4 mc_answers, `correct_answer_num` 1-4 |
100
- | Global-MMLU-Lite | `CohereLabs/Global-MMLU-Lite` | `sq` | question + `option_a..d` + `answer` letter (400 samples, CS+CA) |
101
- | MassiveSumm short | `MaLA-LM/MassiveSumm_short` | (filter `language=sqi`) | `text`, `summary`, `language`; gated |
102
- | MassiveSumm long | `MaLA-LM/MassiveSumm_long` | — (filter `language=sqi`) | same schema; longer articles |
103
- | Aya Eval | `CohereLabs/aya_evaluation_suite` | `dolly_machine_translated` (filter `language=sqi`) | `inputs`, `targets`, `language`, `script` |
104
- | PolyWrite | `MaLA-LM/PolyWrite` | — (filter `lang_script=sqi_Latn`) | `prompt_translated`, `category`, `lang_script` (no reference answer) |
105
 
106
  ### Gated datasets
107
 
108
- Several upstream datasets are gated on Hugging Face. Accept the terms (once) and export an HF token before running:
109
 
110
  - Aya Eval: <https://huggingface.co/datasets/CohereLabs/aya_evaluation_suite>
111
- - MassiveSumm short / long: <https://huggingface.co/datasets/MaLA-LM/MassiveSumm_short> and <https://huggingface.co/datasets/MaLA-LM/MassiveSumm_long>
112
 
113
  ```bash
114
  export HF_TOKEN="hf_..."
 
7
 
8
  ### Custom Tasks (require `--include_path`)
9
 
10
+ | # | Task Name | Category | Dataset (HuggingFace) | Metric |
11
+ | --- | ---------------------- | --------------- | --------------------------------------------------------------------------------------- | ------------------ |
12
+ | 1 | `albanian_sib200` | Classification | `Davlan/sib200` (`als_Latn`) | f1_macro |
13
+ | 2 | `albanian_belebele` | MCQ | `facebook/belebele` (`als_Latn`) | f1_macro |
14
+ | 3 | `albanian_global_mmlu` | MCQ | `CohereLabs/Global-MMLU-Lite` (`sq`, v2) | f1_macro |
15
+ | 4 | `albanian_aya` | Open generation | `CohereLabs/aya_evaluation_suite` (`dolly_machine_translated`, filtered `language=sqi`) | llm_judge_score |
16
+ | 5 | `albanian_polywrite` | Open generation | `MaLA-LM/PolyWrite` (filtered `lang_script=sqi_Latn`) | open_quality_score |
 
 
17
 
18
  #### Subgroups
19
 
20
+ | Group | Tasks |
21
+ | -------------------------- | --------------------- |
22
+ | `albanian_classification` | sib200 |
23
+ | `albanian_mcq` | belebele, global_mmlu |
24
+ | `albanian_open_generation` | aya, polywrite |
 
25
 
26
  ## Setup
27
 
 
37
  cd /path/to/functionary_internal/evaluation/multilingual_bench
38
  ```
39
 
40
+ ### Run the Entire Albanian Suite (all 5 tasks)
41
 
42
  ```bash
43
  OPENAI_API_KEY="$OPENROUTER_API_KEY" \
 
65
  ```bash
66
  lm_eval --include_path lm_eval_tasks --tasks albanian_classification ...
67
  lm_eval --include_path lm_eval_tasks --tasks albanian_mcq ...
 
68
  lm_eval --include_path lm_eval_tasks --tasks albanian_open_generation ...
69
  ```
70
 
 
74
  lm_eval --include_path lm_eval_tasks --tasks albanian_sib200 ...
75
  lm_eval --include_path lm_eval_tasks --tasks albanian_belebele ...
76
  lm_eval --include_path lm_eval_tasks --tasks albanian_global_mmlu ...
 
 
77
  lm_eval --include_path lm_eval_tasks --tasks albanian_aya ...
78
  lm_eval --include_path lm_eval_tasks --tasks albanian_polywrite ...
79
  ```
 
87
 
88
  ## Dataset Sources
89
 
90
+ | Dataset | Source | Config | Notes |
91
+ | ---------------- | --------------------------------- | -------------------------------------------------- | -------------------------------------------------------------------- |
92
+ | SIB-200 | `Davlan/sib200` | `als_Latn` | text + ClassLabel `category` (7 topics) |
93
+ | Belebele | `facebook/belebele` | `als_Latn` | flores_passage + question + 4 mc_answers, `correct_answer_num` 1-4 |
94
+ | Global-MMLU-Lite | `CohereLabs/Global-MMLU-Lite` | `sq` | question + `option_a..d` + `answer` letter (400 samples, CS+CA) |
95
+ | Aya Eval | `CohereLabs/aya_evaluation_suite` | `dolly_machine_translated` (filter `language=sqi`) | `inputs`, `targets`, `language`, `script` |
96
+ | PolyWrite | `MaLA-LM/PolyWrite` | — (filter `lang_script=sqi_Latn`) | `prompt_translated`, `category`, `lang_script` (no reference answer) |
 
 
97
 
98
  ### Gated datasets
99
 
100
+ The Aya Eval dataset is gated on Hugging Face. Accept the terms (once) and export an HF token before running:
101
 
102
  - Aya Eval: <https://huggingface.co/datasets/CohereLabs/aya_evaluation_suite>
 
103
 
104
  ```bash
105
  export HF_TOKEN="hf_..."
evals/albanian/albanian.yaml CHANGED
@@ -9,13 +9,11 @@
9
  #
10
  # Metrics:
11
  # classification & mcq → f1_macro (per sub-group)
12
- # summarization → rouge_l (per sub-group)
13
  # open_generation → llm_judge_score / open_quality_score (per sub-group)
14
  group: albanian
15
  task:
16
  - albanian_classification
17
  - albanian_mcq
18
- - albanian_summarization
19
  - albanian_open_generation
20
  metadata:
21
  version: 1.0
 
9
  #
10
  # Metrics:
11
  # classification & mcq → f1_macro (per sub-group)
 
12
  # open_generation → llm_judge_score / open_quality_score (per sub-group)
13
  group: albanian
14
  task:
15
  - albanian_classification
16
  - albanian_mcq
 
17
  - albanian_open_generation
18
  metadata:
19
  version: 1.0
evals/albanian/summarization/_default_summarization_yaml DELETED
@@ -1,18 +0,0 @@
1
- # Shared config for Albanian summarization tasks (MassiveSumm).
2
- # The Albanian-only filter is applied in process_docs (the upstream
3
- # datasets are highly multilingual single-table dumps, no per-language
4
- # config). Scoring is sentence-level ROUGE-L F1.
5
- output_type: generate_until
6
- test_split: train
7
- generation_kwargs:
8
- do_sample: false
9
- max_gen_toks: 8192
10
- until:
11
- - "<|endoftext|>"
12
- process_results: !function utils.process_results
13
- metric_list:
14
- - metric: rouge_l
15
- aggregation: !function utils.rouge_l_agg
16
- higher_is_better: true
17
- metadata:
18
- version: 1.0
evals/albanian/summarization/albanian_massivesumm_long.yaml DELETED
@@ -1,14 +0,0 @@
1
- task: albanian_massivesumm_long
2
- task_alias: massivesumm_long
3
- # Gated dataset: accept terms at
4
- # https://huggingface.co/datasets/MaLA-LM/MassiveSumm_long and export HF_TOKEN.
5
- dataset_path: MaLA-LM/MassiveSumm_long
6
- include: _default_summarization_yaml
7
- process_docs: !function utils.process_docs
8
- generation_kwargs:
9
- do_sample: false
10
- max_gen_toks: 8192
11
- until:
12
- - "<|endoftext|>"
13
- doc_to_text: "Ti je një sistem përmbledhjeje lajmesh.\nPërmblidh artikullin e mëposhtëm në një paragraf të shkurtër (3-5 fjali) në gjuhën shqipe. Mos shto komente.\n\nArtikulli:\n{{text}}\n\nPërmbledhja:"
14
- doc_to_target: "{{summary}}"
evals/albanian/summarization/albanian_massivesumm_short.yaml DELETED
@@ -1,9 +0,0 @@
1
- task: albanian_massivesumm_short
2
- task_alias: massivesumm_short
3
- # Gated dataset: accept terms at
4
- # https://huggingface.co/datasets/MaLA-LM/MassiveSumm_short and export HF_TOKEN.
5
- dataset_path: MaLA-LM/MassiveSumm_short
6
- include: _default_summarization_yaml
7
- process_docs: !function utils.process_docs
8
- doc_to_text: "Ti je një sistem përmbledhjeje lajmesh.\nPërmblidh artikullin e mëposhtëm në një ose dy fjali të shkurtra në gjuhën shqipe. Mos shto komente.\n\nArtikulli:\n{{text}}\n\nPërmbledhja:"
9
- doc_to_target: "{{summary}}"
evals/albanian/summarization/albanian_summarization.yaml DELETED
@@ -1,11 +0,0 @@
1
- # Summarization subgroup (MassiveSumm short + long)
2
- group: albanian_summarization
3
- task:
4
- - albanian_massivesumm_short
5
- - albanian_massivesumm_long
6
- aggregate_metric_list:
7
- - metric: rouge_l
8
- aggregation: mean
9
- weight_by_size: true
10
- metadata:
11
- version: 1.0
evals/albanian/summarization/utils.py DELETED
@@ -1,111 +0,0 @@
1
- """Utility helpers for Albanian summarization tasks (MassiveSumm short/long).
2
-
3
- Both MassiveSumm subsets are highly multilingual single-table datasets
4
- (one ``train`` split, no language configs). We filter to Albanian rows
5
- inside ``process_docs``. The HF dataset is **gated** — accept the terms
6
- on the dataset page once and export ``HF_TOKEN`` before running.
7
-
8
- Scoring uses ROUGE-L F1 via the ``rouge_score`` package, which is
9
- already a transitive dependency of lm-evaluation-harness.
10
- """
11
-
12
- from __future__ import annotations
13
-
14
- import re
15
- import string
16
-
17
- import datasets
18
-
19
-
20
- _ALBANIAN_LANG_CODES = {"sqi", "als", "aln"}
21
-
22
-
23
- def _strip_think_tags(text: str) -> str:
24
- """Strip <think>...</think> reasoning wrapper (e.g. Qwen thinking models)."""
25
- if "</think>" in text:
26
- return text.split("</think>")[-1].strip()
27
- return text
28
-
29
-
30
- def _filter_albanian(dataset: datasets.Dataset) -> datasets.Dataset:
31
- """Keep rows whose ``language`` field is one of Albanian variants."""
32
- if "language" not in dataset.column_names:
33
- return dataset
34
- return dataset.filter(lambda row: str(row.get("language", "")).lower() in _ALBANIAN_LANG_CODES)
35
-
36
-
37
- def _normalise_doc(doc):
38
- """Project the columns we actually need."""
39
- text = (doc.get("text") or "").strip()
40
- summary = (doc.get("summary") or "").strip()
41
- title = (doc.get("title") or "").strip()
42
- return {
43
- "text": text,
44
- "summary": summary,
45
- "title": title,
46
- }
47
-
48
-
49
- def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
50
- filtered = _filter_albanian(dataset)
51
- return filtered.map(_normalise_doc, remove_columns=[
52
- c for c in filtered.column_names if c not in ("text", "summary", "title")
53
- ])
54
-
55
-
56
- # ── ROUGE-L scoring ──────────────────────────────────────────────────
57
-
58
-
59
- _PUNCT_TABLE = str.maketrans("", "", string.punctuation)
60
- _WHITESPACE_RE = re.compile(r"\s+")
61
-
62
-
63
- def _normalise(text: str) -> str:
64
- text = text.translate(_PUNCT_TABLE)
65
- text = _WHITESPACE_RE.sub(" ", text)
66
- return text.strip().lower()
67
-
68
-
69
- def _lcs_length(a, b):
70
- m, n = len(a), len(b)
71
- if m == 0 or n == 0:
72
- return 0
73
- dp = [0] * (n + 1)
74
- for i in range(1, m + 1):
75
- prev = 0
76
- for j in range(1, n + 1):
77
- tmp = dp[j]
78
- if a[i - 1] == b[j - 1]:
79
- dp[j] = prev + 1
80
- else:
81
- dp[j] = max(dp[j], dp[j - 1])
82
- prev = tmp
83
- return dp[n]
84
-
85
-
86
- def _rouge_l_f1(pred: str, gold: str) -> float:
87
- """Compute sentence-level ROUGE-L F1 (no stemming) between pred and gold."""
88
- pred_tokens = _normalise(pred).split()
89
- gold_tokens = _normalise(gold).split()
90
- if not pred_tokens or not gold_tokens:
91
- return 0.0
92
- lcs = _lcs_length(pred_tokens, gold_tokens)
93
- if lcs == 0:
94
- return 0.0
95
- precision = lcs / len(pred_tokens)
96
- recall = lcs / len(gold_tokens)
97
- return 2 * precision * recall / (precision + recall)
98
-
99
-
100
- def process_results(doc, results):
101
- raw_response = results[0].strip() if results and results[0] else ""
102
- pred = _strip_think_tags(raw_response)
103
- gold = (doc.get("summary") or "").strip()
104
- return {"rouge_l": (gold, pred)}
105
-
106
-
107
- def rouge_l_agg(items):
108
- if not items:
109
- return 0.0
110
- scores = [_rouge_l_f1(pred, gold) for gold, pred in items]
111
- return sum(scores) / len(scores)
evals/eval_config.toml CHANGED
@@ -29,6 +29,17 @@ apply_chat_template = true
29
  log_samples = true
30
  output_path = "output/results"
31
 
 
 
 
 
 
 
 
 
 
 
 
32
  # ── Hugging Face Hub ─────────────────────────────────────────────────
33
  # Token: export HF_TOKEN="hf_..." (https://huggingface.co/settings/tokens)
34
  #
@@ -161,9 +172,6 @@ name = "albanian_classification"
161
  [[tasks]]
162
  name = "albanian_mcq"
163
 
164
- [[tasks]]
165
- name = "albanian_summarization"
166
-
167
  [[tasks]]
168
  name = "albanian_open_generation"
169
 
@@ -174,9 +182,6 @@ name = "portuguese_mcq"
174
  [[tasks]]
175
  name = "portuguese_classification"
176
 
177
- [[tasks]]
178
- name = "portuguese_nli"
179
-
180
  # Ukrainian
181
  [[tasks]]
182
  name = "ukrainian_classification"
@@ -187,9 +192,6 @@ name = "ukrainian_mcq"
187
  [[tasks]]
188
  name = "ukrainian_qa"
189
 
190
- [[tasks]]
191
- name = "ukrainian_summarization"
192
-
193
  [[tasks]]
194
  name = "ukrainian_open_generation"
195
 
 
29
  log_samples = true
30
  output_path = "output/results"
31
 
32
+ # Resilience knobs forwarded to lm_eval's TemplateAPI (local-chat-completions):
33
+ # max_retries → tenacity stop_after_attempt(N) per request.
34
+ # timeout → aiohttp ClientTimeout(total=SEC) per request.
35
+ # Default lm_eval values (3 / 300) are too low for long CoT tasks against an
36
+ # overloaded SGLang / Functionary endpoint — slow tail requests or server
37
+ # hiccups (ServerDisconnectedError, ConnectionReset, Cloudflare 524) exhaust
38
+ # retries and abort the run. Mirrors functionary_internal's working config:
39
+ # https://github.com/MeetKai/functionary_internal/blob/main/functionary_internal/evaluation/multilingual_bench/lm_eval_tasks/eval_config.toml
40
+ max_retries = 5
41
+ timeout = 1800
42
+
43
  # ── Hugging Face Hub ─────────────────────────────────────────────────
44
  # Token: export HF_TOKEN="hf_..." (https://huggingface.co/settings/tokens)
45
  #
 
172
  [[tasks]]
173
  name = "albanian_mcq"
174
 
 
 
 
175
  [[tasks]]
176
  name = "albanian_open_generation"
177
 
 
182
  [[tasks]]
183
  name = "portuguese_classification"
184
 
 
 
 
185
  # Ukrainian
186
  [[tasks]]
187
  name = "ukrainian_classification"
 
192
  [[tasks]]
193
  name = "ukrainian_qa"
194
 
 
 
 
195
  [[tasks]]
196
  name = "ukrainian_open_generation"
197
 
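Review note on the `max_retries` knob added above: the TOML comment says it maps to an attempt cap per request. The sketch below is a standard-library illustration of what such a cap means in practice; it is not lm_eval's implementation (which, per the comment, relies on tenacity and aiohttp). `call_with_retries` and `FlakyEndpoint` are hypothetical names invented for this example.

```python
def call_with_retries(request, max_attempts):
    """Call request() until it succeeds or max_attempts calls have failed."""
    last_error = None
    for _attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ConnectionError as exc:  # stand-in for transient network failures
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


class FlakyEndpoint:
    """Hypothetical endpoint that fails a fixed number of times, then answers."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("server disconnected")
        return "ok"


# Four transient failures in a row: a cap of 5 attempts still gets an answer.
flaky = FlakyEndpoint(failures=4)
result_with_five = call_with_retries(flaky, max_attempts=5)
```

With the same four failures and a cap of 3, the call raises instead of returning, which is the abort-the-run behaviour the TOML comment describes for the stock setting.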
evals/portuguese/README.md CHANGED
@@ -12,17 +12,13 @@ Portuguese (PT-BR) evaluation suite for the `lm-evaluation-harness` framework.
12
  | 4 | `portuguese_hatebr` | Classification | `eduagarcia/portuguese_benchmark` (HateBR) | f1_macro |
13
  | 5 | `portuguese_hate_speech` | Classification | `eduagarcia/portuguese_benchmark` (Hate Speech) | f1_macro |
14
  | 6 | `portuguese_tweetsentbr` | Classification | `eduagarcia/tweetsentbr_fewshot` | f1_macro |
15
- | 7 | `portuguese_assin2_rte` | NLI | `assin2` | f1_macro |
16
- | 8 | `portuguese_faquad_nli` | NLI | `ruanchaves/faquad-nli` | f1_macro |
17
- | 9 | `portuguese_assin2_sts` | NLI | `assin2` | pearson |
18
 
19
  ### Subgroups
20
 
21
- | Group | Tasks |
22
- | --------------------------- | ---------------------------------- |
23
- | `portuguese_mcq` | enem, bluex, oab_exams |
24
- | `portuguese_classification` | hatebr, hate_speech, tweetsentbr |
25
- | `portuguese_nli` | assin2_rte, faquad_nli, assin2_sts |
26
 
27
  ## Setup
28
 
@@ -39,7 +35,7 @@ cd functionary_internal/evaluation/multilingual_bench/lm_eval_tasks
39
  export INCLUDE_PATH="$(pwd)"
40
  ```
41
 
42
- ### Run the Entire Portuguese (all 9 tasks)
43
 
44
  ```bash
45
  OPENAI_API_KEY="your-key" \
@@ -63,9 +59,6 @@ lm_eval --include_path $INCLUDE_PATH --tasks portuguese_mcq ...
63
 
64
  # Classification (hate speech, sentiment)
65
  lm_eval --include_path $INCLUDE_PATH --tasks portuguese_classification ...
66
-
67
- # Natural Language Inference (ASSIN2 RTE + FaQuAD NLI + ASSIN2 STS)
68
- lm_eval --include_path $INCLUDE_PATH --tasks portuguese_nli ...
69
  ```
70
 
71
  ### Run a Single Task
@@ -85,9 +78,6 @@ lm_eval --include_path $INCLUDE_PATH --tasks portuguese_hatebr ...
85
 
86
  # Sentiment analysis
87
  lm_eval --include_path $INCLUDE_PATH --tasks portuguese_tweetsentbr ...
88
-
89
- # Textual entailment
90
- lm_eval --include_path $INCLUDE_PATH --tasks portuguese_assin2_rte ...
91
  ```
92
 
93
  ### Run with a Local HuggingFace Model
@@ -106,8 +96,8 @@ lm_eval \
106
  ### Mix and Match
107
 
108
  ```bash
109
- # Run ENEM + ASSIN2 RTE only
110
- lm_eval --include_path $INCLUDE_PATH --tasks portuguese_enem,portuguese_assin2_rte ...
111
  ```
112
 
113
  ## Output
@@ -119,13 +109,11 @@ With `--log_samples`, the output directory contains:
119
 
120
  ## Dataset Sources
121
 
122
- | Dataset | Source | Config | Fields |
123
- | ---------------------- | -------------------------------------- | ------------------------------- | ----------------------------------------------------------- |
124
- | ENEM | `eduagarcia/enem_challenge` | — | question, choices, answerKey |
125
- | BLUEX | `eduagarcia-temp/BLUEX_without_images` | — | question, choices, answerKey |
126
- | OAB Exams | `eduagarcia/oab_exams` | — | question, choices, answerKey |
127
- | HateBR | `eduagarcia/portuguese_benchmark` | `HateBR_offensive_binary` | sentence, label |
128
- | Portuguese Hate Speech | `eduagarcia/portuguese_benchmark` | `Portuguese_Hate_Speech_binary` | sentence, label |
129
- | TweetSentBR | `eduagarcia/tweetsentbr_fewshot` | — | sentence, label |
130
- | ASSIN2 | `assin2` | — | premise, hypothesis, entailment_judgment, relatedness_score |
131
- | FaQuAD-NLI | `ruanchaves/faquad-nli` | — | question, answer, label |
 
12
  | 4 | `portuguese_hatebr` | Classification | `eduagarcia/portuguese_benchmark` (HateBR) | f1_macro |
13
  | 5 | `portuguese_hate_speech` | Classification | `eduagarcia/portuguese_benchmark` (Hate Speech) | f1_macro |
14
  | 6 | `portuguese_tweetsentbr` | Classification | `eduagarcia/tweetsentbr_fewshot` | f1_macro |
 
 
 
15
 
16
  ### Subgroups
17
 
18
+ | Group | Tasks |
19
+ | --------------------------- | -------------------------------- |
20
+ | `portuguese_mcq` | enem, bluex, oab_exams |
21
+ | `portuguese_classification` | hatebr, hate_speech, tweetsentbr |
 
22
 
23
  ## Setup
24
 
 
35
  export INCLUDE_PATH="$(pwd)"
36
  ```
37
 
38
+ ### Run the Entire Portuguese (all 6 tasks)
39
 
40
  ```bash
41
  OPENAI_API_KEY="your-key" \
 
59
 
60
  # Classification (hate speech, sentiment)
61
  lm_eval --include_path $INCLUDE_PATH --tasks portuguese_classification ...
 
 
 
62
  ```
63
 
64
  ### Run a Single Task
 
78
 
79
  # Sentiment analysis
80
  lm_eval --include_path $INCLUDE_PATH --tasks portuguese_tweetsentbr ...
 
 
 
81
  ```
82
 
83
  ### Run with a Local HuggingFace Model
 
96
  ### Mix and Match
97
 
98
  ```bash
99
+ # Run ENEM + HateBR only
100
+ lm_eval --include_path $INCLUDE_PATH --tasks portuguese_enem,portuguese_hatebr ...
101
  ```
102
 
103
  ## Output
 
109
 
110
  ## Dataset Sources
111
 
112
+ | Dataset | Source | Config | Fields |
113
+ | ---------------------- | -------------------------------------- | ------------------------------- | ---------------------------- |
114
+ | ENEM | `eduagarcia/enem_challenge` | — | question, choices, answerKey |
115
+ | BLUEX | `eduagarcia-temp/BLUEX_without_images` | — | question, choices, answerKey |
116
+ | OAB Exams | `eduagarcia/oab_exams` | — | question, choices, answerKey |
117
+ | HateBR | `eduagarcia/portuguese_benchmark` | `HateBR_offensive_binary` | sentence, label |
118
+ | Portuguese Hate Speech | `eduagarcia/portuguese_benchmark` | `Portuguese_Hate_Speech_binary` | sentence, label |
119
+ | TweetSentBR | `eduagarcia/tweetsentbr_fewshot` | — | sentence, label |
 
 
evals/portuguese/nli/portuguese_assin2_rte.yaml DELETED
@@ -1,28 +0,0 @@
1
- task: portuguese_assin2_rte
2
- task_alias: assin2_rte
3
- dataset_path: assin2
4
- test_split: test
5
- output_type: generate_until
6
- generation_kwargs:
7
- do_sample: false
8
- max_gen_toks: 8192
9
- until:
10
- - "<|endoftext|>"
11
- process_docs: !function utils.process_assin2_rte_docs
12
- doc_to_text: "Indique se a hipótese pode ser inferida a partir da premissa. Responda apenas com \"Sim\" ou \"Não\".\n\nPremissa: {{premise}}\nHipótese: {{hypothesis}}\nPergunta: A hipótese pode ser inferida pela premissa? Sim ou Não?\nResposta:"
13
- doc_to_target: "{{target}}"
14
- filter_list:
15
- - name: "get_label"
16
- filter:
17
- - function: "strip_think_recover"
18
- - function: "regex"
19
- regex_pattern: "(Sim|Não)"
20
- group_select: 0
21
- - function: "take_first"
22
- process_results: !function utils.process_nli_results
23
- metric_list:
24
- - metric: f1_macro
25
- aggregation: !function utils.macro_f1_agg
26
- higher_is_better: true
27
- metadata:
28
- version: 1.0
evals/portuguese/nli/portuguese_assin2_sts.yaml DELETED
@@ -1,22 +0,0 @@
1
- task: portuguese_assin2_sts
2
- task_alias: assin2_sts
3
- dataset_path: assin2
4
- test_split: test
5
- output_type: generate_until
6
- generation_kwargs:
7
- do_sample: false
8
- max_gen_toks: 8192
9
- until:
10
- - "<|endoftext|>"
11
- doc_to_text: "Avalie o grau de similaridade entre as duas frases abaixo. Dê uma pontuação entre 1,0 e 5,0 (1,0 = pouco similar, 5,0 = muito similar).\n\nFrase 1: {{premise}}\nFrase 2: {{hypothesis}}\nPergunta: Quão similares são as duas frases? Dê uma pontuação entre 1,0 a 5,0.\nResposta:"
12
- doc_to_target: !function utils.assin2_float_to_pt_str
13
- process_results: !function utils.process_sts_results
14
- metric_list:
15
- - metric: pearson
16
- aggregation: !function utils.pearson_agg
17
- higher_is_better: true
18
- - metric: mse
19
- aggregation: !function utils.mse_agg
20
- higher_is_better: false
21
- metadata:
22
- version: 1.0
evals/portuguese/nli/portuguese_faquad_nli.yaml DELETED
@@ -1,28 +0,0 @@
1
- task: portuguese_faquad_nli
2
- task_alias: faquad_nli
3
- dataset_path: ruanchaves/faquad-nli
4
- test_split: test
5
- output_type: generate_until
6
- generation_kwargs:
7
- do_sample: false
8
- max_gen_toks: 8192
9
- until:
10
- - "<|endoftext|>"
11
- process_docs: !function utils.process_faquad_nli_docs
12
- doc_to_text: "Julgue se a resposta satisfaz à pergunta de maneira satisfatória. Responda apenas com \"Sim\" ou \"Não\".\n\nPergunta: {{question}}\nResposta: {{answer}}\nA resposta dada satisfaz à pergunta? Sim ou Não?"
13
- doc_to_target: "{{target}}"
14
- filter_list:
15
- - name: "get_label"
16
- filter:
17
- - function: "strip_think_recover"
18
- - function: "regex"
19
- regex_pattern: "(Sim|Não)"
20
- group_select: 0
21
- - function: "take_first"
22
- process_results: !function utils.process_nli_results
23
- metric_list:
24
- - metric: f1_macro
25
- aggregation: !function utils.macro_f1_agg
26
- higher_is_better: true
27
- metadata:
28
- version: 1.0
evals/portuguese/nli/portuguese_nli.yaml DELETED
@@ -1,18 +0,0 @@
1
- # Natural Language Inference subgroup (ASSIN2 RTE + FaQuAD NLI + ASSIN2 STS)
2
- group: portuguese_nli
3
- task:
4
- - portuguese_assin2_rte
5
- - portuguese_faquad_nli
6
- - portuguese_assin2_sts
7
- aggregate_metric_list:
8
- - metric: f1_macro
9
- aggregation: mean
10
- weight_by_size: true
11
- - metric: pearson
12
- aggregation: mean
13
- weight_by_size: true
14
- - metric: mse
15
- aggregation: mean
16
- weight_by_size: true
17
- metadata:
18
- version: 1.0
evals/portuguese/nli/utils.py DELETED
@@ -1,102 +0,0 @@
1
- """Utility helpers for Portuguese NLI / STS tasks (generative mode).
2
-
3
- Covers ASSIN2 (RTE & STS) and FaQuAD-NLI.
4
- """
5
-
6
- import os as _os, sys as _sys # noqa: E401
7
- _sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..","..",)))
8
-
9
-
10
- import math
11
- import re
12
-
13
- from f1_utils import macro_f1_agg, process_results_f1 # noqa: F401
14
-
15
-
16
- # ── Document pre-processing ─────────────────────────────────────────
17
-
18
-
19
- def process_assin2_rte_docs(dataset):
20
- """Map entailment_judgment 0/1 → Não/Sim."""
21
-
22
- def _map(doc):
23
- doc["target"] = "Sim" if doc["entailment_judgment"] == 1 else "Não"
24
- return doc
25
-
26
- return dataset.map(_map)
27
-
28
-
29
- def process_faquad_nli_docs(dataset):
30
- """Map label 0/1 → Não/Sim."""
31
-
32
- def _map(doc):
33
- doc["target"] = "Sim" if doc["label"] == 1 else "Não"
34
- return doc
35
-
36
- return dataset.map(_map)
37
-
38
-
39
- # ── NLI result processing ────────────────────────────────────────────
40
-
41
-
42
- def process_nli_results(doc, results):
43
- """Return (pred, gold) tuple for macro-F1 aggregation."""
44
- return process_results_f1(doc, results)
45
-
46
-
47
- # ── STS helpers ──────────────────────────────────────────────────────
48
-
49
-
50
- def assin2_float_to_pt_str(doc):
51
- """Format relatedness_score as a Portuguese-style decimal string (comma)."""
52
- return "{:.1f}".format(doc["relatedness_score"]).replace(".", ",")
53
-
54
-
55
- def _extract_float(text):
56
- """Extract a float value from text, handling Portuguese comma notation."""
57
- text = text.strip()
58
- # Try to find a number pattern (comma or dot as decimal separator)
59
- match = re.search(r"(\d+[,.]?\d*)", text)
60
- if match:
61
- num_str = match.group(1).replace(",", ".")
62
- try:
63
- val = float(num_str)
64
- return max(1.0, min(5.0, val)) # Clip to [1.0, 5.0]
65
- except ValueError:
66
- pass
67
- return 5.0 # Default fallback (same as original)
68
-
69
-
70
- def process_sts_results(doc, results):
71
- """Extract predicted float and pair with gold for pearson/mse."""
72
- pred_text = results[0].strip() if results[0] else ""
73
- pred_val = _extract_float(pred_text)
74
- gold_val = doc["relatedness_score"]
75
- return {
76
- "pearson": (pred_val, gold_val),
77
- "mse": (pred_val, gold_val),
78
- }
79
-
80
-
81
- def pearson_agg(items):
82
- """Compute Pearson correlation coefficient."""
83
- preds = [item[0] for item in items]
84
- golds = [item[1] for item in items]
85
- n = len(preds)
86
- if n < 2:
87
- return 0.0
88
- mean_p = sum(preds) / n
89
- mean_g = sum(golds) / n
90
- cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(preds, golds)) / n
91
- std_p = math.sqrt(sum((p - mean_p) ** 2 for p in preds) / n)
92
- std_g = math.sqrt(sum((g - mean_g) ** 2 for g in golds) / n)
93
- if std_p * std_g == 0:
94
- return 0.0
95
- return cov / (std_p * std_g)
96
-
97
-
98
- def mse_agg(items):
99
- """Compute mean squared error."""
100
- preds = [item[0] for item in items]
101
- golds = [item[1] for item in items]
102
- return sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(preds)
evals/portuguese/portuguese.yaml CHANGED
@@ -8,11 +8,9 @@
8
  # Sub-groups:
9
  # mcq → exam multiple-choice (enem, bluex, oab_exams)
10
  # classification → sentiment / hate / emotion classification
11
- # nli → natural language inference & textual similarity
12
  group: portuguese
13
  task:
14
  - portuguese_mcq
15
  - portuguese_classification
16
- - portuguese_nli
17
  metadata:
18
  version: 1.0
 
8
  # Sub-groups:
9
  # mcq → exam multiple-choice (enem, bluex, oab_exams)
10
  # classification → sentiment / hate / emotion classification
 
11
  group: portuguese
12
  task:
13
  - portuguese_mcq
14
  - portuguese_classification
 
15
  metadata:
16
  version: 1.0
evals/run_eval.py CHANGED
@@ -456,10 +456,21 @@ def run_single_eval(
      gen_kwargs: str | None = None,
      hf_hub: dict | None = None,
      endpoint_kind: str = "chat_completions",
+     max_retries: int = 3,
+     timeout: int = 300,
  ) -> dict | None:
-     """Run a single lm_eval evaluation via the Python API and return results."""
+     """Run a single lm_eval evaluation via the Python API and return results.
+
+     ``max_retries`` and ``timeout`` are forwarded into lm-eval's TemplateAPI
+     via ``model_args``. Defaults match lm-eval's own stock values; the TOML
+     ``[defaults]`` block typically overrides them for Functionary endpoints
+     (see eval_config.toml for the rationale).
+     """
      _ensure_lm_eval_api_key()
-     model_args = f"model={model_id},base_url={base_url},num_concurrent={num_concurrent}"
+     model_args = (
+         f"model={model_id},base_url={base_url},num_concurrent={num_concurrent},"
+         f"max_retries={max_retries},timeout={timeout}"
+     )

      tracker = _build_evaluation_tracker(output_path, hf_hub)

@@ -957,6 +968,11 @@ def main():
              if args.num_concurrent is not None
              else defaults.get("num_concurrent", 5)
          )
+         # Resilience knobs read from [defaults]; see eval_config.toml.
+         # Stock lm-eval defaults (3 / 300) are kept as fallbacks so the
+         # behavior is unchanged when the TOML doesn't override them.
+         max_retries = int(defaults.get("max_retries", 3))
+         timeout = int(defaults.get("timeout", 300))

          eval_kwargs = dict(
              include_path=SCRIPT_DIR,
@@ -976,6 +992,8 @@ def main():
              gen_kwargs=gen_kwargs,
              hf_hub=hf_hub,
              endpoint_kind=model.get("endpoint_kind", "chat_completions"),
+             max_retries=max_retries,
+             timeout=timeout,
          )

          header = f"[{run_idx}/{total}] {model['name']} x {task['name']}"
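The `run_eval.py` change threads two values from the TOML `[defaults]` table into the `model_args` string handed to lm-eval. The flow can be sketched in isolation as below. This is a minimal illustration, not the code in `run_eval.py`: `resolve_resilience` and `build_model_args` are hypothetical helper names, and `defaults` is assumed to be a plain dict parsed from the TOML file.

```python
def resolve_resilience(defaults: dict) -> tuple[int, int]:
    """Read max_retries / timeout from a parsed [defaults] table.

    Falls back to 3 retries and a 300-second timeout, the values the
    diff describes as lm-eval's stock defaults.
    """
    max_retries = int(defaults.get("max_retries", 3))
    timeout = int(defaults.get("timeout", 300))
    return max_retries, timeout


def build_model_args(
    model_id: str,
    base_url: str,
    num_concurrent: int,
    max_retries: int = 3,
    timeout: int = 300,
) -> str:
    """Assemble the comma-separated key=value string passed as model_args."""
    return (
        f"model={model_id},base_url={base_url},num_concurrent={num_concurrent},"
        f"max_retries={max_retries},timeout={timeout}"
    )


if __name__ == "__main__":
    # Hypothetical override, as eval_config.toml might supply it.
    retries, timeout = resolve_resilience({"max_retries": 8, "timeout": 900})
    print(build_model_args("my-model", "http://localhost:8000/v1", 5, retries, timeout))
```

Because both helpers fall back to 3 and 300, a config without these keys produces the same `model_args` values as before the change, only spelled out explicitly.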
evals/ukrainian/README.md CHANGED
@@ -7,15 +7,14 @@ Ukrainian (`ukr_Cyrl` / macro `uk`) evaluation suite for the
7
 
8
  ### Custom Tasks (require `--include_path`)
9
 
10
- | # | Task Name | Category | Dataset (HuggingFace) | Metric |
11
- | --- | ---------------------------- | ----------------- | -------------------------------------------------------------- | ------------------ |
12
- | 1 | `ukrainian_sib200` | Classification | `Davlan/sib200` (`ukr_Cyrl`) | f1_macro |
13
- | 2 | `ukrainian_belebele` | MCQ | `facebook/belebele` (`ukr_Cyrl`) | f1_macro |
14
- | 3 | `ukrainian_global_mmlu` | MCQ | `CohereLabs/Global-MMLU` (`uk`, filtered to CS+CA, ~400 items) | f1_macro |
15
- | 4 | `ukrainian_zno` | MCQ (native exam) | `osyvokon/zno` | f1_macro |
16
- | 5 | `ukrainian_squad` | Extractive QA | `HPLT/ua-squad` (revised UA-SQuAD) | exact_match + f1 |
17
- | 6 | `ukrainian_massivesumm_long` | Summarization | `MaLA-LM/MassiveSumm_long` (filtered `language=ukr`) | rouge_l |
18
- | 7 | `ukrainian_polywrite` | Open generation | `MaLA-LM/PolyWrite` (filtered `lang_script=ukr_Cyrl`) | open_quality_score |
19
 
20
  #### Subgroups
21
 
@@ -24,7 +23,6 @@ Ukrainian (`ukr_Cyrl` / macro `uk`) evaluation suite for the
24
  | `ukrainian_classification` | sib200 |
25
  | `ukrainian_mcq` | belebele, global_mmlu, zno |
26
  | `ukrainian_qa` | squad |
27
- | `ukrainian_summarization` | massivesumm_long |
28
  | `ukrainian_open_generation` | polywrite |
29
 
30
  ## Setup
@@ -41,7 +39,7 @@ All commands must be run from the `multilingual_bench/` directory:
41
  cd /path/to/functionary_internal/evaluation/multilingual_bench
42
  ```
43
 
44
- ### Run the Entire Ukrainian Suite (all 7 tasks)
45
 
46
  ```bash
47
  OPENAI_API_KEY="$OPENROUTER_API_KEY" \
@@ -70,7 +68,6 @@ python run_eval.py --models gpt-5-mini --tasks ukrainian
70
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_classification ...
71
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_mcq ...
72
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_qa ...
73
- lm_eval --include_path lm_eval_tasks --tasks ukrainian_summarization ...
74
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_open_generation ...
75
  ```
76
 
@@ -82,7 +79,6 @@ lm_eval --include_path lm_eval_tasks --tasks ukrainian_belebele ...
82
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_global_mmlu ...
83
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_zno ...
84
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_squad ...
85
- lm_eval --include_path lm_eval_tasks --tasks ukrainian_massivesumm_long ...
86
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_polywrite ...
87
  ```
88
 
@@ -95,26 +91,14 @@ With `--log_samples`, the output directory contains:
95
 
96
  ## Dataset Sources
97
 
98
- | Dataset | Source | Config | Notes |
99
- | ---------------- | -------------------------- | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
100
- | SIB-200 | `Davlan/sib200` | `ukr_Cyrl` | text + ClassLabel `category` (7 topics) |
101
- | Belebele | `facebook/belebele` | `ukr_Cyrl` | flores_passage + question + 4 mc_answers, `correct_answer_num` 1-4 |
102
- | Global-MMLU | `CohereLabs/Global-MMLU` | `uk` | Loaded from the **full** Global-MMLU (Lite doesn't ship `uk`) and filtered in `process_global_mmlu_docs` to `cultural_sensitivity_label ∈ {CS, CA}`, giving ~400 culturally-annotated items — the same subset Lite ships for other languages. Schema: `question`, `option_a..d`, `answer` letter, `subject`, `cultural_sensitivity_label`. |
103
- | ZNO | `osyvokon/zno` | — | Native Ukrainian high-school exam (Ukrainian language & literature + history of Ukraine). 751 test items (2020-2023). Cyrillic markers А/Б/В/Г/Д, gold in `correct_answers[0]` |
104
- | UA-SQuAD | `HPLT/ua-squad` | — | Native Ukrainian extractive QA (revised UA-SQuAD). 7,729 rows, ~50/50 `train`/`test` (we use `test`). Standard SQuAD schema: `id`, `context`, `question`, `answers.{text, answer_start}`. MIT license. |
105
- | MassiveSumm long | `MaLA-LM/MassiveSumm_long` | — (filter `language=ukr`) | `text`, `summary`, `language`; longer articles; gated |
106
- | PolyWrite | `MaLA-LM/PolyWrite` | — (filter `lang_script=ukr_Cyrl`) | `prompt_translated`, `category`, `lang_script` (no reference answer) |
107
-
108
- ### Gated datasets
109
-
110
- The MassiveSumm dataset is gated on Hugging Face. Accept the terms (once) and export an HF token before running:
111
-
112
- - MassiveSumm long: <https://huggingface.co/datasets/MaLA-LM/MassiveSumm_long>
113
-
114
- ```bash
115
- export HF_TOKEN="hf_..."
116
- huggingface-cli login # one-time, optional if HF_TOKEN is exported
117
- ```
118
 
119
  ### LLM-judge tasks
120
 
 
7
 
8
  ### Custom Tasks (require `--include_path`)
9
 
10
+ | # | Task Name | Category | Dataset (HuggingFace) | Metric |
11
+ | --- | ----------------------- | ----------------- | -------------------------------------------------------------- | ------------------ |
12
+ | 1 | `ukrainian_sib200` | Classification | `Davlan/sib200` (`ukr_Cyrl`) | f1_macro |
13
+ | 2 | `ukrainian_belebele` | MCQ | `facebook/belebele` (`ukr_Cyrl`) | f1_macro |
14
+ | 3 | `ukrainian_global_mmlu` | MCQ | `CohereLabs/Global-MMLU` (`uk`, filtered to CS+CA, ~400 items) | f1_macro |
15
+ | 4 | `ukrainian_zno` | MCQ (native exam) | `osyvokon/zno` | f1_macro |
16
+ | 5 | `ukrainian_squad` | Extractive QA | `HPLT/ua-squad` (revised UA-SQuAD) | exact_match + f1 |
17
+ | 6 | `ukrainian_polywrite` | Open generation | `MaLA-LM/PolyWrite` (filtered `lang_script=ukr_Cyrl`) | open_quality_score |
 
18
 
19
  #### Subgroups
20
 
 
23
  | `ukrainian_classification` | sib200 |
24
  | `ukrainian_mcq` | belebele, global_mmlu, zno |
25
  | `ukrainian_qa` | squad |
 
26
  | `ukrainian_open_generation` | polywrite |
27
 
28
  ## Setup
 
39
  cd /path/to/functionary_internal/evaluation/multilingual_bench
40
  ```
41
 
42
+ ### Run the Entire Ukrainian Suite (all 6 tasks)
43
 
44
  ```bash
45
  OPENAI_API_KEY="$OPENROUTER_API_KEY" \
 
68
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_classification ...
69
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_mcq ...
70
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_qa ...
 
71
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_open_generation ...
72
  ```
73
 
 
79
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_global_mmlu ...
80
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_zno ...
81
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_squad ...
 
82
  lm_eval --include_path lm_eval_tasks --tasks ukrainian_polywrite ...
83
  ```
84
 
 
91
 
92
  ## Dataset Sources
93
 
94
+ | Dataset | Source | Config | Notes |
95
+ | ----------- | ------------------------ | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
96
+ | SIB-200 | `Davlan/sib200` | `ukr_Cyrl` | text + ClassLabel `category` (7 topics) |
97
+ | Belebele | `facebook/belebele` | `ukr_Cyrl` | flores_passage + question + 4 mc_answers, `correct_answer_num` 1-4 |
98
+ | Global-MMLU | `CohereLabs/Global-MMLU` | `uk` | Loaded from the **full** Global-MMLU (Lite doesn't ship `uk`) and filtered in `process_global_mmlu_docs` to `cultural_sensitivity_label ∈ {CS, CA}`, giving ~400 culturally-annotated items — the same subset Lite ships for other languages. Schema: `question`, `option_a..d`, `answer` letter, `subject`, `cultural_sensitivity_label`. |
99
+ | ZNO | `osyvokon/zno` | — | Native Ukrainian high-school exam (Ukrainian language & literature + history of Ukraine). 751 test items (2020-2023). Cyrillic markers А/Б/В/Г/Д, gold in `correct_answers[0]` |
100
+ | UA-SQuAD | `HPLT/ua-squad` | — | Native Ukrainian extractive QA (revised UA-SQuAD). 7,729 rows, ~50/50 `train`/`test` (we use `test`). Standard SQuAD schema: `id`, `context`, `question`, `answers.{text, answer_start}`. MIT license. |
101
+ | PolyWrite | `MaLA-LM/PolyWrite` | — (filter `lang_script=ukr_Cyrl`) | `prompt_translated`, `category`, `lang_script` (no reference answer) |
 
 
 
 
 
 
 
 
 
 
 
 
102
 
103
  ### LLM-judge tasks
104
 
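The Global-MMLU row in the README describes keeping only rows whose `cultural_sensitivity_label` is `CS` or `CA`. The body of `process_global_mmlu_docs` is not part of this diff, so the following is only a sketch of that kind of filter over plain dicts. The function name `keep_culturally_annotated` and the sample rows are hypothetical; the real task presumably operates on a Hugging Face `datasets.Dataset`.

```python
# Labels the README says are retained: CS and CA.
CULTURAL_LABELS = {"CS", "CA"}


def keep_culturally_annotated(rows: list[dict]) -> list[dict]:
    """Return only rows whose cultural_sensitivity_label is CS or CA."""
    return [
        row for row in rows
        if row.get("cultural_sensitivity_label") in CULTURAL_LABELS
    ]


if __name__ == "__main__":
    # Made-up rows using the column names listed in the README.
    sample = [
        {"question": "q1", "answer": "A", "cultural_sensitivity_label": "CS"},
        {"question": "q2", "answer": "B", "cultural_sensitivity_label": "CA"},
        {"question": "q3", "answer": "C", "cultural_sensitivity_label": "other"},
        {"question": "q4", "answer": "D"},  # label missing: dropped
    ]
    print([row["question"] for row in keep_culturally_annotated(sample)])
```

With a `datasets.Dataset`, the same predicate could be passed to `Dataset.filter`, which is the pattern the removed summarization `utils.py` uses for its language filter.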
evals/ukrainian/summarization/_default_summarization_yaml DELETED
@@ -1,18 +0,0 @@
- # Shared config for Ukrainian summarization tasks (MassiveSumm).
- # The Ukrainian-only filter is applied in process_docs (the upstream
- # datasets are highly multilingual single-table dumps, no per-language
- # config). Scoring is sentence-level ROUGE-L F1.
- output_type: generate_until
- test_split: train
- generation_kwargs:
-   do_sample: false
-   max_gen_toks: 8192
-   until:
-     - "<|endoftext|>"
- process_results: !function utils.process_results
- metric_list:
-   - metric: rouge_l
-     aggregation: !function utils.rouge_l_agg
-     higher_is_better: true
- metadata:
-   version: 1.0
evals/ukrainian/summarization/ukrainian_massivesumm_long.yaml DELETED
@@ -1,14 +0,0 @@
- task: ukrainian_massivesumm_long
- task_alias: massivesumm_long
- # Gated dataset: accept terms at
- # https://huggingface.co/datasets/MaLA-LM/MassiveSumm_long and export HF_TOKEN.
- dataset_path: MaLA-LM/MassiveSumm_long
- include: _default_summarization_yaml
- process_docs: !function utils.process_docs
- generation_kwargs:
-   do_sample: false
-   max_gen_toks: 8192
-   until:
-     - "<|endoftext|>"
- doc_to_text: "Ти — система реферування новин.\nСтисни наведену нижче статтю в короткий абзац (3-5 речень) українською мовою. Не додавай коментарів.\n\nСтаття:\n{{text}}\n\nСтислий виклад:"
- doc_to_target: "{{summary}}"
evals/ukrainian/summarization/ukrainian_summarization.yaml DELETED
@@ -1,10 +0,0 @@
- # Summarization subgroup (MassiveSumm long)
- group: ukrainian_summarization
- task:
-   - ukrainian_massivesumm_long
- aggregate_metric_list:
-   - metric: rouge_l
-     aggregation: mean
-     weight_by_size: true
- metadata:
-   version: 1.0
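The removed group config aggregated `rouge_l` with `aggregation: mean` and `weight_by_size: true`. Assuming that `weight_by_size` weights each subtask's score by its number of documents, which is how lm-eval's group configuration is generally described, the arithmetic looks like the sketch below. `size_weighted_mean` is a hypothetical name and the scores are made up.

```python
def size_weighted_mean(scores_and_sizes: list[tuple[float, int]]) -> float:
    """Average per-task scores, weighting each by its number of documents."""
    total = sum(size for _, size in scores_and_sizes)
    if total == 0:
        return 0.0
    return sum(score * size for score, size in scores_and_sizes) / total


if __name__ == "__main__":
    # Two illustrative subtasks: (score, number of documents).
    print(size_weighted_mean([(0.5, 100), (1.0, 300)]))  # 350 / 400 = 0.875
```

For this group the distinction was moot: with a single subtask (`ukrainian_massivesumm_long`), the weighted and unweighted means are the same.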
evals/ukrainian/summarization/utils.py DELETED
@@ -1,111 +0,0 @@
- """Utility helpers for Ukrainian summarization tasks (MassiveSumm short/long).
-
- Both MassiveSumm subsets are highly multilingual single-table datasets
- (one ``train`` split, no language configs). We filter to Ukrainian rows
- inside ``process_docs``. The HF dataset is **gated** — accept the terms
- on the dataset page once and export ``HF_TOKEN`` before running.
-
- Scoring uses ROUGE-L F1 via the ``rouge_score`` package, which is
- already a transitive dependency of lm-evaluation-harness.
- """
-
- from __future__ import annotations
-
- import re
- import string
-
- import datasets
-
-
- _UKRAINIAN_LANG_CODES = {"ukr", "uk"}
-
-
- def _strip_think_tags(text: str) -> str:
-     """Strip <think>...</think> reasoning wrapper (e.g. Qwen thinking models)."""
-     if "</think>" in text:
-         return text.split("</think>")[-1].strip()
-     return text
-
-
- def _filter_ukrainian(dataset: datasets.Dataset) -> datasets.Dataset:
-     """Keep rows whose ``language`` field is one of Ukrainian variants."""
-     if "language" not in dataset.column_names:
-         return dataset
-     return dataset.filter(lambda row: str(row.get("language", "")).lower() in _UKRAINIAN_LANG_CODES)
-
-
- def _normalise_doc(doc):
-     """Project the columns we actually need."""
-     text = (doc.get("text") or "").strip()
-     summary = (doc.get("summary") or "").strip()
-     title = (doc.get("title") or "").strip()
-     return {
-         "text": text,
-         "summary": summary,
-         "title": title,
-     }
-
-
- def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
-     filtered = _filter_ukrainian(dataset)
-     return filtered.map(_normalise_doc, remove_columns=[
-         c for c in filtered.column_names if c not in ("text", "summary", "title")
-     ])
-
-
- # ── ROUGE-L scoring ──────────────────────────────────────────────────
-
-
- _PUNCT_TABLE = str.maketrans("", "", string.punctuation)
- _WHITESPACE_RE = re.compile(r"\s+")
-
-
- def _normalise(text: str) -> str:
-     text = text.translate(_PUNCT_TABLE)
-     text = _WHITESPACE_RE.sub(" ", text)
-     return text.strip().lower()
-
-
- def _lcs_length(a, b):
-     m, n = len(a), len(b)
-     if m == 0 or n == 0:
-         return 0
-     dp = [0] * (n + 1)
-     for i in range(1, m + 1):
-         prev = 0
-         for j in range(1, n + 1):
-             tmp = dp[j]
-             if a[i - 1] == b[j - 1]:
-                 dp[j] = prev + 1
-             else:
-                 dp[j] = max(dp[j], dp[j - 1])
-             prev = tmp
-     return dp[n]
-
-
- def _rouge_l_f1(pred: str, gold: str) -> float:
-     """Compute sentence-level ROUGE-L F1 (no stemming) between pred and gold."""
-     pred_tokens = _normalise(pred).split()
-     gold_tokens = _normalise(gold).split()
-     if not pred_tokens or not gold_tokens:
-         return 0.0
-     lcs = _lcs_length(pred_tokens, gold_tokens)
-     if lcs == 0:
-         return 0.0
-     precision = lcs / len(pred_tokens)
-     recall = lcs / len(gold_tokens)
-     return 2 * precision * recall / (precision + recall)
-
-
- def process_results(doc, results):
-     raw_response = results[0].strip() if results and results[0] else ""
-     pred = _strip_think_tags(raw_response)
-     gold = (doc.get("summary") or "").strip()
-     return {"rouge_l": (gold, pred)}
-
-
- def rouge_l_agg(items):
-     if not items:
-         return 0.0
-     scores = [_rouge_l_f1(pred, gold) for gold, pred in items]
-     return sum(scores) / len(scores)
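The removed `utils.py` scored summaries with a hand-rolled ROUGE-L F1: strip punctuation, lowercase, split on whitespace, take the longest common subsequence (LCS) of the two token lists, then combine LCS-based precision and recall. A condensed, standalone sketch of the same computation follows. The function names are shortened for illustration and the example sentences are made up.

```python
import re
import string

# ASCII punctuation only, mirroring the removed helper.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokens(text: str) -> list[str]:
    """Strip ASCII punctuation, collapse whitespace, lowercase, split."""
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip().lower().split()


def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length using a single rolling row."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # value of dp[j - 1] from the previous row
        for j, y in enumerate(b, start=1):
            tmp = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = tmp
    return dp[-1]


def rouge_l_f1(pred: str, gold: str) -> float:
    """ROUGE-L F1 between a prediction and a reference, without stemming."""
    p, g = tokens(pred), tokens(gold)
    if not p or not g:
        return 0.0
    lcs = lcs_length(p, g)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(g)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # LCS is "the cat on the mat" (5 tokens) out of 6 on each side,
    # so precision = recall = F1 = 5/6.
    print(rouge_l_f1("The cat sat on the mat.", "The cat is on the mat."))
```

Two properties of this approach are worth noting. `string.punctuation` covers ASCII only, so Ukrainian typographic marks such as `«` and `»` stay attached to tokens. And because the score is computed per pair and then averaged in `rouge_l_agg`, the result is a mean of sentence-level scores rather than a corpus-level ROUGE-L.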
evals/ukrainian/ukrainian.yaml CHANGED
@@ -10,14 +10,12 @@
  # Metrics:
  # classification & mcq → f1_macro (per sub-group)
  # qa → exact_match + f1 (per sub-group)
- # summarization → rouge_l (per sub-group)
  # open_generation → llm_judge_score / open_quality_score (per sub-group)
  group: ukrainian
  task:
    - ukrainian_classification
    - ukrainian_mcq
    - ukrainian_qa
-   - ukrainian_summarization
    - ukrainian_open_generation
  metadata:
    version: 1.0