Mirror evals/ from c1978f83e59e
- evals/albanian/README.md +21 -30
- evals/albanian/albanian.yaml +0 -2
- evals/albanian/summarization/_default_summarization_yaml +0 -18
- evals/albanian/summarization/albanian_massivesumm_long.yaml +0 -14
- evals/albanian/summarization/albanian_massivesumm_short.yaml +0 -9
- evals/albanian/summarization/albanian_summarization.yaml +0 -11
- evals/albanian/summarization/utils.py +0 -111
- evals/eval_config.toml +11 -9
- evals/portuguese/README.md +15 -27
- evals/portuguese/nli/portuguese_assin2_rte.yaml +0 -28
- evals/portuguese/nli/portuguese_assin2_sts.yaml +0 -22
- evals/portuguese/nli/portuguese_faquad_nli.yaml +0 -28
- evals/portuguese/nli/portuguese_nli.yaml +0 -18
- evals/portuguese/nli/utils.py +0 -102
- evals/portuguese/portuguese.yaml +0 -2
- evals/run_eval.py +20 -2
- evals/ukrainian/README.md +17 -33
- evals/ukrainian/summarization/_default_summarization_yaml +0 -18
- evals/ukrainian/summarization/ukrainian_massivesumm_long.yaml +0 -14
- evals/ukrainian/summarization/ukrainian_summarization.yaml +0 -10
- evals/ukrainian/summarization/utils.py +0 -111
- evals/ukrainian/ukrainian.yaml +0 -2
evals/albanian/README.md
CHANGED
|
@@ -7,24 +7,21 @@ Albanian (Tosk, `als_Latn` / macro `sq`) evaluation suite for the
|
|
| 7 |
|
| 8 |
### Custom Tasks (require `--include_path`)
|
| 9 |
|
| 10 |
-
| # | Task Name
|
| 11 |
-
| --- | ----------------------
|
| 12 |
-
| 1 | `albanian_sib200`
|
| 13 |
-
| 2 | `albanian_belebele`
|
| 14 |
-
| 3 | `albanian_global_mmlu`
|
| 15 |
-
| 4 | `
|
| 16 |
-
| 5 | `
|
| 17 |
-
| 6 | `albanian_aya` | Open generation | `CohereLabs/aya_evaluation_suite` (`dolly_machine_translated`, filtered `language=sqi`) | llm_judge_score |
|
| 18 |
-
| 7 | `albanian_polywrite` | Open generation | `MaLA-LM/PolyWrite` (filtered `lang_script=sqi_Latn`) | open_quality_score |
|
| 19 |
|
| 20 |
#### Subgroups
|
| 21 |
|
| 22 |
-
| Group | Tasks
|
| 23 |
-
| -------------------------- | ---------------------
|
| 24 |
-
| `albanian_classification` | sib200
|
| 25 |
-
| `albanian_mcq` | belebele, global_mmlu
|
| 26 |
-
| `
|
| 27 |
-
| `albanian_open_generation` | aya, polywrite |
|
| 28 |
|
| 29 |
## Setup
|
| 30 |
|
|
@@ -40,7 +37,7 @@ All commands must be run from the `multilingual_bench/` directory:
|
|
| 40 |
cd /path/to/functionary_internal/evaluation/multilingual_bench
|
| 41 |
```
|
| 42 |
|
| 43 |
-
### Run the Entire Albanian Suite (all
|
| 44 |
|
| 45 |
```bash
|
| 46 |
OPENAI_API_KEY="$OPENROUTER_API_KEY" \
|
|
@@ -68,7 +65,6 @@ python run_eval.py --models gpt-5-mini --tasks albanian
|
|
| 68 |
```bash
|
| 69 |
lm_eval --include_path lm_eval_tasks --tasks albanian_classification ...
|
| 70 |
lm_eval --include_path lm_eval_tasks --tasks albanian_mcq ...
|
| 71 |
-
lm_eval --include_path lm_eval_tasks --tasks albanian_summarization ...
|
| 72 |
lm_eval --include_path lm_eval_tasks --tasks albanian_open_generation ...
|
| 73 |
```
|
| 74 |
|
|
@@ -78,8 +74,6 @@ lm_eval --include_path lm_eval_tasks --tasks albanian_open_generation ...
|
|
| 78 |
lm_eval --include_path lm_eval_tasks --tasks albanian_sib200 ...
|
| 79 |
lm_eval --include_path lm_eval_tasks --tasks albanian_belebele ...
|
| 80 |
lm_eval --include_path lm_eval_tasks --tasks albanian_global_mmlu ...
|
| 81 |
-
lm_eval --include_path lm_eval_tasks --tasks albanian_massivesumm_short ...
|
| 82 |
-
lm_eval --include_path lm_eval_tasks --tasks albanian_massivesumm_long ...
|
| 83 |
lm_eval --include_path lm_eval_tasks --tasks albanian_aya ...
|
| 84 |
lm_eval --include_path lm_eval_tasks --tasks albanian_polywrite ...
|
| 85 |
```
|
|
@@ -93,22 +87,19 @@ With `--log_samples`, the output directory contains:
|
|
| 93 |
|
| 94 |
## Dataset Sources
|
| 95 |
|
| 96 |
-
| Dataset
|
| 97 |
-
| ----------------
|
| 98 |
-
| SIB-200
|
| 99 |
-
| Belebele
|
| 100 |
-
| Global-MMLU-Lite
|
| 101 |
-
|
|
| 102 |
-
|
|
| 103 |
-
| Aya Eval | `CohereLabs/aya_evaluation_suite` | `dolly_machine_translated` (filter `language=sqi`) | `inputs`, `targets`, `language`, `script` |
|
| 104 |
-
| PolyWrite | `MaLA-LM/PolyWrite` | — (filter `lang_script=sqi_Latn`) | `prompt_translated`, `category`, `lang_script` (no reference answer) |
|
| 105 |
|
| 106 |
### Gated datasets
|
| 107 |
|
| 108 |
-
|
| 109 |
|
| 110 |
- Aya Eval: <https://huggingface.co/datasets/CohereLabs/aya_evaluation_suite>
|
| 111 |
-
- MassiveSumm short / long: <https://huggingface.co/datasets/MaLA-LM/MassiveSumm_short> and <https://huggingface.co/datasets/MaLA-LM/MassiveSumm_long>
|
| 112 |
|
| 113 |
```bash
|
| 114 |
export HF_TOKEN="hf_..."
|
|
|
|
| 7 |
|
| 8 |
### Custom Tasks (require `--include_path`)
|
| 9 |
|
| 10 |
+
| # | Task Name | Category | Dataset (HuggingFace) | Metric |
|
| 11 |
+
| --- | ---------------------- | --------------- | --------------------------------------------------------------------------------------- | ------------------ |
|
| 12 |
+
| 1 | `albanian_sib200` | Classification | `Davlan/sib200` (`als_Latn`) | f1_macro |
|
| 13 |
+
| 2 | `albanian_belebele` | MCQ | `facebook/belebele` (`als_Latn`) | f1_macro |
|
| 14 |
+
| 3 | `albanian_global_mmlu` | MCQ | `CohereLabs/Global-MMLU-Lite` (`sq`, v2) | f1_macro |
|
| 15 |
+
| 4 | `albanian_aya` | Open generation | `CohereLabs/aya_evaluation_suite` (`dolly_machine_translated`, filtered `language=sqi`) | llm_judge_score |
|
| 16 |
+
| 5 | `albanian_polywrite` | Open generation | `MaLA-LM/PolyWrite` (filtered `lang_script=sqi_Latn`) | open_quality_score |
|
| 17 |
|
| 18 |
#### Subgroups
|
| 19 |
|
| 20 |
+
| Group | Tasks |
|
| 21 |
+
| -------------------------- | --------------------- |
|
| 22 |
+
| `albanian_classification` | sib200 |
|
| 23 |
+
| `albanian_mcq` | belebele, global_mmlu |
|
| 24 |
+
| `albanian_open_generation` | aya, polywrite |
|
|
|
|
| 25 |
|
| 26 |
## Setup
|
| 27 |
|
|
|
|
| 37 |
cd /path/to/functionary_internal/evaluation/multilingual_bench
|
| 38 |
```
|
| 39 |
|
| 40 |
+
### Run the Entire Albanian Suite (all 5 tasks)
|
| 41 |
|
| 42 |
```bash
|
| 43 |
OPENAI_API_KEY="$OPENROUTER_API_KEY" \
|
|
|
|
| 65 |
```bash
|
| 66 |
lm_eval --include_path lm_eval_tasks --tasks albanian_classification ...
|
| 67 |
lm_eval --include_path lm_eval_tasks --tasks albanian_mcq ...
|
|
|
|
| 68 |
lm_eval --include_path lm_eval_tasks --tasks albanian_open_generation ...
|
| 69 |
```
|
| 70 |
|
|
|
|
| 74 |
lm_eval --include_path lm_eval_tasks --tasks albanian_sib200 ...
|
| 75 |
lm_eval --include_path lm_eval_tasks --tasks albanian_belebele ...
|
| 76 |
lm_eval --include_path lm_eval_tasks --tasks albanian_global_mmlu ...
|
| 77 |
lm_eval --include_path lm_eval_tasks --tasks albanian_aya ...
|
| 78 |
lm_eval --include_path lm_eval_tasks --tasks albanian_polywrite ...
|
| 79 |
```
|
|
|
|
| 87 |
|
| 88 |
## Dataset Sources
|
| 89 |
|
| 90 |
+
| Dataset | Source | Config | Notes |
|
| 91 |
+
| ---------------- | --------------------------------- | -------------------------------------------------- | -------------------------------------------------------------------- |
|
| 92 |
+
| SIB-200 | `Davlan/sib200` | `als_Latn` | text + ClassLabel `category` (7 topics) |
|
| 93 |
+
| Belebele | `facebook/belebele` | `als_Latn` | flores_passage + question + 4 mc_answers, `correct_answer_num` 1-4 |
|
| 94 |
+
| Global-MMLU-Lite | `CohereLabs/Global-MMLU-Lite` | `sq` | question + `option_a..d` + `answer` letter (400 samples, CS+CA) |
|
| 95 |
+
| Aya Eval | `CohereLabs/aya_evaluation_suite` | `dolly_machine_translated` (filter `language=sqi`) | `inputs`, `targets`, `language`, `script` |
|
| 96 |
+
| PolyWrite | `MaLA-LM/PolyWrite` | — (filter `lang_script=sqi_Latn`) | `prompt_translated`, `category`, `lang_script` (no reference answer) |
|
| 97 |
|
| 98 |
### Gated datasets
|
| 99 |
|
| 100 |
+
The Aya Eval dataset is gated on Hugging Face. Accept the terms (once) and export an HF token before running:
|
| 101 |
|
| 102 |
- Aya Eval: <https://huggingface.co/datasets/CohereLabs/aya_evaluation_suite>
|
|
|
|
| 103 |
|
| 104 |
```bash
|
| 105 |
export HF_TOKEN="hf_..."
|
evals/albanian/albanian.yaml
CHANGED
|
@@ -9,13 +9,11 @@
|
|
| 9 |
#
|
| 10 |
# Metrics:
|
| 11 |
# classification & mcq → f1_macro (per sub-group)
|
| 12 |
-
# summarization → rouge_l (per sub-group)
|
| 13 |
# open_generation → llm_judge_score / open_quality_score (per sub-group)
|
| 14 |
group: albanian
|
| 15 |
task:
|
| 16 |
- albanian_classification
|
| 17 |
- albanian_mcq
|
| 18 |
-
- albanian_summarization
|
| 19 |
- albanian_open_generation
|
| 20 |
metadata:
|
| 21 |
version: 1.0
|
|
|
|
| 9 |
#
|
| 10 |
# Metrics:
|
| 11 |
# classification & mcq → f1_macro (per sub-group)
|
|
|
|
| 12 |
# open_generation → llm_judge_score / open_quality_score (per sub-group)
|
| 13 |
group: albanian
|
| 14 |
task:
|
| 15 |
- albanian_classification
|
| 16 |
- albanian_mcq
|
|
|
|
| 17 |
- albanian_open_generation
|
| 18 |
metadata:
|
| 19 |
version: 1.0
|
evals/albanian/summarization/_default_summarization_yaml
DELETED
|
@@ -1,18 +0,0 @@
|
|
| 1 |
-
# Shared config for Albanian summarization tasks (MassiveSumm).
|
| 2 |
-
# The Albanian-only filter is applied in process_docs (the upstream
|
| 3 |
-
# datasets are highly multilingual single-table dumps, no per-language
|
| 4 |
-
# config). Scoring is sentence-level ROUGE-L F1.
|
| 5 |
-
output_type: generate_until
|
| 6 |
-
test_split: train
|
| 7 |
-
generation_kwargs:
|
| 8 |
-
do_sample: false
|
| 9 |
-
max_gen_toks: 8192
|
| 10 |
-
until:
|
| 11 |
-
- "<|endoftext|>"
|
| 12 |
-
process_results: !function utils.process_results
|
| 13 |
-
metric_list:
|
| 14 |
-
- metric: rouge_l
|
| 15 |
-
aggregation: !function utils.rouge_l_agg
|
| 16 |
-
higher_is_better: true
|
| 17 |
-
metadata:
|
| 18 |
-
version: 1.0
|
evals/albanian/summarization/albanian_massivesumm_long.yaml
DELETED
|
@@ -1,14 +0,0 @@
|
|
| 1 |
-
task: albanian_massivesumm_long
|
| 2 |
-
task_alias: massivesumm_long
|
| 3 |
-
# Gated dataset: accept terms at
|
| 4 |
-
# https://huggingface.co/datasets/MaLA-LM/MassiveSumm_long and export HF_TOKEN.
|
| 5 |
-
dataset_path: MaLA-LM/MassiveSumm_long
|
| 6 |
-
include: _default_summarization_yaml
|
| 7 |
-
process_docs: !function utils.process_docs
|
| 8 |
-
generation_kwargs:
|
| 9 |
-
do_sample: false
|
| 10 |
-
max_gen_toks: 8192
|
| 11 |
-
until:
|
| 12 |
-
- "<|endoftext|>"
|
| 13 |
-
doc_to_text: "Ti je një sistem përmbledhjeje lajmesh.\nPërmblidh artikullin e mëposhtëm në një paragraf të shkurtër (3-5 fjali) në gjuhën shqipe. Mos shto komente.\n\nArtikulli:\n{{text}}\n\nPërmbledhja:"
|
| 14 |
-
doc_to_target: "{{summary}}"
|
evals/albanian/summarization/albanian_massivesumm_short.yaml
DELETED
|
@@ -1,9 +0,0 @@
|
|
| 1 |
-
task: albanian_massivesumm_short
|
| 2 |
-
task_alias: massivesumm_short
|
| 3 |
-
# Gated dataset: accept terms at
|
| 4 |
-
# https://huggingface.co/datasets/MaLA-LM/MassiveSumm_short and export HF_TOKEN.
|
| 5 |
-
dataset_path: MaLA-LM/MassiveSumm_short
|
| 6 |
-
include: _default_summarization_yaml
|
| 7 |
-
process_docs: !function utils.process_docs
|
| 8 |
-
doc_to_text: "Ti je një sistem përmbledhjeje lajmesh.\nPërmblidh artikullin e mëposhtëm në një ose dy fjali të shkurtra në gjuhën shqipe. Mos shto komente.\n\nArtikulli:\n{{text}}\n\nPërmbledhja:"
|
| 9 |
-
doc_to_target: "{{summary}}"
|
evals/albanian/summarization/albanian_summarization.yaml
DELETED
|
@@ -1,11 +0,0 @@
|
|
| 1 |
-
# Summarization subgroup (MassiveSumm short + long)
|
| 2 |
-
group: albanian_summarization
|
| 3 |
-
task:
|
| 4 |
-
- albanian_massivesumm_short
|
| 5 |
-
- albanian_massivesumm_long
|
| 6 |
-
aggregate_metric_list:
|
| 7 |
-
- metric: rouge_l
|
| 8 |
-
aggregation: mean
|
| 9 |
-
weight_by_size: true
|
| 10 |
-
metadata:
|
| 11 |
-
version: 1.0
|
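Note on `weight_by_size: true` in the deleted group config: a minimal sketch of a size-weighted mean over subtask scores, assuming the harness weights each subtask by its number of evaluated documents. The function name `aggregate_subtasks` and the sample numbers are hypothetical and are not taken from the repository or from lm-evaluation-harness.

```python
def aggregate_subtasks(scores, sizes, weight_by_size=True):
    """Combine per-subtask scores into one group score.

    With weight_by_size=True each score is weighted by its subtask size
    (a pooled mean); otherwise every subtask counts equally.
    """
    if not scores or len(scores) != len(sizes):
        raise ValueError("scores and sizes must be non-empty and the same length")
    weights = sizes if weight_by_size else [1] * len(scores)
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical numbers: a short subset with 100 docs and a long subset with 300.
group_score = aggregate_subtasks([0.2, 0.4], [100, 300])
```

With size weighting the larger subset dominates the group score, which is why the weighted and unweighted results can differ noticeably when subsets are unbalanced.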
evals/albanian/summarization/utils.py
DELETED
|
@@ -1,111 +0,0 @@
|
|
| 1 |
-
"""Utility helpers for Albanian summarization tasks (MassiveSumm short/long).
|
| 2 |
-
|
| 3 |
-
Both MassiveSumm subsets are highly multilingual single-table datasets
|
| 4 |
-
(one ``train`` split, no language configs). We filter to Albanian rows
|
| 5 |
-
inside ``process_docs``. The HF dataset is **gated** — accept the terms
|
| 6 |
-
on the dataset page once and export ``HF_TOKEN`` before running.
|
| 7 |
-
|
| 8 |
-
Scoring uses ROUGE-L F1 via the ``rouge_score`` package, which is
|
| 9 |
-
already a transitive dependency of lm-evaluation-harness.
|
| 10 |
-
"""
|
| 11 |
-
|
| 12 |
-
from __future__ import annotations
|
| 13 |
-
|
| 14 |
-
import re
|
| 15 |
-
import string
|
| 16 |
-
|
| 17 |
-
import datasets
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
_ALBANIAN_LANG_CODES = {"sqi", "als", "aln"}
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
def _strip_think_tags(text: str) -> str:
|
| 24 |
-
"""Strip <think>...</think> reasoning wrapper (e.g. Qwen thinking models)."""
|
| 25 |
-
if "</think>" in text:
|
| 26 |
-
return text.split("</think>")[-1].strip()
|
| 27 |
-
return text
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
def _filter_albanian(dataset: datasets.Dataset) -> datasets.Dataset:
|
| 31 |
-
"""Keep rows whose ``language`` field is one of Albanian variants."""
|
| 32 |
-
if "language" not in dataset.column_names:
|
| 33 |
-
return dataset
|
| 34 |
-
return dataset.filter(lambda row: str(row.get("language", "")).lower() in _ALBANIAN_LANG_CODES)
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
def _normalise_doc(doc):
|
| 38 |
-
"""Project the columns we actually need."""
|
| 39 |
-
text = (doc.get("text") or "").strip()
|
| 40 |
-
summary = (doc.get("summary") or "").strip()
|
| 41 |
-
title = (doc.get("title") or "").strip()
|
| 42 |
-
return {
|
| 43 |
-
"text": text,
|
| 44 |
-
"summary": summary,
|
| 45 |
-
"title": title,
|
| 46 |
-
}
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
|
| 50 |
-
filtered = _filter_albanian(dataset)
|
| 51 |
-
return filtered.map(_normalise_doc, remove_columns=[
|
| 52 |
-
c for c in filtered.column_names if c not in ("text", "summary", "title")
|
| 53 |
-
])
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
# ── ROUGE-L scoring ──────────────────────────────────────────────────
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
|
| 60 |
-
_WHITESPACE_RE = re.compile(r"\s+")
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
def _normalise(text: str) -> str:
|
| 64 |
-
text = text.translate(_PUNCT_TABLE)
|
| 65 |
-
text = _WHITESPACE_RE.sub(" ", text)
|
| 66 |
-
return text.strip().lower()
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
def _lcs_length(a, b):
|
| 70 |
-
m, n = len(a), len(b)
|
| 71 |
-
if m == 0 or n == 0:
|
| 72 |
-
return 0
|
| 73 |
-
dp = [0] * (n + 1)
|
| 74 |
-
for i in range(1, m + 1):
|
| 75 |
-
prev = 0
|
| 76 |
-
for j in range(1, n + 1):
|
| 77 |
-
tmp = dp[j]
|
| 78 |
-
if a[i - 1] == b[j - 1]:
|
| 79 |
-
dp[j] = prev + 1
|
| 80 |
-
else:
|
| 81 |
-
dp[j] = max(dp[j], dp[j - 1])
|
| 82 |
-
prev = tmp
|
| 83 |
-
return dp[n]
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
def _rouge_l_f1(pred: str, gold: str) -> float:
|
| 87 |
-
"""Compute sentence-level ROUGE-L F1 (no stemming) between pred and gold."""
|
| 88 |
-
pred_tokens = _normalise(pred).split()
|
| 89 |
-
gold_tokens = _normalise(gold).split()
|
| 90 |
-
if not pred_tokens or not gold_tokens:
|
| 91 |
-
return 0.0
|
| 92 |
-
lcs = _lcs_length(pred_tokens, gold_tokens)
|
| 93 |
-
if lcs == 0:
|
| 94 |
-
return 0.0
|
| 95 |
-
precision = lcs / len(pred_tokens)
|
| 96 |
-
recall = lcs / len(gold_tokens)
|
| 97 |
-
return 2 * precision * recall / (precision + recall)
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
def process_results(doc, results):
|
| 101 |
-
raw_response = results[0].strip() if results and results[0] else ""
|
| 102 |
-
pred = _strip_think_tags(raw_response)
|
| 103 |
-
gold = (doc.get("summary") or "").strip()
|
| 104 |
-
return {"rouge_l": (gold, pred)}
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
def rouge_l_agg(items):
|
| 108 |
-
if not items:
|
| 109 |
-
return 0.0
|
| 110 |
-
scores = [_rouge_l_f1(pred, gold) for gold, pred in items]
|
| 111 |
-
return sum(scores) / len(scores)
|
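Note on the deleted scorer: the module above computes sentence-level ROUGE-L F1 by hand from a longest-common-subsequence (LCS) length. The following is a compact, self-contained sketch of the same idea for readers who want to reproduce the metric elsewhere. The names `tokenize`, `lcs_length`, and `rouge_l_f1` are illustrative and are not the repository's identifiers.

```python
import re
import string

# ASCII punctuation only; non-ASCII punctuation is left in place.
_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(text: str) -> list[str]:
    """Drop ASCII punctuation, collapse whitespace, lowercase, split."""
    return re.sub(r"\s+", " ", text.translate(_PUNCT)).strip().lower().split()


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, keeping one DP row at a time."""
    prev_row = [0] * (len(b) + 1)
    for tok_a in a:
        row = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                row.append(prev_row[j - 1] + 1)
            else:
                row.append(max(prev_row[j], row[j - 1]))
        prev_row = row
    return prev_row[-1]


def rouge_l_f1(pred: str, gold: str) -> float:
    """Harmonic mean of LCS precision and recall; 0.0 when there is no overlap."""
    pred_tokens, gold_tokens = tokenize(pred), tokenize(gold)
    lcs = lcs_length(pred_tokens, gold_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Five of six tokens are shared in order, so precision = recall = 5/6.
score = rouge_l_f1("The cat sat on the mat.", "the cat is on the mat")
```

No stemming is applied, matching the docstring of the deleted `_rouge_l_f1`.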
evals/eval_config.toml
CHANGED
|
@@ -29,6 +29,17 @@ apply_chat_template = true
|
|
| 29 |
log_samples = true
|
| 30 |
output_path = "output/results"
|
| 31 |
|
| 32 |
# ── Hugging Face Hub ─────────────────────────────────────────────────
|
| 33 |
# Token: export HF_TOKEN="hf_..." (https://huggingface.co/settings/tokens)
|
| 34 |
#
|
|
@@ -161,9 +172,6 @@ name = "albanian_classification"
|
|
| 161 |
[[tasks]]
|
| 162 |
name = "albanian_mcq"
|
| 163 |
|
| 164 |
-
[[tasks]]
|
| 165 |
-
name = "albanian_summarization"
|
| 166 |
-
|
| 167 |
[[tasks]]
|
| 168 |
name = "albanian_open_generation"
|
| 169 |
|
|
@@ -174,9 +182,6 @@ name = "portuguese_mcq"
|
|
| 174 |
[[tasks]]
|
| 175 |
name = "portuguese_classification"
|
| 176 |
|
| 177 |
-
[[tasks]]
|
| 178 |
-
name = "portuguese_nli"
|
| 179 |
-
|
| 180 |
# Ukrainian
|
| 181 |
[[tasks]]
|
| 182 |
name = "ukrainian_classification"
|
|
@@ -187,9 +192,6 @@ name = "ukrainian_mcq"
|
|
| 187 |
[[tasks]]
|
| 188 |
name = "ukrainian_qa"
|
| 189 |
|
| 190 |
-
[[tasks]]
|
| 191 |
-
name = "ukrainian_summarization"
|
| 192 |
-
|
| 193 |
[[tasks]]
|
| 194 |
name = "ukrainian_open_generation"
|
| 195 |
|
|
|
|
| 29 |
log_samples = true
|
| 30 |
output_path = "output/results"
|
| 31 |
|
| 32 |
+
# Resilience knobs forwarded to lm_eval's TemplateAPI (local-chat-completions):
|
| 33 |
+
# max_retries → tenacity stop_after_attempt(N) per request.
|
| 34 |
+
# timeout → aiohttp ClientTimeout(total=SEC) per request.
|
| 35 |
+
# Default lm_eval values (3 / 300) are too low for long CoT tasks against an
|
| 36 |
+
# overloaded SGLang / Functionary endpoint — slow tail requests or server
|
| 37 |
+
# hiccups (ServerDisconnectedError, ConnectionReset, Cloudflare 524) exhaust
|
| 38 |
+
# retries and abort the run. Mirrors functionary_internal's working config:
|
| 39 |
+
# https://github.com/MeetKai/functionary_internal/blob/main/functionary_internal/evaluation/multilingual_bench/lm_eval_tasks/eval_config.toml
|
| 40 |
+
max_retries = 5
|
| 41 |
+
timeout = 1800
|
| 42 |
+
|
| 43 |
# ── Hugging Face Hub ─────────────────────────────────────────────────
|
| 44 |
# Token: export HF_TOKEN="hf_..." (https://huggingface.co/settings/tokens)
|
| 45 |
#
|
|
|
|
| 172 |
[[tasks]]
|
| 173 |
name = "albanian_mcq"
|
| 174 |
|
| 175 |
[[tasks]]
|
| 176 |
name = "albanian_open_generation"
|
| 177 |
|
|
|
|
| 182 |
[[tasks]]
|
| 183 |
name = "portuguese_classification"
|
| 184 |
|
| 185 |
# Ukrainian
|
| 186 |
[[tasks]]
|
| 187 |
name = "ukrainian_classification"
|
|
|
|
| 192 |
[[tasks]]
|
| 193 |
name = "ukrainian_qa"
|
| 194 |
|
| 195 |
[[tasks]]
|
| 196 |
name = "ukrainian_open_generation"
|
| 197 |
|
evals/portuguese/README.md
CHANGED
|
@@ -12,17 +12,13 @@ Portuguese (PT-BR) evaluation suite for the `lm-evaluation-harness` framework.
|
|
| 12 |
| 4 | `portuguese_hatebr` | Classification | `eduagarcia/portuguese_benchmark` (HateBR) | f1_macro |
|
| 13 |
| 5 | `portuguese_hate_speech` | Classification | `eduagarcia/portuguese_benchmark` (Hate Speech) | f1_macro |
|
| 14 |
| 6 | `portuguese_tweetsentbr` | Classification | `eduagarcia/tweetsentbr_fewshot` | f1_macro |
|
| 15 |
-
| 7 | `portuguese_assin2_rte` | NLI | `assin2` | f1_macro |
|
| 16 |
-
| 8 | `portuguese_faquad_nli` | NLI | `ruanchaves/faquad-nli` | f1_macro |
|
| 17 |
-
| 9 | `portuguese_assin2_sts` | NLI | `assin2` | pearson |
|
| 18 |
|
| 19 |
### Subgroups
|
| 20 |
|
| 21 |
-
| Group | Tasks
|
| 22 |
-
| --------------------------- | --------------------------------
|
| 23 |
-
| `portuguese_mcq` | enem, bluex, oab_exams
|
| 24 |
-
| `portuguese_classification` | hatebr, hate_speech, tweetsentbr
|
| 25 |
-
| `portuguese_nli` | assin2_rte, faquad_nli, assin2_sts |
|
| 26 |
|
| 27 |
## Setup
|
| 28 |
|
|
@@ -39,7 +35,7 @@ cd functionary_internal/evaluation/multilingual_bench/lm_eval_tasks
|
|
| 39 |
export INCLUDE_PATH="$(pwd)"
|
| 40 |
```
|
| 41 |
|
| 42 |
-
### Run the Entire Portuguese (all
|
| 43 |
|
| 44 |
```bash
|
| 45 |
OPENAI_API_KEY="your-key" \
|
|
@@ -63,9 +59,6 @@ lm_eval --include_path $INCLUDE_PATH --tasks portuguese_mcq ...
|
|
| 63 |
|
| 64 |
# Classification (hate speech, sentiment)
|
| 65 |
lm_eval --include_path $INCLUDE_PATH --tasks portuguese_classification ...
|
| 66 |
-
|
| 67 |
-
# Natural Language Inference (ASSIN2 RTE + FaQuAD NLI + ASSIN2 STS)
|
| 68 |
-
lm_eval --include_path $INCLUDE_PATH --tasks portuguese_nli ...
|
| 69 |
```
|
| 70 |
|
| 71 |
### Run a Single Task
|
|
@@ -85,9 +78,6 @@ lm_eval --include_path $INCLUDE_PATH --tasks portuguese_hatebr ...
|
|
| 85 |
|
| 86 |
# Sentiment analysis
|
| 87 |
lm_eval --include_path $INCLUDE_PATH --tasks portuguese_tweetsentbr ...
|
| 88 |
-
|
| 89 |
-
# Textual entailment
|
| 90 |
-
lm_eval --include_path $INCLUDE_PATH --tasks portuguese_assin2_rte ...
|
| 91 |
```
|
| 92 |
|
| 93 |
### Run with a Local HuggingFace Model
|
|
@@ -106,8 +96,8 @@ lm_eval \
|
|
| 106 |
### Mix and Match
|
| 107 |
|
| 108 |
```bash
|
| 109 |
-
# Run ENEM +
|
| 110 |
-
lm_eval --include_path $INCLUDE_PATH --tasks portuguese_enem,
|
| 111 |
```
|
| 112 |
|
| 113 |
## Output
|
|
@@ -119,13 +109,11 @@ With `--log_samples`, the output directory contains:
|
|
| 119 |
|
| 120 |
## Dataset Sources
|
| 121 |
|
| 122 |
-
| Dataset | Source | Config | Fields
|
| 123 |
-
| ---------------------- | -------------------------------------- | ------------------------------- | ----------------------------
|
| 124 |
-
| ENEM | `eduagarcia/enem_challenge` | — | question, choices, answerKey
|
| 125 |
-
| BLUEX | `eduagarcia-temp/BLUEX_without_images` | — | question, choices, answerKey
|
| 126 |
-
| OAB Exams | `eduagarcia/oab_exams` | — | question, choices, answerKey
|
| 127 |
-
| HateBR | `eduagarcia/portuguese_benchmark` | `HateBR_offensive_binary` | sentence, label
|
| 128 |
-
| Portuguese Hate Speech | `eduagarcia/portuguese_benchmark` | `Portuguese_Hate_Speech_binary` | sentence, label
|
| 129 |
-
| TweetSentBR | `eduagarcia/tweetsentbr_fewshot` | — | sentence, label
|
| 130 |
-
| ASSIN2 | `assin2` | — | premise, hypothesis, entailment_judgment, relatedness_score |
|
| 131 |
-
| FaQuAD-NLI | `ruanchaves/faquad-nli` | — | question, answer, label |
|
|
|
|
| 12 |
| 4 | `portuguese_hatebr` | Classification | `eduagarcia/portuguese_benchmark` (HateBR) | f1_macro |
|
| 13 |
| 5 | `portuguese_hate_speech` | Classification | `eduagarcia/portuguese_benchmark` (Hate Speech) | f1_macro |
|
| 14 |
| 6 | `portuguese_tweetsentbr` | Classification | `eduagarcia/tweetsentbr_fewshot` | f1_macro |
|
| 15 |
|
| 16 |
### Subgroups
|
| 17 |
|
| 18 |
+
| Group | Tasks |
|
| 19 |
+
| --------------------------- | -------------------------------- |
|
| 20 |
+
| `portuguese_mcq` | enem, bluex, oab_exams |
|
| 21 |
+
| `portuguese_classification` | hatebr, hate_speech, tweetsentbr |
|
|
|
|
| 22 |
|
| 23 |
## Setup
|
| 24 |
|
|
|
|
| 35 |
export INCLUDE_PATH="$(pwd)"
|
| 36 |
```
|
| 37 |
|
| 38 |
+
### Run the Entire Portuguese Suite (all 6 tasks)
|
| 39 |
|
| 40 |
```bash
|
| 41 |
OPENAI_API_KEY="your-key" \
|
|
|
|
| 59 |
|
| 60 |
# Classification (hate speech, sentiment)
|
| 61 |
lm_eval --include_path $INCLUDE_PATH --tasks portuguese_classification ...
|
| 62 |
```
|
| 63 |
|
| 64 |
### Run a Single Task
|
|
|
|
| 78 |
|
| 79 |
# Sentiment analysis
|
| 80 |
lm_eval --include_path $INCLUDE_PATH --tasks portuguese_tweetsentbr ...
|
| 81 |
```
|
| 82 |
|
| 83 |
### Run with a Local HuggingFace Model
|
|
|
|
| 96 |
### Mix and Match
|
| 97 |
|
| 98 |
```bash
|
| 99 |
+
# Run ENEM + HateBR only
|
| 100 |
+
lm_eval --include_path $INCLUDE_PATH --tasks portuguese_enem,portuguese_hatebr ...
|
| 101 |
```
|
| 102 |
|
| 103 |
## Output
|
|
|
|
| 109 |
|
| 110 |
## Dataset Sources
|
| 111 |
|
| 112 |
+
| Dataset | Source | Config | Fields |
|
| 113 |
+
| ---------------------- | -------------------------------------- | ------------------------------- | ---------------------------- |
|
| 114 |
+
| ENEM | `eduagarcia/enem_challenge` | — | question, choices, answerKey |
|
| 115 |
+
| BLUEX | `eduagarcia-temp/BLUEX_without_images` | — | question, choices, answerKey |
|
| 116 |
+
| OAB Exams | `eduagarcia/oab_exams` | — | question, choices, answerKey |
|
| 117 |
+
| HateBR | `eduagarcia/portuguese_benchmark` | `HateBR_offensive_binary` | sentence, label |
|
| 118 |
+
| Portuguese Hate Speech | `eduagarcia/portuguese_benchmark` | `Portuguese_Hate_Speech_binary` | sentence, label |
|
| 119 |
+
| TweetSentBR | `eduagarcia/tweetsentbr_fewshot` | — | sentence, label |
|
evals/portuguese/nli/portuguese_assin2_rte.yaml
DELETED
|
@@ -1,28 +0,0 @@
|
|
| 1 |
-
task: portuguese_assin2_rte
|
| 2 |
-
task_alias: assin2_rte
|
| 3 |
-
dataset_path: assin2
|
| 4 |
-
test_split: test
|
| 5 |
-
output_type: generate_until
|
| 6 |
-
generation_kwargs:
|
| 7 |
-
do_sample: false
|
| 8 |
-
max_gen_toks: 8192
|
| 9 |
-
until:
|
| 10 |
-
- "<|endoftext|>"
|
| 11 |
-
process_docs: !function utils.process_assin2_rte_docs
|
| 12 |
-
doc_to_text: "Indique se a hipótese pode ser inferida a partir da premissa. Responda apenas com \"Sim\" ou \"Não\".\n\nPremissa: {{premise}}\nHipótese: {{hypothesis}}\nPergunta: A hipótese pode ser inferida pela premissa? Sim ou Não?\nResposta:"
|
| 13 |
-
doc_to_target: "{{target}}"
|
| 14 |
-
filter_list:
|
| 15 |
-
- name: "get_label"
|
| 16 |
-
filter:
|
| 17 |
-
- function: "strip_think_recover"
|
| 18 |
-
- function: "regex"
|
| 19 |
-
regex_pattern: "(Sim|Não)"
|
| 20 |
-
group_select: 0
|
| 21 |
-
- function: "take_first"
|
| 22 |
-
process_results: !function utils.process_nli_results
|
| 23 |
-
metric_list:
|
| 24 |
-
- metric: f1_macro
|
| 25 |
-
aggregation: !function utils.macro_f1_agg
|
| 26 |
-
higher_is_better: true
|
| 27 |
-
metadata:
|
| 28 |
-
version: 1.0
|
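Note on the deleted RTE task: its filter chain strips any reasoning wrapper, takes the first `Sim`/`Não` match, and scores with macro F1. The sketch below illustrates that pipeline under two assumptions that are not confirmed by the diff: that the reasoning wrapper ends with `</think>`, and that macro F1 averages over the labels present in the gold data. `extract_label` and `macro_f1` are illustrative names, not the repository's `strip_think_recover` or `macro_f1_agg`.

```python
import re
from collections import Counter

LABEL_RE = re.compile(r"(Sim|Não)")  # case-sensitive, as in the task config


def extract_label(response: str) -> str:
    """Return the first Sim/Não after any </think> wrapper, or "" if none."""
    if "</think>" in response:
        response = response.split("</think>")[-1]
    match = LABEL_RE.search(response)
    return match.group(1) if match else ""


def macro_f1(pairs) -> float:
    """Unweighted mean of per-class F1 over gold labels; pairs are (pred, gold)."""
    pairs = list(pairs)
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, gold in pairs:
        if pred == gold:
            tp[gold] += 1
        else:
            fp[pred] += 1
            fn[gold] += 1
    labels = sorted({gold for _, gold in pairs})
    if not labels:
        return 0.0
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


label = extract_label("<think>Talvez Não...</think> Sim")
```

An unparseable response becomes the empty label, so it counts as a miss for the gold class without adding a class of its own to the average.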
evals/portuguese/nli/portuguese_assin2_sts.yaml
DELETED
|
@@ -1,22 +0,0 @@
|
|
| 1 |
-
task: portuguese_assin2_sts
|
| 2 |
-
task_alias: assin2_sts
|
| 3 |
-
dataset_path: assin2
|
| 4 |
-
test_split: test
|
| 5 |
-
output_type: generate_until
|
| 6 |
-
generation_kwargs:
|
| 7 |
-
do_sample: false
|
| 8 |
-
max_gen_toks: 8192
|
| 9 |
-
until:
|
| 10 |
-
- "<|endoftext|>"
|
| 11 |
-
doc_to_text: "Avalie o grau de similaridade entre as duas frases abaixo. Dê uma pontuação entre 1,0 e 5,0 (1,0 = pouco similar, 5,0 = muito similar).\n\nFrase 1: {{premise}}\nFrase 2: {{hypothesis}}\nPergunta: Quão similares são as duas frases? Dê uma pontuação entre 1,0 a 5,0.\nResposta:"
|
| 12 |
-
doc_to_target: !function utils.assin2_float_to_pt_str
|
| 13 |
-
process_results: !function utils.process_sts_results
|
| 14 |
-
metric_list:
|
| 15 |
-
- metric: pearson
|
| 16 |
-
aggregation: !function utils.pearson_agg
|
| 17 |
-
higher_is_better: true
|
| 18 |
-
- metric: mse
|
| 19 |
-
aggregation: !function utils.mse_agg
|
| 20 |
-
higher_is_better: false
|
| 21 |
-
metadata:
|
| 22 |
-
version: 1.0
|
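Note on the deleted STS task: the prompt asks for a score written with a decimal comma (for example `3,5`), so the target has to be formatted that way and the model's answer parsed back into a float. A minimal round-trip sketch follows. `to_pt_decimal` and `parse_score` are hypothetical names, and unlike the deleted helper, which falls back to 5.0, this sketch returns `None` when no number is found.

```python
import re


def to_pt_decimal(score: float) -> str:
    """Format a score with one decimal place and a decimal comma."""
    return f"{score:.1f}".replace(".", ",")


def parse_score(text: str, low: float = 1.0, high: float = 5.0):
    """Parse the first number in text (comma or dot decimal), clipped to [low, high].

    Returns None when the text contains no number.
    """
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        return None
    value = float(match.group(0).replace(",", "."))
    return max(low, min(high, value))


target = to_pt_decimal(3.5)
prediction = parse_score("Resposta: 4,2")
```

Returning `None` lets the caller decide how to score unparseable answers, whereas a fixed default silently pulls Pearson and MSE toward one end of the scale.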
evals/portuguese/nli/portuguese_faquad_nli.yaml
DELETED
|
@@ -1,28 +0,0 @@
|
|
| 1 |
-
task: portuguese_faquad_nli
|
| 2 |
-
task_alias: faquad_nli
|
| 3 |
-
dataset_path: ruanchaves/faquad-nli
|
| 4 |
-
test_split: test
|
| 5 |
-
output_type: generate_until
|
| 6 |
-
generation_kwargs:
|
| 7 |
-
do_sample: false
|
| 8 |
-
max_gen_toks: 8192
|
| 9 |
-
until:
|
| 10 |
-
- "<|endoftext|>"
|
| 11 |
-
process_docs: !function utils.process_faquad_nli_docs
|
| 12 |
-
doc_to_text: "Julgue se a resposta satisfaz à pergunta de maneira satisfatória. Responda apenas com \"Sim\" ou \"Não\".\n\nPergunta: {{question}}\nResposta: {{answer}}\nA resposta dada satisfaz à pergunta? Sim ou Não?"
|
| 13 |
-
doc_to_target: "{{target}}"
|
| 14 |
-
filter_list:
|
| 15 |
-
- name: "get_label"
|
| 16 |
-
filter:
|
| 17 |
-
- function: "strip_think_recover"
|
| 18 |
-
- function: "regex"
|
| 19 |
-
regex_pattern: "(Sim|Não)"
|
| 20 |
-
group_select: 0
|
| 21 |
-
- function: "take_first"
|
| 22 |
-
process_results: !function utils.process_nli_results
|
| 23 |
-
metric_list:
|
| 24 |
-
- metric: f1_macro
|
| 25 |
-
aggregation: !function utils.macro_f1_agg
|
| 26 |
-
higher_is_better: true
|
| 27 |
-
metadata:
|
| 28 |
-
version: 1.0
|
evals/portuguese/nli/portuguese_nli.yaml
DELETED
|
@@ -1,18 +0,0 @@
|
|
| 1 |
-
# Natural Language Inference subgroup (ASSIN2 RTE + FaQuAD NLI + ASSIN2 STS)
|
| 2 |
-
group: portuguese_nli
|
| 3 |
-
task:
|
| 4 |
-
- portuguese_assin2_rte
|
| 5 |
-
- portuguese_faquad_nli
|
| 6 |
-
- portuguese_assin2_sts
|
| 7 |
-
aggregate_metric_list:
|
| 8 |
-
- metric: f1_macro
|
| 9 |
-
aggregation: mean
|
| 10 |
-
weight_by_size: true
|
| 11 |
-
- metric: pearson
|
| 12 |
-
aggregation: mean
|
| 13 |
-
weight_by_size: true
|
| 14 |
-
- metric: mse
|
| 15 |
-
aggregation: mean
|
| 16 |
-
weight_by_size: true
|
| 17 |
-
metadata:
|
| 18 |
-
version: 1.0
|
evals/portuguese/nli/utils.py
DELETED
|
@@ -1,102 +0,0 @@
|
|
| 1 |
-
"""Utility helpers for Portuguese NLI / STS tasks (generative mode).
|
| 2 |
-
|
| 3 |
-
Covers ASSIN2 (RTE & STS) and FaQuAD-NLI.
|
| 4 |
-
"""
|
| 5 |
-
|
| 6 |
-
import os as _os, sys as _sys # noqa: E401
|
| 7 |
-
_sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..","..",)))
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
import math
|
| 11 |
-
import re
|
| 12 |
-
|
| 13 |
-
from f1_utils import macro_f1_agg, process_results_f1 # noqa: F401
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
# ── Document pre-processing ─────────────────────────────────────────
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
def process_assin2_rte_docs(dataset):
|
| 20 |
-
"""Map entailment_judgment 0/1 → Não/Sim."""
|
| 21 |
-
|
| 22 |
-
def _map(doc):
|
| 23 |
-
doc["target"] = "Sim" if doc["entailment_judgment"] == 1 else "Não"
|
| 24 |
-
return doc
|
| 25 |
-
|
| 26 |
-
return dataset.map(_map)
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
def process_faquad_nli_docs(dataset):
|
| 30 |
-
"""Map label 0/1 → Não/Sim."""
|
| 31 |
-
|
| 32 |
-
def _map(doc):
|
| 33 |
-
doc["target"] = "Sim" if doc["label"] == 1 else "Não"
|
| 34 |
-
return doc
|
| 35 |
-
|
| 36 |
-
return dataset.map(_map)
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
# ── NLI result processing ────────────────────────────────────────────
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
def process_nli_results(doc, results):
|
| 43 |
-
"""Return (pred, gold) tuple for macro-F1 aggregation."""
|
| 44 |
-
return process_results_f1(doc, results)
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
# ── STS helpers ──────────────────────────────────────────────────────
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
def assin2_float_to_pt_str(doc):
|
| 51 |
-
"""Format relatedness_score as a Portuguese-style decimal string (comma)."""
|
| 52 |
-
return "{:.1f}".format(doc["relatedness_score"]).replace(".", ",")
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
def _extract_float(text):
|
| 56 |
-
"""Extract a float value from text, handling Portuguese comma notation."""
|
| 57 |
-
text = text.strip()
|
| 58 |
-
# Try to find a number pattern (comma or dot as decimal separator)
|
| 59 |
-
match = re.search(r"(\d+[,.]?\d*)", text)
|
| 60 |
-
if match:
|
| 61 |
-
num_str = match.group(1).replace(",", ".")
|
| 62 |
-
try:
|
| 63 |
-
val = float(num_str)
|
| 64 |
-
return max(1.0, min(5.0, val)) # Clip to [1.0, 5.0]
|
| 65 |
-
except ValueError:
|
| 66 |
-
pass
|
| 67 |
-
return 5.0 # Default fallback (same as original)
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
def process_sts_results(doc, results):
|
| 71 |
-
"""Extract predicted float and pair with gold for pearson/mse."""
|
| 72 |
-
pred_text = results[0].strip() if results[0] else ""
|
| 73 |
-
pred_val = _extract_float(pred_text)
|
| 74 |
-
gold_val = doc["relatedness_score"]
|
| 75 |
-
return {
|
| 76 |
-
"pearson": (pred_val, gold_val),
|
| 77 |
-
"mse": (pred_val, gold_val),
|
| 78 |
-
}
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
def pearson_agg(items):
|
| 82 |
-
"""Compute Pearson correlation coefficient."""
|
| 83 |
-
preds = [item[0] for item in items]
|
| 84 |
-
golds = [item[1] for item in items]
|
| 85 |
-
n = len(preds)
|
| 86 |
-
if n < 2:
|
| 87 |
-
return 0.0
|
| 88 |
-
mean_p = sum(preds) / n
|
| 89 |
-
mean_g = sum(golds) / n
|
| 90 |
-
cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(preds, golds)) / n
|
| 91 |
-
std_p = math.sqrt(sum((p - mean_p) ** 2 for p in preds) / n)
|
| 92 |
-
std_g = math.sqrt(sum((g - mean_g) ** 2 for g in golds) / n)
|
| 93 |
-
if std_p * std_g == 0:
|
| 94 |
-
return 0.0
|
| 95 |
-
return cov / (std_p * std_g)
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
def mse_agg(items):
|
| 99 |
-
"""Compute mean squared error."""
|
| 100 |
-
preds = [item[0] for item in items]
|
| 101 |
-
golds = [item[1] for item in items]
|
| 102 |
-
return sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(preds)
|
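To make the behaviour of the deleted STS parsing helper concrete, the sketch below re-implements the same regex-and-clip logic in standalone form. The function name `extract_score` and the sample replies are illustrative assumptions; the regex, the comma-to-dot replacement, the clipping to [1.0, 5.0] and the 5.0 fallback follow the `_extract_float` listing above.

```python
import re


def extract_score(text):
    """Standalone copy of the parsing logic: first number in text, clipped to [1.0, 5.0]."""
    match = re.search(r"(\d+[,.]?\d*)", text.strip())
    if match:
        # Portuguese decimals use a comma, so normalise it before calling float().
        value = float(match.group(1).replace(",", "."))
        return max(1.0, min(5.0, value))
    return 5.0  # same fallback as the listing when no number is found


# Invented model replies, for illustration only.
for reply in ["4,5", "Nota: 3.2", "10", "sem resposta"]:
    print(repr(reply), "->", extract_score(reply))
```

Because of the fallback, a reply with no number in it is scored as if the model had predicted 5.0, which is worth keeping in mind when reading the `pearson` and `mse` figures.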
evals/portuguese/portuguese.yaml
CHANGED
|
@@ -8,11 +8,9 @@
|
|
| 8 |
# Sub-groups:
|
| 9 |
# mcq → exam multiple-choice (enem, bluex, oab_exams)
|
| 10 |
# classification → sentiment / hate / emotion classification
|
| 11 |
-
# nli → natural language inference & textual similarity
|
| 12 |
group: portuguese
|
| 13 |
task:
|
| 14 |
- portuguese_mcq
|
| 15 |
- portuguese_classification
|
| 16 |
-
- portuguese_nli
|
| 17 |
metadata:
|
| 18 |
version: 1.0
|
|
|
|
| 8 |
# Sub-groups:
|
| 9 |
# mcq → exam multiple-choice (enem, bluex, oab_exams)
|
| 10 |
# classification → sentiment / hate / emotion classification
|
|
|
|
| 11 |
group: portuguese
|
| 12 |
task:
|
| 13 |
- portuguese_mcq
|
| 14 |
- portuguese_classification
|
|
|
|
| 15 |
metadata:
|
| 16 |
version: 1.0
|
evals/run_eval.py
CHANGED
|
@@ -456,10 +456,21 @@ def run_single_eval(
|
|
| 456 |
gen_kwargs: str | None = None,
|
| 457 |
hf_hub: dict | None = None,
|
| 458 |
endpoint_kind: str = "chat_completions",
|
| 459 |
) -> dict | None:
|
| 460 |
-
"""Run a single lm_eval evaluation via the Python API and return results.
|
| 461 |
_ensure_lm_eval_api_key()
|
| 462 |
-
model_args =
|
| 463 |
|
| 464 |
tracker = _build_evaluation_tracker(output_path, hf_hub)
|
| 465 |
|
|
@@ -957,6 +968,11 @@ def main():
|
|
| 957 |
if args.num_concurrent is not None
|
| 958 |
else defaults.get("num_concurrent", 5)
|
| 959 |
)
|
| 960 |
|
| 961 |
eval_kwargs = dict(
|
| 962 |
include_path=SCRIPT_DIR,
|
|
@@ -976,6 +992,8 @@ def main():
|
|
| 976 |
gen_kwargs=gen_kwargs,
|
| 977 |
hf_hub=hf_hub,
|
| 978 |
endpoint_kind=model.get("endpoint_kind", "chat_completions"),
|
| 979 |
)
|
| 980 |
|
| 981 |
header = f"[{run_idx}/{total}] {model['name']} x {task['name']}"
|
|
|
|
| 456 |
gen_kwargs: str | None = None,
|
| 457 |
hf_hub: dict | None = None,
|
| 458 |
endpoint_kind: str = "chat_completions",
|
| 459 |
+
max_retries: int = 3,
|
| 460 |
+
timeout: int = 300,
|
| 461 |
) -> dict | None:
|
| 462 |
+
"""Run a single lm_eval evaluation via the Python API and return results.
|
| 463 |
+
|
| 464 |
+
``max_retries`` and ``timeout`` are forwarded into lm-eval's TemplateAPI
|
| 465 |
+
via ``model_args``. Defaults match lm-eval's own stock values; the TOML
|
| 466 |
+
``[defaults]`` block typically overrides them for Functionary endpoints
|
| 467 |
+
(see eval_config.toml for the rationale).
|
| 468 |
+
"""
|
| 469 |
_ensure_lm_eval_api_key()
|
| 470 |
+
model_args = (
|
| 471 |
+
f"model={model_id},base_url={base_url},num_concurrent={num_concurrent},"
|
| 472 |
+
f"max_retries={max_retries},timeout={timeout}"
|
| 473 |
+
)
|
| 474 |
|
| 475 |
tracker = _build_evaluation_tracker(output_path, hf_hub)
|
| 476 |
|
|
|
|
| 968 |
if args.num_concurrent is not None
|
| 969 |
else defaults.get("num_concurrent", 5)
|
| 970 |
)
|
| 971 |
+
# Resilience knobs read from [defaults]; see eval_config.toml.
|
| 972 |
+
# Stock lm-eval defaults (3 / 300) are kept as fallbacks so the
|
| 973 |
+
# behavior is unchanged when the TOML doesn't override them.
|
| 974 |
+
max_retries = int(defaults.get("max_retries", 3))
|
| 975 |
+
timeout = int(defaults.get("timeout", 300))
|
| 976 |
|
| 977 |
eval_kwargs = dict(
|
| 978 |
include_path=SCRIPT_DIR,
|
|
|
|
| 992 |
gen_kwargs=gen_kwargs,
|
| 993 |
hf_hub=hf_hub,
|
| 994 |
endpoint_kind=model.get("endpoint_kind", "chat_completions"),
|
| 995 |
+
max_retries=max_retries,
|
| 996 |
+
timeout=timeout,
|
| 997 |
)
|
| 998 |
|
| 999 |
header = f"[{run_idx}/{total}] {model['name']} x {task['name']}"
|
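The retry and timeout plumbing in the `run_eval.py` hunks above can be sketched in isolation as follows. This is a minimal illustration, not the repository's code: `build_model_args` is a hypothetical helper name, and the `defaults` dict stands in for the parsed `[defaults]` table of `eval_config.toml`. Only the fallback values (3 and 300) and the layout of the `model_args` string are taken from the diff.

```python
def build_model_args(model_id, base_url, num_concurrent, defaults):
    """Hypothetical helper mirroring the diff: read resilience knobs, then format model_args."""
    # Fall back to the stock values named in the diff when the TOML omits them.
    max_retries = int(defaults.get("max_retries", 3))
    timeout = int(defaults.get("timeout", 300))
    return (
        f"model={model_id},base_url={base_url},num_concurrent={num_concurrent},"
        f"max_retries={max_retries},timeout={timeout}"
    )


# Example values are made up for illustration.
print(build_model_args("demo-model", "http://localhost:8000/v1", 5, {"timeout": 600}))
```

Keeping the fallbacks at the call site means a config file without these keys behaves as it did before the change.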
evals/ukrainian/README.md
CHANGED
|
@@ -7,15 +7,14 @@ Ukrainian (`ukr_Cyrl` / macro `uk`) evaluation suite for the
|
|
| 7 |
|
| 8 |
### Custom Tasks (require `--include_path`)
|
| 9 |
|
| 10 |
-
| # | Task Name
|
| 11 |
-
| --- | -----------------------
|
| 12 |
-
| 1 | `ukrainian_sib200`
|
| 13 |
-
| 2 | `ukrainian_belebele`
|
| 14 |
-
| 3 | `ukrainian_global_mmlu`
|
| 15 |
-
| 4 | `ukrainian_zno`
|
| 16 |
-
| 5 | `ukrainian_squad`
|
| 17 |
-
| 6 | `
|
| 18 |
-
| 7 | `ukrainian_polywrite` | Open generation | `MaLA-LM/PolyWrite` (filtered `lang_script=ukr_Cyrl`) | open_quality_score |
|
| 19 |
|
| 20 |
#### Subgroups
|
| 21 |
|
|
@@ -24,7 +23,6 @@ Ukrainian (`ukr_Cyrl` / macro `uk`) evaluation suite for the
|
|
| 24 |
| `ukrainian_classification` | sib200 |
|
| 25 |
| `ukrainian_mcq` | belebele, global_mmlu, zno |
|
| 26 |
| `ukrainian_qa` | squad |
|
| 27 |
-
| `ukrainian_summarization` | massivesumm_long |
|
| 28 |
| `ukrainian_open_generation` | polywrite |
|
| 29 |
|
| 30 |
## Setup
|
|
@@ -41,7 +39,7 @@ All commands must be run from the `multilingual_bench/` directory:
|
|
| 41 |
cd /path/to/functionary_internal/evaluation/multilingual_bench
|
| 42 |
```
|
| 43 |
|
| 44 |
-
### Run the Entire Ukrainian Suite (all
|
| 45 |
|
| 46 |
```bash
|
| 47 |
OPENAI_API_KEY="$OPENROUTER_API_KEY" \
|
|
@@ -70,7 +68,6 @@ python run_eval.py --models gpt-5-mini --tasks ukrainian
|
|
| 70 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_classification ...
|
| 71 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_mcq ...
|
| 72 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_qa ...
|
| 73 |
-
lm_eval --include_path lm_eval_tasks --tasks ukrainian_summarization ...
|
| 74 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_open_generation ...
|
| 75 |
```
|
| 76 |
|
|
@@ -82,7 +79,6 @@ lm_eval --include_path lm_eval_tasks --tasks ukrainian_belebele ...
|
|
| 82 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_global_mmlu ...
|
| 83 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_zno ...
|
| 84 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_squad ...
|
| 85 |
-
lm_eval --include_path lm_eval_tasks --tasks ukrainian_massivesumm_long ...
|
| 86 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_polywrite ...
|
| 87 |
```
|
| 88 |
|
|
@@ -95,26 +91,14 @@ With `--log_samples`, the output directory contains:
|
|
| 95 |
|
| 96 |
## Dataset Sources
|
| 97 |
|
| 98 |
-
| Dataset
|
| 99 |
-
| -----------
|
| 100 |
-
| SIB-200
|
| 101 |
-
| Belebele
|
| 102 |
-
| Global-MMLU
|
| 103 |
-
| ZNO
|
| 104 |
-
| UA-SQuAD
|
| 105 |
-
|
|
| 106 |
-
| PolyWrite | `MaLA-LM/PolyWrite` | — (filter `lang_script=ukr_Cyrl`) | `prompt_translated`, `category`, `lang_script` (no reference answer) |
|
| 107 |
-
|
| 108 |
-
### Gated datasets
|
| 109 |
-
|
| 110 |
-
The MassiveSumm dataset is gated on Hugging Face. Accept the terms (once) and export an HF token before running:
|
| 111 |
-
|
| 112 |
-
- MassiveSumm long: <https://huggingface.co/datasets/MaLA-LM/MassiveSumm_long>
|
| 113 |
-
|
| 114 |
-
```bash
|
| 115 |
-
export HF_TOKEN="hf_..."
|
| 116 |
-
huggingface-cli login # one-time, optional if HF_TOKEN is exported
|
| 117 |
-
```
|
| 118 |
|
| 119 |
### LLM-judge tasks
|
| 120 |
|
|
|
|
| 7 |
|
| 8 |
### Custom Tasks (require `--include_path`)
|
| 9 |
|
| 10 |
+
| # | Task Name | Category | Dataset (HuggingFace) | Metric |
|
| 11 |
+
| --- | ----------------------- | ----------------- | -------------------------------------------------------------- | ------------------ |
|
| 12 |
+
| 1 | `ukrainian_sib200` | Classification | `Davlan/sib200` (`ukr_Cyrl`) | f1_macro |
|
| 13 |
+
| 2 | `ukrainian_belebele` | MCQ | `facebook/belebele` (`ukr_Cyrl`) | f1_macro |
|
| 14 |
+
| 3 | `ukrainian_global_mmlu` | MCQ | `CohereLabs/Global-MMLU` (`uk`, filtered to CS+CA, ~400 items) | f1_macro |
|
| 15 |
+
| 4 | `ukrainian_zno` | MCQ (native exam) | `osyvokon/zno` | f1_macro |
|
| 16 |
+
| 5 | `ukrainian_squad` | Extractive QA | `HPLT/ua-squad` (revised UA-SQuAD) | exact_match + f1 |
|
| 17 |
+
| 6 | `ukrainian_polywrite` | Open generation | `MaLA-LM/PolyWrite` (filtered `lang_script=ukr_Cyrl`) | open_quality_score |
|
|
|
|
| 18 |
|
| 19 |
#### Subgroups
|
| 20 |
|
|
|
|
| 23 |
| `ukrainian_classification` | sib200 |
|
| 24 |
| `ukrainian_mcq` | belebele, global_mmlu, zno |
|
| 25 |
| `ukrainian_qa` | squad |
|
|
|
|
| 26 |
| `ukrainian_open_generation` | polywrite |
|
| 27 |
|
| 28 |
## Setup
|
|
|
|
| 39 |
cd /path/to/functionary_internal/evaluation/multilingual_bench
|
| 40 |
```
|
| 41 |
|
| 42 |
+
### Run the Entire Ukrainian Suite (all 6 tasks)
|
| 43 |
|
| 44 |
```bash
|
| 45 |
OPENAI_API_KEY="$OPENROUTER_API_KEY" \
|
|
|
|
| 68 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_classification ...
|
| 69 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_mcq ...
|
| 70 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_qa ...
|
|
|
|
| 71 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_open_generation ...
|
| 72 |
```
|
| 73 |
|
|
|
|
| 79 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_global_mmlu ...
|
| 80 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_zno ...
|
| 81 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_squad ...
|
|
|
|
| 82 |
lm_eval --include_path lm_eval_tasks --tasks ukrainian_polywrite ...
|
| 83 |
```
|
| 84 |
|
|
|
|
| 91 |
|
| 92 |
## Dataset Sources
|
| 93 |
|
| 94 |
+
| Dataset | Source | Config | Notes |
|
| 95 |
+
| ----------- | ------------------------ | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
| 96 |
+
| SIB-200 | `Davlan/sib200` | `ukr_Cyrl` | text + ClassLabel `category` (7 topics) |
|
| 97 |
+
| Belebele | `facebook/belebele` | `ukr_Cyrl` | flores_passage + question + 4 mc_answers, `correct_answer_num` 1-4 |
|
| 98 |
+
| Global-MMLU | `CohereLabs/Global-MMLU` | `uk` | Loaded from the **full** Global-MMLU (Lite doesn't ship `uk`) and filtered in `process_global_mmlu_docs` to `cultural_sensitivity_label ∈ {CS, CA}`, giving ~400 culturally-annotated items — the same subset Lite ships for other languages. Schema: `question`, `option_a..d`, `answer` letter, `subject`, `cultural_sensitivity_label`. |
|
| 99 |
+
| ZNO | `osyvokon/zno` | — | Native Ukrainian high-school exam (Ukrainian language & literature + history of Ukraine). 751 test items (2020-2023). Cyrillic markers А/Б/В/Г/Д, gold in `correct_answers[0]` |
|
| 100 |
+
| UA-SQuAD | `HPLT/ua-squad` | — | Native Ukrainian extractive QA (revised UA-SQuAD). 7,729 rows, ~50/50 `train`/`test` (we use `test`). Standard SQuAD schema: `id`, `context`, `question`, `answers.{text, answer_start}`. MIT license. |
|
| 101 |
+
| PolyWrite | `MaLA-LM/PolyWrite` | — (filter `lang_script=ukr_Cyrl`) | `prompt_translated`, `category`, `lang_script` (no reference answer) |
|
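The Global-MMLU filtering described in the table above can be illustrated with plain Python. This is only a sketch under assumptions: the real `process_global_mmlu_docs` is not shown in this diff and presumably operates on a Hugging Face dataset, whereas `keep_culturally_annotated` below is a hypothetical stand-in that works on a list of dicts. It relies only on the `cultural_sensitivity_label` field and the `CS`/`CA` values named in the Notes column; the sample rows are invented.

```python
KEPT_LABELS = {"CS", "CA"}  # labels named in the Notes column


def keep_culturally_annotated(rows):
    """Hypothetical stand-in: keep rows whose cultural_sensitivity_label is CS or CA."""
    return [row for row in rows if row.get("cultural_sensitivity_label") in KEPT_LABELS]


# Invented sample rows using schema fields listed in the table.
sample = [
    {"question": "q1", "answer": "A", "cultural_sensitivity_label": "CS"},
    {"question": "q2", "answer": "B", "cultural_sensitivity_label": "CA"},
    {"question": "q3", "answer": "C", "cultural_sensitivity_label": None},
]
print([row["question"] for row in keep_culturally_annotated(sample)])
```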
| 102 |
|
| 103 |
### LLM-judge tasks
|
| 104 |
|
evals/ukrainian/summarization/_default_summarization_yaml
DELETED
|
@@ -1,18 +0,0 @@
|
|
| 1 |
-
# Shared config for Ukrainian summarization tasks (MassiveSumm).
|
| 2 |
-
# The Ukrainian-only filter is applied in process_docs (the upstream
|
| 3 |
-
# datasets are highly multilingual single-table dumps, no per-language
|
| 4 |
-
# config). Scoring is sentence-level ROUGE-L F1.
|
| 5 |
-
output_type: generate_until
|
| 6 |
-
test_split: train
|
| 7 |
-
generation_kwargs:
|
| 8 |
-
do_sample: false
|
| 9 |
-
max_gen_toks: 8192
|
| 10 |
-
until:
|
| 11 |
-
- "<|endoftext|>"
|
| 12 |
-
process_results: !function utils.process_results
|
| 13 |
-
metric_list:
|
| 14 |
-
- metric: rouge_l
|
| 15 |
-
aggregation: !function utils.rouge_l_agg
|
| 16 |
-
higher_is_better: true
|
| 17 |
-
metadata:
|
| 18 |
-
version: 1.0
|
evals/ukrainian/summarization/ukrainian_massivesumm_long.yaml
DELETED
|
@@ -1,14 +0,0 @@
|
|
| 1 |
-
task: ukrainian_massivesumm_long
|
| 2 |
-
task_alias: massivesumm_long
|
| 3 |
-
# Gated dataset: accept terms at
|
| 4 |
-
# https://huggingface.co/datasets/MaLA-LM/MassiveSumm_long and export HF_TOKEN.
|
| 5 |
-
dataset_path: MaLA-LM/MassiveSumm_long
|
| 6 |
-
include: _default_summarization_yaml
|
| 7 |
-
process_docs: !function utils.process_docs
|
| 8 |
-
generation_kwargs:
|
| 9 |
-
do_sample: false
|
| 10 |
-
max_gen_toks: 8192
|
| 11 |
-
until:
|
| 12 |
-
- "<|endoftext|>"
|
| 13 |
-
doc_to_text: "Ти — система реферування новин.\nСтисни наведену нижче статтю в короткий абзац (3-5 речень) українською мовою. Не додавай коментарів.\n\nСтаття:\n{{text}}\n\nСтислий виклад:"
|
| 14 |
-
doc_to_target: "{{summary}}"
|
evals/ukrainian/summarization/ukrainian_summarization.yaml
DELETED
|
@@ -1,10 +0,0 @@
|
|
| 1 |
-
# Summarization subgroup (MassiveSumm long)
|
| 2 |
-
group: ukrainian_summarization
|
| 3 |
-
task:
|
| 4 |
-
- ukrainian_massivesumm_long
|
| 5 |
-
aggregate_metric_list:
|
| 6 |
-
- metric: rouge_l
|
| 7 |
-
aggregation: mean
|
| 8 |
-
weight_by_size: true
|
| 9 |
-
metadata:
|
| 10 |
-
version: 1.0
|
evals/ukrainian/summarization/utils.py
DELETED
|
@@ -1,111 +0,0 @@
|
|
| 1 |
-
"""Utility helpers for Ukrainian summarization tasks (MassiveSumm short/long).
|
| 2 |
-
|
| 3 |
-
Both MassiveSumm subsets are highly multilingual single-table datasets
|
| 4 |
-
(one ``train`` split, no language configs). We filter to Ukrainian rows
|
| 5 |
-
inside ``process_docs``. The HF dataset is **gated** — accept the terms
|
| 6 |
-
on the dataset page once and export ``HF_TOKEN`` before running.
|
| 7 |
-
|
| 8 |
-
Scoring uses ROUGE-L F1 via the ``rouge_score`` package, which is
|
| 9 |
-
already a transitive dependency of lm-evaluation-harness.
|
| 10 |
-
"""
|
| 11 |
-
|
| 12 |
-
from __future__ import annotations
|
| 13 |
-
|
| 14 |
-
import re
|
| 15 |
-
import string
|
| 16 |
-
|
| 17 |
-
import datasets
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
_UKRAINIAN_LANG_CODES = {"ukr", "uk"}
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
def _strip_think_tags(text: str) -> str:
|
| 24 |
-
"""Strip <think>...</think> reasoning wrapper (e.g. Qwen thinking models)."""
|
| 25 |
-
if "</think>" in text:
|
| 26 |
-
return text.split("</think>")[-1].strip()
|
| 27 |
-
return text
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
def _filter_ukrainian(dataset: datasets.Dataset) -> datasets.Dataset:
|
| 31 |
-
"""Keep rows whose ``language`` field is one of Ukrainian variants."""
|
| 32 |
-
if "language" not in dataset.column_names:
|
| 33 |
-
return dataset
|
| 34 |
-
return dataset.filter(lambda row: str(row.get("language", "")).lower() in _UKRAINIAN_LANG_CODES)
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
def _normalise_doc(doc):
|
| 38 |
-
"""Project the columns we actually need."""
|
| 39 |
-
text = (doc.get("text") or "").strip()
|
| 40 |
-
summary = (doc.get("summary") or "").strip()
|
| 41 |
-
title = (doc.get("title") or "").strip()
|
| 42 |
-
return {
|
| 43 |
-
"text": text,
|
| 44 |
-
"summary": summary,
|
| 45 |
-
"title": title,
|
| 46 |
-
}
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
|
| 50 |
-
filtered = _filter_ukrainian(dataset)
|
| 51 |
-
return filtered.map(_normalise_doc, remove_columns=[
|
| 52 |
-
c for c in filtered.column_names if c not in ("text", "summary", "title")
|
| 53 |
-
])
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
# ── ROUGE-L scoring ──────────────────────────────────────────────────
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
|
| 60 |
-
_WHITESPACE_RE = re.compile(r"\s+")
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
def _normalise(text: str) -> str:
|
| 64 |
-
text = text.translate(_PUNCT_TABLE)
|
| 65 |
-
text = _WHITESPACE_RE.sub(" ", text)
|
| 66 |
-
return text.strip().lower()
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
def _lcs_length(a, b):
|
| 70 |
-
m, n = len(a), len(b)
|
| 71 |
-
if m == 0 or n == 0:
|
| 72 |
-
return 0
|
| 73 |
-
dp = [0] * (n + 1)
|
| 74 |
-
for i in range(1, m + 1):
|
| 75 |
-
prev = 0
|
| 76 |
-
for j in range(1, n + 1):
|
| 77 |
-
tmp = dp[j]
|
| 78 |
-
if a[i - 1] == b[j - 1]:
|
| 79 |
-
dp[j] = prev + 1
|
| 80 |
-
else:
|
| 81 |
-
dp[j] = max(dp[j], dp[j - 1])
|
| 82 |
-
prev = tmp
|
| 83 |
-
return dp[n]
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
def _rouge_l_f1(pred: str, gold: str) -> float:
|
| 87 |
-
"""Compute sentence-level ROUGE-L F1 (no stemming) between pred and gold."""
|
| 88 |
-
pred_tokens = _normalise(pred).split()
|
| 89 |
-
gold_tokens = _normalise(gold).split()
|
| 90 |
-
if not pred_tokens or not gold_tokens:
|
| 91 |
-
return 0.0
|
| 92 |
-
lcs = _lcs_length(pred_tokens, gold_tokens)
|
| 93 |
-
if lcs == 0:
|
| 94 |
-
return 0.0
|
| 95 |
-
precision = lcs / len(pred_tokens)
|
| 96 |
-
recall = lcs / len(gold_tokens)
|
| 97 |
-
return 2 * precision * recall / (precision + recall)
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
def process_results(doc, results):
|
| 101 |
-
raw_response = results[0].strip() if results and results[0] else ""
|
| 102 |
-
pred = _strip_think_tags(raw_response)
|
| 103 |
-
gold = (doc.get("summary") or "").strip()
|
| 104 |
-
return {"rouge_l": (gold, pred)}
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
def rouge_l_agg(items):
|
| 108 |
-
if not items:
|
| 109 |
-
return 0.0
|
| 110 |
-
scores = [_rouge_l_f1(pred, gold) for gold, pred in items]
|
| 111 |
-
return sum(scores) / len(scores)
|
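As a worked example of the sentence-level ROUGE-L F1 computed by the deleted helpers above, the sketch below uses a simplified standalone version. The names `lcs_length` and `rouge_l_f1` and the two sample sentences are illustrative assumptions; unlike the listing, this version only lowercases and splits on whitespace and does not strip punctuation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (row-by-row DP)."""
    dp = [0] * (len(b) + 1)
    for tok_a in a:
        prev = 0  # holds the previous row's value at column j - 1
        for j, tok_b in enumerate(b, start=1):
            tmp = dp[j]
            dp[j] = prev + 1 if tok_a == tok_b else max(dp[j], dp[j - 1])
            prev = tmp
    return dp[-1]


def rouge_l_f1(pred, gold):
    """ROUGE-L F1 over whitespace tokens; lowercasing only, no punctuation stripping."""
    p, g = pred.lower().split(), gold.lower().split()
    lcs = lcs_length(p, g)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(g)
    return 2 * precision * recall / (precision + recall)


# Three of four tokens align in order, so precision = recall = 0.75.
print(rouge_l_f1("кіт сидить на килимі", "кіт лежить на килимі"))
```

With equal-length inputs precision and recall coincide, so the F1 here equals the shared value of 0.75.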
evals/ukrainian/ukrainian.yaml
CHANGED
|
@@ -10,14 +10,12 @@
|
|
| 10 |
# Metrics:
|
| 11 |
# classification & mcq → f1_macro (per sub-group)
|
| 12 |
# qa → exact_match + f1 (per sub-group)
|
| 13 |
-
# summarization → rouge_l (per sub-group)
|
| 14 |
# open_generation → llm_judge_score / open_quality_score (per sub-group)
|
| 15 |
group: ukrainian
|
| 16 |
task:
|
| 17 |
- ukrainian_classification
|
| 18 |
- ukrainian_mcq
|
| 19 |
- ukrainian_qa
|
| 20 |
-
- ukrainian_summarization
|
| 21 |
- ukrainian_open_generation
|
| 22 |
metadata:
|
| 23 |
version: 1.0
|
|
|
|
| 10 |
# Metrics:
|
| 11 |
# classification & mcq → f1_macro (per sub-group)
|
| 12 |
# qa → exact_match + f1 (per sub-group)
|
|
|
|
| 13 |
# open_generation → llm_judge_score / open_quality_score (per sub-group)
|
| 14 |
group: ukrainian
|
| 15 |
task:
|
| 16 |
- ukrainian_classification
|
| 17 |
- ukrainian_mcq
|
| 18 |
- ukrainian_qa
|
|
|
|
| 19 |
- ukrainian_open_generation
|
| 20 |
metadata:
|
| 21 |
version: 1.0
|