hamzabouajila committed on
Commit
bde1c71
·
1 Parent(s): cec147a

refactor the code for better scalability, rename the TSAC task to sentiment analysis, and add the MADAR dataset for transliteration and normalization evaluation

Roadmap.md ADDED
@@ -0,0 +1,236 @@
+ ## 🗺️ Tunisian NLP Leaderboard Roadmap
+
+ ### 📌 Phase 1: Dataset Acquisition & Preparation
+
+ #### 1. **Sentiment Analysis**
+
+ * **Existing Dataset**: **TUNIZI**
+   * **Description**: A large dataset of 100,000 Tunisian Arabizi comments annotated as positive, negative, or neutral.
+   * **Source**: [K4All Foundation](https://k4all.org/project/database-tunisian-arabizi/)
+ * **Usage**: Evaluate models on sentiment classification.
+
+ #### 2. **Named Entity Recognition (NER)**
+
+ * **Existing Dataset**: **ArabNER**
+   * **Description**: A comprehensive Arabic NER corpus that can be adapted to the Tunisian dialect.
+   * **Source**: [ResearchGate](https://www.researchgate.net/publication/374279027_Named_Entity_Recognition_of_Tunisian_Arabic_Using_the_Bi-LSTM-CRF_Model)
+ * **Usage**: Fine-tune models on this corpus to assess entity recognition in Tunisian Arabic text.
+
+ #### 3. **Corpus Coverage**
+
+ * **Existing Dataset**: **Tunisian Dialect Corpus**
+   * **Description**: A sizable collection of Tunisian dialect texts, useful for assessing vocabulary coverage.
+   * **Source**: [Hugging Face](https://huggingface.co/collections/tunis-ai/arabic-datasets-66344cf0df31dc81eb1dcf55)
+ * **Usage**: Measure how much of the Tunisian dialect vocabulary a model covers.
+
+ #### 4. **Arabizi Robustness**
+
+ * **Existing Dataset**: **TUNIZI**
+   * **Description**: Written entirely in Arabizi, it also serves to evaluate robustness to this writing style.
+   * **Source**: [K4All Foundation](https://k4all.org/project/database-tunisian-arabizi/)
+ * **Usage**: Assess robustness to Arabizi spellings.
+
+ #### 5. **Code-Switching**
+
+ * **Existing Dataset**: **TunSwitch**
+   * **Description**: A dataset of code-switched Tunisian Arabic speech, valuable for training and evaluating models on code-switching tasks.
+   * **Source**: [Zenodo](https://zenodo.org/records/8342762)
+ * **Usage**: Evaluate how well models handle switching between Tunisian Arabic and other languages.
+
+ #### 6. **Typo Robustness**
+
+ * **Existing Dataset**: **TUNIZI**
+   * **Description**: Its informal register includes many typographical variations, making it suitable for measuring tolerance to typos.
+   * **Source**: [K4All Foundation](https://k4all.org/project/database-tunisian-arabizi/)
+ * **Usage**: Assess robustness to typographical errors.
+
+ #### 7. **Zero-Shot Transfer**
+
+ * **Existing Dataset**: **TUNIZI**
+   * **Description**: Can be used to test generalization to tasks a model was not explicitly trained on.
+   * **Source**: [K4All Foundation](https://k4all.org/project/database-tunisian-arabizi/)
+ * **Usage**: Evaluate zero-shot transfer capabilities.
+
+ #### 8. **Domain Shift**
+
+ * **Existing Dataset**: **TUNIZI**
+   * **Description**: Its diverse sources provide a foundation for testing domain adaptation.
+   * **Source**: [K4All Foundation](https://k4all.org/project/database-tunisian-arabizi/)
+ * **Usage**: Assess adaptation to unseen domains. A loading sketch for these corpora follows below.
+
+ ---
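+
+ Before wiring up task-specific loaders, it is worth sanity-checking that each corpus can be pulled and inspected. A minimal sketch using the `datasets` library; apart from `arbml/Tunisian_Dialect_Corpus`, the Hub IDs below are hypothetical placeholders for corpora not yet mirrored on the Hugging Face Hub:
+
+ ```python
+ from datasets import load_dataset
+
+ # Hypothetical Hub mirrors; replace with real dataset IDs once uploaded.
+ DATASET_IDS = {
+     "coverage": "arbml/Tunisian_Dialect_Corpus",  # from the HF collection linked above
+     "sentiment": "your-org/tunizi",               # placeholder for the K4All TUNIZI dump
+ }
+
+ for task, repo_id in DATASET_IDS.items():
+     ds = load_dataset(repo_id, split="train")
+     print(task, len(ds), ds.column_names)
+ ```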
+
+ ### 🧪 Phase 2: Metric Development & Evaluation Tasks
+
+ For each task, define the evaluation metric and the corresponding dataset:
+
+ | Task                     | Metric                  | Dataset |
+ | ------------------------ | ----------------------- | ------- |
+ | Sentiment Analysis       | Accuracy / F1 Score     | [TUNIZI](https://k4all.org/project/database-tunisian-arabizi/) |
+ | Named Entity Recognition | F1 Score                | [ArabNER](https://www.researchgate.net/publication/374279027_Named_Entity_Recognition_of_Tunisian_Arabic_Using_the_Bi-LSTM-CRF_Model) |
+ | Corpus Coverage          | Vocabulary Coverage (%) | [Tunisian Dialect Corpus](https://huggingface.co/collections/tunis-ai/arabic-datasets-66344cf0df31dc81eb1dcf55) |
+ | Arabizi Robustness       | Accuracy / F1 Score     | [TUNIZI](https://k4all.org/project/database-tunisian-arabizi/) |
+ | Code-Switching           | Accuracy / F1 Score     | [TunSwitch](https://zenodo.org/records/8342762) |
+ | Typo Robustness          | Accuracy / F1 Score     | [TUNIZI](https://k4all.org/project/database-tunisian-arabizi/) |
+ | Zero-Shot Transfer       | Accuracy / F1 Score     | [TUNIZI](https://k4all.org/project/database-tunisian-arabizi/) |
+ | Domain Shift             | Accuracy / F1 Score     | [TUNIZI](https://k4all.org/project/database-tunisian-arabizi/) |
+
+ ---
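+
+ Most classification rows above reduce to the same two numbers once a model's predictions are collected. A minimal sketch using scikit-learn (the toy prediction/label lists are placeholders):
+
+ ```python
+ from sklearn.metrics import accuracy_score, f1_score
+
+ def score_task(y_true, y_pred):
+     """Accuracy / Macro-F1 pair reported by most Phase 2 tasks."""
+     return {
+         "accuracy": accuracy_score(y_true, y_pred),
+         "macro_f1": f1_score(y_true, y_pred, average="macro"),
+     }
+
+ print(score_task([0, 1, 1, 0], [0, 1, 0, 0]))  # toy example
+ ```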
+
+ ### 🗂️ Suggested Folder Structure
+
+ To maintain organization and clarity, consider the following structure:
+
+ ```
+ TunisianEncoderModelsLeaderboard/
+ ├── datasets/
+ │   ├── sentiment/
+ │   │   └── tunizi.json
+ │   ├── ner/
+ │   │   └── arabner.json
+ │   ├── coverage/
+ │   │   └── tunisian_dialect_corpus.json
+ │   ├── arabizi_robustness/
+ │   │   └── tunizi.json
+ │   ├── code_switching/
+ │   │   └── tunswitch.json
+ │   ├── typo_robustness/
+ │   │   └── tunizi_with_typos.json
+ │   ├── zero_shot/
+ │   │   └── tunizi.json
+ │   └── domain_shift/
+ │       └── tunisian_domain_shift.json
+ ├── scripts/
+ │   ├── preprocess.py
+ │   ├── evaluate.py
+ │   └── visualize.py
+ └── README.md
+ ```
+
+ ---
+
+ ### ✅ Next Steps
+
+ 1. **Integrate Existing Datasets**: Incorporate the datasets above into the repository, ensuring they are properly formatted and documented.
+ 2. **Develop Evaluation Scripts**: Write scripts to evaluate models on each task, ensuring they are compatible with the leaderboard format.
+ 3. **Populate the Leaderboard**: As models are evaluated, update the leaderboard to reflect their performance across tasks (a result-entry sketch follows this list).
+ 4. **Documentation**: Update README.md with clear instructions on how to use the leaderboard, contribute models, and interpret results.
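+
+ A leaderboard row can then be a flat record keyed by task name. A hedged sketch in Python (the fields mirror the EvaluationResult dataclass elsewhere in this commit; the scores are invented):
+
+ ```python
+ # Hypothetical leaderboard entry; illustrative, not a fixed schema.
+ entry = {
+     "model": "org/tunisian-encoder-base",
+     "revision": "main",
+     "precision": "float16",
+     "results": {"Sentiment Analysis": 0.81, "Corpus Coverage": 0.67},
+ }
+ ```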
app.py CHANGED
@@ -1,6 +1,5 @@
  from dotenv import load_dotenv
 
- load_dotenv()
 
  import gradio as gr
  from gradio_leaderboard import Leaderboard, ColumnFilter, SelectColumns
@@ -31,8 +30,9 @@ from src.display.utils import (
  from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
  from src.populate import get_evaluation_queue_df, get_leaderboard_df
  from src.submission.submit import add_new_eval
- from src.evaluator.run_evaluator import evaluator_runner
+ from src.evaluators.run_evaluator import evaluator_runner
 
+ load_dotenv()
  def restart_space():
      try:
          print("Restarting space...")
@@ -240,9 +240,9 @@ with demo:
 
- scheduler = BackgroundScheduler()
- scheduler.add_job(restart_space, "interval", seconds=120)
+ # scheduler = BackgroundScheduler()
+ # scheduler.add_job(restart_space, "interval", seconds=120)
  thread = threading.Thread(target=evaluator_runner)
- scheduler.start()
+ # scheduler.start()
  thread.start()
  demo.queue(default_concurrency_limit=40).launch()
pyproject.toml CHANGED
@@ -12,8 +12,6 @@ dependencies = [
      "gradio-leaderboard==0.0.13",
      "gradio[oauth]>=5.35.0",
      "huggingface-hub>=0.18.0",
-     "ipykernel>=6.29.5",
-     "ipywidgets>=8.1.7",
      "matplotlib>=3.10.3",
      "numpy>=2.3.1",
      "pandas>=2.3.0",
@@ -22,12 +20,19 @@ dependencies = [
      "python-dotenv>=1.1.1",
      "scikit-learn>=1.7.0",
      "sentencepiece>=0.2.0",
+     "seqeval>=1.2.2",
      "tokenizers>=0.15.0",
      "torch>=2.7.1",
      "tqdm>=4.67.1",
      "transformers>=4.53.1",
  ]
 
+ [project.optional-dependencies]
+ dev = [
+     "ipykernel>=6.30.1",
+     "ipywidgets>=8.1.7",
+ ]
+
  [tool.ruff]
  # Enable pycodestyle (`E`) and Pyflakes (`F`) codes by default.
  select = ["E", "F"]
src/about.py CHANGED
@@ -8,13 +8,14 @@ class Task:
      col_name: str  # Column name
 
 
- # Tunisian Dialect Tasks
- # ---------------------------------------------------
  class Tasks(Enum):
-     # Example: Sentiment Analysis on TSAC
-     accuracy = Task("fbougares/tsac", "accuracy", "Accuracy (TSAC) ⬆️")
-     # Example: Text Classification or Corpus Coverage on Tunisian Dialect Corpus
-     coverage = Task("arbml/Tunisian_Dialect_Corpus", "coverage", "Coverage (Tunisian Corpus) %")
+     sentiment_accuracy = Task("fbougares/tsac", "accuracy", "Accuracy (TSAC) ⬆️")
+     sentiment_f1 = Task("fbougares/tsac", "macro_f1", "Macro-F1 (TSAC) ⬆️")
+     ner_f1 = Task("arbml/tunisian_ner", "entity_f1", "Entity F1 (NER) ⬆️")
+     coverage = Task("arbml/Tunisian_Dialect_Corpus", "coverage", "Corpus Coverage % ⬆️")
+     arabizi_robustness = Task("tunis-ai/arabizi_eval", "arabizi_f1", "Arabizi Robustness F1 ⬆️")
+     code_switch = Task("tunis-ai/codeswitch_eval", "accuracy", "Code-Switch Accuracy ⬆️")
+     typo_robustness = Task("tunis-ai/typo_eval", "f1_drop", "Typo Robustness Drop % ⬇️")
 
  NUM_FEWSHOT = 0  # Change with your few shot
  # ---------------------------------------------------
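The new `typo_robustness` column reports a drop, so lower is better (hence the ⬇️ arrow, unlike the other columns). A minimal hedged sketch of how such a relative drop could be computed; `f1_drop` and its inputs are illustrative, not code from this repo:

```python
def f1_drop(clean_f1: float, noisy_f1: float) -> float:
    """Relative Macro-F1 degradation (%) when typos are injected; lower is better."""
    if clean_f1 == 0:
        return 0.0
    return 100.0 * (clean_f1 - noisy_f1) / clean_f1

print(f1_drop(0.80, 0.72))  # ≈ 10.0 -> the model loses ~10% of its clean F1 under typos
```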
src/configs/config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "tsac": {
+     "path": "fbougares/tsac",
+     "text_column": "sentence",
+     "label_column": "target",
+     "label_map": {
+       "0": 0,
+       "1": 1
+     },
+     "trust_remote_code": true
+   },
+   "tunisian_sentiment": {
+     "path": "your-org/tunisian-sentiment",
+     "text_column": "text",
+     "label_column": "label",
+     "label_map": {
+       "negative": 0,
+       "positive": 1,
+       "neutral": -1
+     },
+     "trust_remote_code": false
+   }
+ }
src/configs/config.py ADDED
@@ -0,0 +1,21 @@
+ from typing import Dict, Union
+
+ from pydantic import BaseModel
+
+
+ class DatasetConfig(BaseModel):
+     path: str
+     text_column: str
+     label_column: str
+     label_map: Dict[Union[str, int], int]
+     trust_remote_code: bool = False
+
+
+ # Mirrors src/configs/config.json, validated at import time.
+ CONFIGS: Dict[str, DatasetConfig] = {
+     "tsac": DatasetConfig(
+         path="fbougares/tsac",
+         text_column="sentence",
+         label_column="target",
+         label_map={0: 0, 1: 1},  # already binary
+         trust_remote_code=True,
+     ),
+     "tunisian_sentiment": DatasetConfig(
+         path="your-org/tunisian-sentiment",  # hypothetical
+         text_column="text",
+         label_column="label",
+         label_map={"negative": 0, "positive": 1, "neutral": -1},  # drop neutral
+         trust_remote_code=False,
+     ),
+     # Add more as they become available
+ }
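A hedged sketch of validating the JSON file above against the same schema; the relative path is an assumption about where the app is launched from:

```python
import json
from pathlib import Path

from src.configs.config import DatasetConfig  # defined in the file above

raw = json.loads(Path("src/configs/config.json").read_text())
configs = {name: DatasetConfig(**cfg) for name, cfg in raw.items()}
print(configs["tsac"].path)  # -> fbougares/tsac
```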
src/evaluators/__init__.py ADDED
@@ -0,0 +1,18 @@
+ # src/evaluators/__init__.py
+ from typing import Dict, Type
+
+ from .base_evaluator import BaseEvaluator
+
+ # Import all evaluators
+ from .sentiment_analysis.evaluator import SentimentAnalysisEvaluator
+ # from .tunisian_corpus_coverage import TunisianCorpusCoverageEvaluator
+ # Add new ones here as you create them:
+ from .normalization import NormalizationEvaluator
+ from .transliteration import TransliterationEvaluator
+
+ # Registry: task_name → Evaluator class
+ EVALUATOR_REGISTRY: Dict[str, Type[BaseEvaluator]] = {
+     "Sentiment Analysis": SentimentAnalysisEvaluator,
+     # "Corpus Coverage": TunisianCorpusCoverageEvaluator,
+     "Normalization": NormalizationEvaluator,
+     "Transliteration": TransliterationEvaluator,
+ }
src/evaluators/base_evaluator.py ADDED
@@ -0,0 +1,17 @@
+ # src/evaluators/base_evaluator.py
+ from abc import ABC, abstractmethod
+ from typing import Dict, Any
+
+
+ class BaseEvaluator(ABC):
+     @abstractmethod
+     def load_dataset(self):
+         pass
+
+     @abstractmethod
+     def evaluate(self, model, tokenizer, device) -> Dict[str, Any]:
+         pass
+
+     @property
+     @abstractmethod
+     def task_name(self) -> str:
+         pass
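Adding a task is a subclass of this ABC plus one entry in the EVALUATOR_REGISTRY above. A minimal hedged sketch — `EchoEvaluator` and its toy data are illustrative, not part of the commit:

```python
from typing import Any, Dict

from src.evaluators.base_evaluator import BaseEvaluator


class EchoEvaluator(BaseEvaluator):
    """Toy task used only to illustrate the evaluator contract."""

    @property
    def task_name(self) -> str:
        return "Echo"

    def load_dataset(self):
        # One toy Arabizi/Arabic pair; a real task would load a Hub dataset.
        return [("3asslema", "عسلامة")]

    def evaluate(self, model, tokenizer, device) -> Dict[str, Any]:
        pairs = self.load_dataset()
        # A real evaluator would run the model here; "main_metric" is the one
        # key that evaluate.py requires from every task.
        return {"task": self.task_name, "main_metric": 1.0, "total_samples": len(pairs)}


# Registration (in src/evaluators/__init__.py):
# EVALUATOR_REGISTRY["Echo"] = EchoEvaluator
```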
src/{evaluator → evaluators}/evaluate.py RENAMED
@@ -5,14 +5,16 @@ from typing import Dict
  from dataclasses import dataclass
  from enum import Enum
  import torch
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModel
  import traceback
 
- from src.envs import API, OWNER, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, RESULTS_REPO, QUEUE_REPO,TOKEN
- from src.evaluator.tunisian_corpus_coverage import evaluate_tunisian_corpus_coverage
- from src.evaluator.tsac import evaluate_tsac_sentiment
+ from src.evaluators import EVALUATOR_REGISTRY
+ from src.evaluators.base_evaluator import BaseEvaluator
+ from src.envs import API, EVAL_REQUESTS_PATH, RESULTS_REPO, QUEUE_REPO, TOKEN
+ # from src.evaluators.tunisian_corpus_coverage import evaluate_tunisian_corpus_coverage
+ from src.evaluators.sentiment_analysis.evaluator import SentimentAnalysisEvaluator
 
+ sa_evaluator = SentimentAnalysisEvaluator()
  class EvaluationStatus(Enum):
      PENDING = "PENDING"
      RUNNING = "RUNNING"
@@ -30,85 +32,66 @@ class EvaluationResult:
      error: str = None
 
 
  def evaluate_model(model_name: str, revision: str, precision: str, weight_type: str) -> EvaluationResult:
      """
-     Evaluates a single model on all defined tasks.
-
-     Args:
-         model_name (str): The name of the model on the Hugging Face Hub.
-         revision (str): The specific revision (commit hash or branch name) to use.
-         precision (str): The precision (e.g., 'float16') for model loading.
-         weight_type (str): The type of weights ('Original' or 'Adapter').
-
-     Returns:
-         EvaluationResult: A dataclass containing the evaluation results or an error message.
+     Evaluates a model on ALL registered tasks.
      """
      try:
-         print(f"\nStarting evaluation for model: {model_name} (revision: {revision}, precision: {precision}, weight_type: {weight_type})")
-
+         print(f"\nStarting evaluation for model: {model_name}")
          device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-         print(f"Using device: {device}")
 
-         try:
-             print(f"\nLoading model and tokenizer for: {model_name}")
-
-             model = AutoModelForSequenceClassification.from_pretrained(
-                 model_name,
-                 revision=revision,
-                 torch_dtype=getattr(torch, precision),
-                 trust_remote_code=True
-             ).to(device)
-             tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
-
-             print(f"Successfully loaded model and tokenizer.")
-         except Exception as e:
-             error_msg = f"Failed to load model or tokenizer: {str(e)}"
-             print(f"Error: {error_msg}")
-             print(f"Full traceback: {traceback.format_exc()}")
-             return EvaluationResult(
-                 model=model_name,
-                 revision=revision,
-                 precision=precision,
-                 weight_type=weight_type,
-                 results={},
-                 error=error_msg
-             )
-
-         tsac_results = {"accuracy": None}
-         tunisian_results = {"coverage": None}
-
-         print("\nStarting TSAC sentiment evaluation...")
-         try:
-             tsac_results = evaluate_tsac_sentiment(model, tokenizer, device)
-             print(f"TSAC results: {tsac_results}")
-         except Exception as e:
-             print(f"Error in TSAC evaluation for {model_name}: {str(e)}")
-             print(f"Full traceback: {traceback.format_exc()}")
-
-         print("\nStarting Tunisian Corpus evaluation...")
-         try:
-             tunisian_results = evaluate_tunisian_corpus_coverage(model, tokenizer, device)
-             print(f"Tunisian Corpus results: {tunisian_results}")
-         except Exception as e:
-             print(f"Error in Tunisian Corpus evaluation for {model_name}: {str(e)}")
-             print(f"Full traceback: {traceback.format_exc()}")
-
-         print("\nEvaluation completed successfully!")
-
+         # Load model & tokenizer ONCE
+         print("Loading classification model and tokenizer...")
+         classification_model = AutoModelForSequenceClassification.from_pretrained(
+             model_name,
+             revision=revision,
+             torch_dtype=getattr(torch, precision),
+             trust_remote_code=True
+         ).to(device)
+         tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
+         print("✅ Classification model loaded successfully.")
+         print("Loading base model...")
+         embedding_model = AutoModel.from_pretrained(
+             model_name,
+             revision=revision,
+             torch_dtype=getattr(torch, precision),
+             trust_remote_code=True
+         ).to(device)
+         print("✅ Embedding model loaded successfully.")
+         all_results = {}
+         for task_name, EvaluatorClass in EVALUATOR_REGISTRY.items():
+             print(f"\n--- Evaluating: {task_name} ---")
+             try:
+                 # Classification head for sentiment; bare encoder for the
+                 # embedding-similarity tasks.
+                 if task_name == "Sentiment Analysis":
+                     model = classification_model
+                 elif task_name in ["Transliteration", "Normalization"]:
+                     model = embedding_model
+
+                 evaluator: BaseEvaluator = EvaluatorClass()
+                 result = evaluator.evaluate(model, tokenizer, device=device)
+
+                 # Extract main metric (must be in every evaluator)
+                 all_results[task_name] = result["main_metric"]
+                 print(f"✅ {task_name}: {result['main_metric']:.4f}")
+
+             except Exception as e:
+                 error_msg = f"Failed {task_name}: {str(e)}"
+                 print(f"❌ {error_msg}")
+                 all_results[task_name] = None  # or skip
+
          return EvaluationResult(
              model=model_name,
              revision=revision,
              precision=precision,
              weight_type=weight_type,
-             results={
-                 "accuracy": tsac_results.get("fbougares/tsac"),
-                 "coverage": tunisian_results.get("arbml/Tunisian_Dialect_Corpus")
-             }
+             results=all_results
          )
+
      except Exception as e:
-         error_msg = f"An unexpected error occurred during evaluation: {str(e)}"
-         print(f"Error: {error_msg}")
-         print(f"Full traceback: {traceback.format_exc()}")
+         error_msg = f"Critical failure: {str(e)}"
+         print(f"💥 {error_msg}")
          return EvaluationResult(
              model=model_name,
              revision=revision,
@@ -152,7 +135,7 @@ def process_evaluation_queue():
      This function acts as a worker that finds a PENDING job, runs it,
      and updates the status on the Hugging Face Hub.
      """
-     print(f"\n=== Starting evaluation queue processing ===")
+     print("\n=== Starting evaluation queue processing ===")
      print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
 
      print(f"Looking for evaluation requests in: {EVAL_REQUESTS_PATH}")
@@ -206,6 +189,8 @@
      for v in eval_result.results.values():
          if v is None:
+             if eval_result.error is None:
+                 eval_result.error = ""
              eval_result.error += f"Evaluation failed for {eval_entry['model']}: {v} is None"
 
      print("\n=== Evaluation completed ===")
src/evaluators/madar_tun.py ADDED
@@ -0,0 +1,108 @@
+ import torch
+ from datasets import load_dataset
+ from sklearn.metrics import accuracy_score
+ import warnings
+
+ warnings.filterwarnings("ignore")
+
+
+ def load_and_prepare_data():
+     """Load MADAR-TUN and prepare normalization & transliteration pairs."""
+     print("Loading MADAR-TUN dataset...")
+     ds = load_dataset("tunis-ai/MADAR-TUN", split="train")
+
+     valid_examples = [
+         ex for ex in ds
+         if ex["arabish"] != "<eos>"
+         and ex["words"] != "<eos>"
+         and ex["lem"] != "<eos>"
+         and ex["arabish"] is not None
+         and ex["arabish"].strip()
+         and ex["words"] is not None
+         and ex["words"].strip()
+         and ex["lem"] is not None
+         and ex["lem"].strip()
+     ]
+
+     print(f"Loaded {len(valid_examples)} valid token entries.")
+
+     # Build unique pairs (deduplicate)
+     norm_pairs = {}   # arabish -> canonical lemma
+     trans_pairs = {}  # arabish -> Arabic script
+
+     for ex in valid_examples:
+         arabizi = ex["arabish"]
+         arabic = ex["words"]
+         lemma = ex["lem"]
+
+         # For normalization: use lemma as canonical form
+         if arabizi not in norm_pairs:
+             norm_pairs[arabizi] = lemma
+
+         # For transliteration: map Arabizi to Arabic script
+         if arabizi not in trans_pairs:
+             trans_pairs[arabizi] = arabic
+
+     print(f"Normalization pairs: {len(norm_pairs)}")
+     print(f"Transliteration pairs: {len(trans_pairs)}")
+
+     return norm_pairs, trans_pairs
+
+
+ def evaluate_word_classification(model, tokenizer, word_pairs, device, task_name):
+     """
+     Evaluate word-level classification (normalization or transliteration).
+     Treats it as closed-vocabulary classification via embedding similarity.
+     """
+     words = list(word_pairs.keys())
+     targets = list(word_pairs.values())
+
+     # Build target vocabulary
+     unique_targets = sorted(set(targets))
+     target_to_id = {t: i for i, t in enumerate(unique_targets)}
+
+     print(f"\n[{task_name}] Vocabulary size: {len(unique_targets)}")
+     print(f"[{task_name}] Evaluation samples: {len(words)}")
+
+     # Get embeddings for all target forms
+     print(f"[{task_name}] Encoding target vocabulary...")
+     target_encodings = tokenizer(
+         unique_targets,
+         padding=True,
+         truncation=True,
+         max_length=32,
+         return_tensors="pt"
+     ).to(device)
+
+     with torch.no_grad():
+         target_embeds = model(**target_encodings).last_hidden_state[:, 0]  # [V, H]
+
+     # Predict for each input word
+     predictions = []
+     batch_size = 32
+
+     print(f"[{task_name}] Predicting...")
+     for i in range(0, len(words), batch_size):
+         batch_words = words[i:i+batch_size]
+         inputs = tokenizer(
+             batch_words,
+             padding=True,
+             truncation=True,
+             max_length=32,
+             return_tensors="pt"
+         ).to(device)
+
+         with torch.no_grad():
+             word_embeds = model(**inputs).last_hidden_state[:, 0]  # [B, H]
+             logits = torch.matmul(word_embeds, target_embeds.T)    # [B, V]
+             preds = logits.argmax(dim=1).cpu().tolist()
+             predictions.extend(preds)
+
+     # Map gold targets to vocabulary IDs
+     true_labels = [target_to_id[t] for t in targets]
+
+     acc = accuracy_score(true_labels, predictions)
+     print(f"[{task_name}] Accuracy: {acc:.4f}")
+     return acc
src/evaluators/normalization/__init__.py ADDED
@@ -0,0 +1 @@
+ from .evaluator import NormalizationEvaluator
src/evaluators/normalization/datasets.py ADDED
@@ -0,0 +1,10 @@
+ # src/evaluators/normalization/datasets.py
+ NORMALIZATION_DATASETS = {
+     "madar-tun": {
+         "path": "tunis-ai/MADAR-TUN",
+         "split": "test",  # fall back to "train" if no test split is published
+         "arabish_col": "arabish",
+         "canonical_col": "lem",  # could also be "words"
+         "description": "MADAR-TUN: Arabizi → Lemma normalization"
+     }
+ }
src/evaluators/normalization/evaluator.py ADDED
@@ -0,0 +1,96 @@
+ # src/evaluators/normalization/evaluator.py
+ import torch
+ from datasets import load_dataset
+ from sklearn.metrics import accuracy_score
+ from typing import Dict, Any
+ import warnings
+
+ from ..base_evaluator import BaseEvaluator
+ from .datasets import NORMALIZATION_DATASETS
+
+ warnings.filterwarnings("ignore")
+
+
+ class NormalizationEvaluator(BaseEvaluator):
+     def __init__(self, dataset_key: str = "madar-tun", max_samples: int = None):
+         if dataset_key not in NORMALIZATION_DATASETS:
+             raise ValueError(f"Unknown dataset: {dataset_key}")
+         self.config = NORMALIZATION_DATASETS[dataset_key]
+         self.max_samples = max_samples
+
+     @property
+     def task_name(self) -> str:
+         return "Normalization"
+
+     def load_dataset(self):
+         print(f"\nLoading normalization data from {self.config['path']}...")
+         ds = load_dataset(
+             self.config["path"],
+             split=self.config["split"]
+         )
+
+         valid = []
+         for ex in ds:
+             a = ex[self.config["arabish_col"]]
+             c = ex[self.config["canonical_col"]]
+             if a and c and a != "<eos>" and c != "<eos>" and a.strip() and c.strip():
+                 valid.append((a.strip(), c.strip()))
+
+         if self.max_samples:
+             valid = valid[:self.max_samples]
+
+         print(f"Loaded {len(valid)} normalization pairs.")
+         return valid  # List[Tuple[noisy, canonical]]
+
+     def evaluate(self, model, tokenizer, device: str = "cuda") -> Dict[str, Any]:
+         pairs = self.load_dataset()
+         if not pairs:
+             raise ValueError("No valid normalization pairs found!")
+
+         words, targets = zip(*pairs)
+         words, targets = list(words), list(targets)
+
+         # Build vocab
+         unique_targets = sorted(set(targets))
+         target_to_id = {t: i for i, t in enumerate(unique_targets)}
+
+         # Encode targets
+         target_enc = tokenizer(
+             unique_targets,
+             padding=True,
+             truncation=True,
+             max_length=32,
+             return_tensors="pt"
+         ).to(device)
+
+         with torch.no_grad():
+             target_embeds = model(**target_enc).last_hidden_state[:, 0]
+
+         # Predict
+         predictions = []
+         batch_size = 32
+         for i in range(0, len(words), batch_size):
+             batch = words[i:i+batch_size]
+             inputs = tokenizer(
+                 batch,
+                 padding=True,
+                 truncation=True,
+                 max_length=32,
+                 return_tensors="pt"
+             ).to(device)
+
+             with torch.no_grad():
+                 word_embeds = model(**inputs).last_hidden_state[:, 0]
+                 logits = torch.matmul(word_embeds, target_embeds.T)
+                 preds = logits.argmax(dim=1).cpu().tolist()
+                 predictions.extend(preds)
+
+         true_labels = [target_to_id[t] for t in targets]
+         acc = accuracy_score(true_labels, predictions)
+
+         print(f"✅ Normalization Accuracy: {acc:.4f}")
+         return {
+             "task": self.task_name,
+             "main_metric": acc,
+             "accuracy": acc,
+             "total_samples": len(pairs)
+         }
src/{evaluator → evaluators}/run_evaluator.py RENAMED
@@ -1,5 +1,5 @@
  import time
- from src.evaluator.evaluate import process_evaluation_queue
+ from src.evaluators.evaluate import process_evaluation_queue
 
 
  def evaluator_runner():
src/evaluators/sentiment_analysis/__init__.py ADDED
File without changes
src/evaluators/sentiment_analysis/dataset.py ADDED
File without changes
src/evaluators/sentiment_analysis/evaluator.py ADDED
@@ -0,0 +1,207 @@
+ import torch
+ from torch.utils.data import DataLoader
+ from datasets import concatenate_datasets, load_dataset, Dataset
+ from typing import Dict, Any, List, Optional
+ import warnings
+
+ from ..base_evaluator import BaseEvaluator
+
+
+ SUPPORTED_DATASETS = {
+     "tsac": {
+         "path": "tunis-ai/tsac",
+         "text_column": "sentence",
+         "label_column": "target",
+         "label_map": {0: 0, 1: 1},  # already binary
+         "trust_remote_code": True,
+         "split": "test"
+     },
+ }
+
+
+ class SentimentAnalysisEvaluator(BaseEvaluator):
+     """
+     Unified evaluator for Tunisian sentiment analysis.
+     Supports multiple datasets and harmonizes labels to binary (0=neg, 1=pos).
+     Neutral or unmappable labels are filtered out.
+     """
+
+     def __init__(
+         self,
+         datasets: Optional[List[str]] = None,
+         max_samples_per_dataset: int = 500,
+         batch_size: int = 16
+     ):
+         """
+         Args:
+             datasets: List of dataset keys from SUPPORTED_DATASETS.
+                 If None, uses all available.
+             max_samples_per_dataset: Limit samples per dataset for faster eval.
+             batch_size: Inference batch size.
+         """
+         if datasets is None:
+             self.dataset_keys = list(SUPPORTED_DATASETS.keys())
+         else:
+             for d in datasets:
+                 if d not in SUPPORTED_DATASETS:
+                     raise ValueError(f"Dataset '{d}' not in supported list: {list(SUPPORTED_DATASETS.keys())}")
+             self.dataset_keys = datasets
+
+         self.max_samples_per_dataset = max_samples_per_dataset
+         self.batch_size = batch_size
+
+     @property
+     def task_name(self) -> str:
+         return "Sentiment Analysis"
+
+     def load_dataset(self) -> Dataset:
+         """Load and harmonize all configured sentiment datasets."""
+         print("\n=== Loading Tunisian Sentiment Datasets ===")
+         all_datasets = []
+
+         for key in self.dataset_keys:
+             cfg = SUPPORTED_DATASETS[key]
+             print(f"\nLoading '{key}': {cfg.get('description', 'No description available.')}")
+
+             try:
+                 ds = load_dataset(
+                     cfg["path"],
+                     split=cfg["split"],
+                     trust_remote_code=cfg.get("trust_remote_code", False)
+                 )
+                 print(f"  Raw size: {len(ds)}")
+             except Exception as e:
+                 warnings.warn(f"Failed to load {key}: {e}. Skipping.")
+                 continue
+
+             # Harmonize to {"text": str, "label": int in {0, 1}}; label -1 marks
+             # neutral/unmappable rows so they can be filtered out below
+             # (datasets.map cannot drop rows by returning None).
+             def harmonize(example):
+                 try:
+                     text = example[cfg["text_column"]]
+                     orig_label = example[cfg["label_column"]]
+                     new_label = cfg["label_map"].get(orig_label, -1)
+                     if new_label not in (0, 1):
+                         new_label = -1  # skip neutral/invalid
+                     return {"text": text, "label": new_label}
+                 except Exception:
+                     return {"text": "", "label": -1}
+
+             print("  Harmonizing and filtering...")
+             ds = ds.map(
+                 harmonize,
+                 load_from_cache_file=False,
+                 desc=f"Harmonizing {key}"
+             )
+
+             print("  Filtering invalid/neutral samples...")
+             ds = ds.filter(lambda x: x["label"] in (0, 1), load_from_cache_file=False)
+             print(f"  Valid binary samples: {len(ds)}")
+
+             if self.max_samples_per_dataset and len(ds) > self.max_samples_per_dataset:
+                 ds = ds.select(range(self.max_samples_per_dataset))
+                 print(f"  Trimmed to {self.max_samples_per_dataset} samples")
+
+             if len(ds) > 0:
+                 all_datasets.append(ds)
+
+         if not all_datasets:
+             raise ValueError("No valid sentiment data found!")
+
+         # Combine all datasets
+         combined = concatenate_datasets(all_datasets)
+         print(f"\n✅ Total Tunisian sentiment samples: {len(combined)}")
+         return combined
+
+     def _tokenize_batch(self, examples, tokenizer):
+         # Tokenize the harmonized "text" column (not the raw dataset column).
+         return tokenizer(
+             examples["text"],
+             padding=True,
+             truncation=True,
+             max_length=512,
+             return_tensors=None
+         )
+
+     def _collate_fn(self, batch):
+         input_ids = torch.stack([torch.tensor(b["input_ids"]) for b in batch])
+         attention_mask = torch.stack([torch.tensor(b["attention_mask"]) for b in batch])
+         labels = torch.tensor([b["labels"] for b in batch], dtype=torch.long)
+         return {
+             "input_ids": input_ids,
+             "attention_mask": attention_mask,
+             "labels": labels
+         }
+
+     def evaluate(self, model, tokenizer, device: str = "cuda") -> Dict[str, Any]:
+         """Evaluate model on the unified Tunisian sentiment task."""
+         print(f"\n=== Evaluating {self.task_name} ===")
+         print(f"Model: {model.__class__.__name__} | Device: {device}")
+         print(f"Datasets: {self.dataset_keys}")
+
+         # Load and prepare data
+         raw_dataset = self.load_dataset()
+         tokenized = raw_dataset.map(
+             lambda ex: self._tokenize_batch(ex, tokenizer),
+             batched=True,
+             remove_columns=raw_dataset.column_names
+         )
+         tokenized.set_format(type="torch", columns=["input_ids", "attention_mask"])
+         tokenized = tokenized.add_column("labels", raw_dataset["label"])
+
+         dataloader = DataLoader(
+             tokenized,
+             batch_size=self.batch_size,
+             shuffle=False,
+             collate_fn=self._collate_fn
+         )
+
+         # Inference
+         model.eval()
+         all_preds, all_labels = [], []
+
+         with torch.no_grad():
+             for batch in dataloader:
+                 inputs = {
+                     k: v.to(device) for k, v in batch.items()
+                     if k in ["input_ids", "attention_mask"]
+                 }
+                 labels = batch["labels"].to(device)
+
+                 outputs = model(**inputs)
+                 logits = outputs.logits if hasattr(outputs, "logits") else outputs[0]
+
+                 if logits.dim() == 3:  # [B, L, C]
+                     logits = logits[:, 0, :]
+
+                 all_preds.extend(logits.argmax(dim=-1).cpu().tolist())
+                 all_labels.extend(labels.cpu().tolist())
+
+         # Metrics
+         correct = sum(p == t for p, t in zip(all_preds, all_labels))
+         total = len(all_preds)
+         accuracy = correct / total if total > 0 else 0.0
+
+         print(f"\n✅ {self.task_name} Results:")
+         print(f"  Accuracy: {accuracy:.4f} ({correct}/{total})")
+
+         return {
+             "task": self.task_name,
+             "accuracy": accuracy,
+             "main_metric": accuracy,
+             "total_samples": total,
+             "datasets_used": self.dataset_keys
+         }
src/evaluators/transliteration/__init__.py ADDED
@@ -0,0 +1 @@
+ from .evaluator import TransliterationEvaluator
src/evaluators/transliteration/datasets.py ADDED
@@ -0,0 +1,10 @@
+ # src/evaluators/transliteration/datasets.py
+ TRANSLITERATION_DATASETS = {
+     "madar-tun": {
+         "path": "tunis-ai/MADAR-TUN",
+         "split": "test",
+         "source_col": "arabish",  # Latin script
+         "target_col": "words",    # Arabic script
+         "description": "MADAR-TUN: Arabizi ↔ Arabic script"
+     }
+ }
src/evaluators/transliteration/evaluator.py ADDED
@@ -0,0 +1,96 @@
+ # src/evaluators/transliteration/evaluator.py
+ import torch
+ from datasets import load_dataset
+ from sklearn.metrics import accuracy_score
+ from typing import Dict, Any
+ import warnings
+
+ from ..base_evaluator import BaseEvaluator
+ from .datasets import TRANSLITERATION_DATASETS
+
+ warnings.filterwarnings("ignore")
+
+
+ class TransliterationEvaluator(BaseEvaluator):
+     def __init__(self, dataset_key: str = "madar-tun", max_samples: int = None):
+         if dataset_key not in TRANSLITERATION_DATASETS:
+             raise ValueError(f"Unknown dataset: {dataset_key}")
+         self.config = TRANSLITERATION_DATASETS[dataset_key]
+         self.max_samples = max_samples
+
+     @property
+     def task_name(self) -> str:
+         return "Transliteration"
+
+     def load_dataset(self):
+         print(f"\nLoading transliteration data from {self.config['path']}...")
+         ds = load_dataset(
+             self.config["path"],
+             split=self.config["split"]
+         )
+
+         valid = []
+         for ex in ds:
+             src = ex[self.config["source_col"]]
+             tgt = ex[self.config["target_col"]]
+             if src and tgt and src != "<eos>" and tgt != "<eos>" and src.strip() and tgt.strip():
+                 valid.append((src.strip(), tgt.strip()))
+
+         if self.max_samples:
+             valid = valid[:self.max_samples]
+
+         print(f"Loaded {len(valid)} transliteration pairs.")
+         return valid
+
+     def evaluate(self, model, tokenizer, device: str = "cuda") -> Dict[str, Any]:
+         pairs = self.load_dataset()
+         if not pairs:
+             raise ValueError("No valid transliteration pairs found!")
+
+         sources, targets = zip(*pairs)
+         sources, targets = list(sources), list(targets)
+
+         # Build target vocab
+         unique_targets = sorted(set(targets))
+         target_to_id = {t: i for i, t in enumerate(unique_targets)}
+
+         # Encode targets
+         target_enc = tokenizer(
+             unique_targets,
+             padding=True,
+             truncation=True,
+             max_length=32,
+             return_tensors="pt"
+         ).to(device)
+
+         with torch.no_grad():
+             target_embeds = model(**target_enc).last_hidden_state[:, 0]
+
+         # Predict
+         predictions = []
+         batch_size = 32
+         for i in range(0, len(sources), batch_size):
+             batch = sources[i:i+batch_size]
+             inputs = tokenizer(
+                 batch,
+                 padding=True,
+                 truncation=True,
+                 max_length=32,
+                 return_tensors="pt"
+             ).to(device)
+
+             with torch.no_grad():
+                 src_embeds = model(**inputs).last_hidden_state[:, 0]
+                 logits = torch.matmul(src_embeds, target_embeds.T)
+                 preds = logits.argmax(dim=1).cpu().tolist()
+                 predictions.extend(preds)
+
+         true_labels = [target_to_id[t] for t in targets]
+         acc = accuracy_score(true_labels, predictions)
+
+         print(f"✅ Transliteration Accuracy: {acc:.4f}")
+         return {
+             "task": self.task_name,
+             "main_metric": acc,
+             "accuracy": acc,
+             "total_samples": len(pairs)
+         }
src/evaluators/tsac.py ADDED
@@ -0,0 +1,133 @@
+ import torch
+ from torch.utils.data import DataLoader
+ from datasets import load_dataset
+ import traceback
+
+
+ def evaluate_tsac_sentiment(model, tokenizer, device):
+     """Evaluate model on the TSAC sentiment analysis task."""
+     try:
+         print("\n=== Starting TSAC sentiment evaluation ===")
+         print(f"Current device: {device}")
+
+         # Load and preprocess dataset
+         print("\nLoading and preprocessing TSAC dataset...")
+         dataset = load_dataset("fbougares/tsac", split="test", trust_remote_code=True)
+         dataset = dataset.select(range(10))  # evaluate on a 10-sample subset to keep runs fast
+
+         def preprocess(examples):
+             return tokenizer(
+                 examples['sentence'],
+                 padding=True,
+                 truncation=True,
+                 max_length=512,
+                 return_tensors=None
+             )
+
+         print(dataset.column_names)
+         dataset = dataset.map(preprocess, batched=True)
+         dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'target'])
+
+         # Check first example
+         first_example = dataset[0]
+         print("\nFirst example details:")
+         print(f"Input IDs shape: {first_example['input_ids'].shape}")
+         print(f"Attention mask shape: {first_example['attention_mask'].shape}")
+         print(f"Target: {first_example['target']}")
+
+         model.eval()
+         print(f"\nModel class: {model.__class__.__name__}")
+         print(f"Model device: {next(model.parameters()).device}")
+
+         with torch.no_grad():
+             predictions = []
+             targets = []
+
+             # Custom collate function to batch padded tensors and labels
+             def collate_fn(batch):
+                 input_ids = torch.stack([sample['input_ids'] for sample in batch])
+                 attention_mask = torch.stack([sample['attention_mask'] for sample in batch])
+                 targets = torch.stack([sample['target'] for sample in batch])
+                 return {
+                     'input_ids': input_ids,
+                     'attention_mask': attention_mask,
+                     'target': targets
+                 }
+
+             dataloader = DataLoader(
+                 dataset,
+                 batch_size=16,
+                 shuffle=False,
+                 collate_fn=collate_fn
+             )
+
+             for i, batch in enumerate(dataloader):
+                 if i % 10 == 0:
+                     print(f"\nProcessing batch {i}...")
+                     print(f"Batch keys: {list(batch.keys())}")
+                     print(f"Target shape: {batch['target'].shape}")
+
+                 inputs = {k: v.to(device) for k, v in batch.items() if k != 'target'}
+                 target = batch['target'].to(device)
+                 outputs = model(**inputs)
+
+                 # Handle different model output formats
+                 if isinstance(outputs, dict):
+                     if 'logits' in outputs:
+                         logits = outputs['logits']
+                     elif 'prediction_logits' in outputs:
+                         logits = outputs['prediction_logits']
+                     else:
+                         raise ValueError(f"Unknown output format. Available keys: {list(outputs.keys())}")
+                 elif isinstance(outputs, tuple):
+                     logits = outputs[0]
+                 else:
+                     logits = outputs
+
+                 # For sequence classification, use the [CLS] token's prediction
+                 if len(logits.shape) == 3:  # [batch_size, sequence_length, num_classes]
+                     logits = logits[:, 0, :]
+
+                 batch_predictions = logits.argmax(dim=-1).cpu().tolist()
+                 batch_targets = target.cpu().tolist()
+
+                 predictions.extend(batch_predictions)
+                 targets.extend(batch_targets)
+
+                 if i % 10 == 0:
+                     print(f"Predictions: {batch_predictions[:5]}")
+                     print(f"Targets: {batch_targets[:5]}")
+
+         print(f"\nTotal predictions: {len(predictions)}")
+         print(f"Total targets: {len(targets)}")
+
+         # Calculate accuracy
+         correct = sum(p == t for p, t in zip(predictions, targets))
+         total = len(predictions)
+         accuracy = correct / total if total > 0 else 0.0
+
+         print("\nEvaluation results:")
+         print(f"Correct predictions: {correct}")
+         print(f"Total predictions: {total}")
+         print(f"Accuracy: {accuracy:.4f}")
+
+         return {"fbougares/tsac": accuracy}
+     except Exception as e:
+         print(f"\n=== Error in TSAC evaluation: {str(e)} ===")
+         print(f"Full traceback: {traceback.format_exc()}")
+         raise e
src/{evaluator → evaluators}/tunisian_corpus_coverage.py RENAMED
File without changes
src/submission/submit.py CHANGED
@@ -12,7 +12,7 @@ from src.submission.check_validity import (
      get_model_size,
      is_model_on_hub,
  )
- from src.evaluator.evaluate import EvaluationStatus
+ from src.evaluators.evaluate import EvaluationStatus
 
 
  REQUESTED_MODELS = None