---
license: cc-by-4.0
datasets:
- DSL-13-SRMAP/TeSent_Benchmark-Dataset
language:
- te
---

# Multilingual Sentiment Classification & Explanation Pipeline

This repository provides a full pipeline for training, tuning, and evaluating multilingual sentiment classification models (with a focus on Telugu text and Indian languages) using both standard and rationale-supervised approaches. The pipeline employs human-annotated rationales and the FERRET framework to assess model explanations for both **faithfulness** and **plausibility**.

---

## Table of Contents

- [Project Overview](#project-overview)
- [Dataset Format](#dataset-format)
- [Model Selection](#model-selection)
- [Pipeline Steps](#pipeline-steps)
  - [1. Hyperparameter Tuning](#1-hyperparameter-tuning)
  - [2. Model Training](#2-model-training)
  - [3. FERRET Faithfulness Evaluation](#3-ferret-faithfulness-evaluation)
  - [4. FERRET Plausibility Evaluation](#4-ferret-plausibility-evaluation)
- [Metric Aggregation](#metric-aggregation)
- [How to Run](#how-to-run)
- [Outputs](#outputs)
- [Citation](#citation)
- [Contact](#contact)

---

## Project Overview

This pipeline supports:

- **Hyperparameter tuning** for both attention-supervised (with rationale) and standard (without rationale) models.
- **Model training** for both approaches.
- **Faithfulness evaluation** using FERRET, to measure how well explanations justify model predictions.
- **Plausibility evaluation** using FERRET, to measure how closely model explanations align with human rationales.
- **Metric aggregation** for reporting in papers, using annotator-wise and sentence-wise averages.

---

## Dataset Format

The dataset must be in CSV format, with the following columns:

| Content | Annotations | Rationale | Label |
|---------|-------------|-----------|-------|
| Text (Telugu/Indian languages) | Each annotator's sentiment label (pipe-separated) | One rationale per annotator (pipe-separated); each rationale is a comma-separated list of spans | Final (gold) label |

**Example:**

| Content | Annotations | Rationale | Label |
|---------|-------------|-----------|-------|
| గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క | Positive\|Positive\|Neutral | గేలుపు,దీశగా,అదరగొట్టిన\|గేలుపు\| | Positive |

In this example, the first annotator marked three rationale tokens, the second marked one, and the third (who labeled the sentence Neutral) marked none; the final label is Positive. A minimal parsing sketch follows below.
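For reference, the pipe- and comma-separated fields can be split into per-annotator structures with pandas. This is a minimal sketch, not code from the repository; the file name `train.csv` and the helper `parse_row` are illustrative:

```python
import pandas as pd

def parse_row(row):
    """Split the pipe-separated annotation fields into per-annotator lists."""
    labels = row["Annotations"].split("|")           # one sentiment label per annotator
    rationales = [
        [span for span in r.split(",") if span]      # comma-separated spans; empty -> no rationale
        for r in row["Rationale"].split("|")
    ]
    return labels, rationales

df = pd.read_csv("train.csv")  # illustrative path
labels, rationales = parse_row(df.iloc[0])
# For the example row above:
#   labels     -> ["Positive", "Positive", "Neutral"]
#   rationales -> [["గేలుపు", "దీశగా", "అదరగొట్టిన"], ["గేలుపు"], []]
```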
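Both rationale-supervised training (Step 2) and plausibility evaluation (Step 4) use each rationale as a token-level vector over the sentence. The sketch below assumes simple whitespace tokenization for illustration; the repository's scripts may align rationales to subword tokens instead:

```python
def rationale_vector(text: str, spans: list[str]) -> list[int]:
    """1 for tokens the annotator marked as rationale, 0 otherwise.

    Whitespace tokenization is an assumption made for illustration only.
    """
    tokens = text.split()
    return [1 if tok in spans else 0 for tok in tokens]

text = "గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క"
print(rationale_vector(text, ["గేలుపు", "దీశగా", "అదరగొట్టిన"]))
# -> [1, 1, 0, 1, 0]
```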
---

## Model Selection

Models considered for training and evaluation:

1. **bert-base-multilingual-cased** (used for tuning and baseline)
2. **ai4bharat/IndicBERTv2-MLM-only**
3. **google/muril-base-cased**
4. **FacebookAI/xlm-roberta-base**
5. **l3cube-pune/telugu-bert**

---

## Pipeline Steps

### 1. Hyperparameter Tuning

**Scripts:**
- With rationale: `hyperparameter_tuning_for_rationale.py`
- Without rationale: `hyperparameter_tuning_without_rationale.py`

- Grid search over learning rate, batch size, and (for rationale models) the rationale loss weight (`lambda`; an illustrative loss sketch appears at the end of this README).
- Conducted separately for models trained **with** and **without** human rationale supervision.
- Results are saved as CSVs with detailed metrics for each configuration.

### 2. Model Training

**Scripts:**
- With rationale: `model_training_with_rationale.py`
- Without rationale: `model_training_without_rationale.py`

- Trains models using the hyperparameters selected during tuning.
- Both approaches (with and without rationale supervision) are supported.
- Trained models and tokenizers are saved for downstream evaluation.

### 3. FERRET Faithfulness Evaluation

**Script:** `ferret_faithfullness.py`
**Input:** Predictions and explanations from trained models.

- Runs model prediction on the test set.
- Retains only "matched" samples (where the prediction equals the ground-truth label).
- Generates and evaluates FERRET explanations for faithfulness:
  - Faithfulness metrics reflect how well the explanation supports the model's own prediction.
- **Metric aggregation:** the average of each faithfulness metric **over all sentences** gives the value reported in papers.

**Output:** `_ferret_matched.csv` (faithfulness metrics per sentence).

### 4. FERRET Plausibility Evaluation

**Script:** `ferret_plausibility.py`
**Input:** Output file from Step 3 (`_ferret_matched.csv`).

- For each matched sample:
  - Generates attention vectors from human rationales (one per annotator; cf. the rationale-vector sketch under *Dataset Format*).
  - Evaluates FERRET explanations for plausibility against each annotator's rationale, using metrics such as AUPRC, token-wise F1, and IoU.
- **Metric aggregation:**
  - For each metric, the average **over all annotators and all sentences** is computed.
  - These averages are the plausibility scores presented in papers.

**Output:** `_ferret_plausibility.csv` (plausibility metrics per sentence and annotator).

---

## Metric Aggregation

- **Faithfulness metrics:**
  - For each metric in `_ferret_matched.csv`, compute the average **across all sentences**.
  - These are reported as the overall faithfulness scores.
- **Plausibility metrics:**
  - For each metric in `_ferret_plausibility.csv`, compute the average **across all annotators and all sentences**.
  - These are reported as the overall plausibility scores (per metric).

---

## How to Run

1. **Prepare the dataset:** Format train, validation, and test CSVs as described above.
2. **Add the emoji vocabulary:** Place `emoji.csv` in the project root.
3. **Hyperparameter tuning:**
   ```bash
   python hyperparameter_tuning_for_rationale.py
   python hyperparameter_tuning_without_rationale.py
   ```
4. **Train final models:**
   ```bash
   python model_training_with_rationale.py
   python model_training_without_rationale.py
   ```
5. **FERRET faithfulness evaluation:**
   ```bash
   python ferret_faithfullness.py
   ```
6. **FERRET plausibility evaluation:**
   ```bash
   python ferret_plausibility.py
   ```

*Edit the script configs (model names, paths, batch sizes) as needed.*

---

## Outputs

- **Hyperparameter tuning results:** `grid_results_detailed.csv`
- **Model training:** Model weights, tokenizer, and metric CSVs.
- **Faithfulness metrics:** `_ferret_matched.csv`
- **Plausibility metrics:** `_ferret_plausibility.csv`
- **Test metrics & predictions:** `overall_test_metrics.csv`, `labelwise_test_metrics.csv`, `test_predictions.csv`, `confusion_matrix.csv`, `confusion_matrix.png`
- **Metric averages:** Compute using the provided scripts or pandas for reporting (see the sketch below).

---
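For reporting, the averages described under *Metric Aggregation* can be computed directly with pandas. A minimal sketch, assuming one column per metric in each output CSV; the column names below are assumptions based on the metrics named above, not the scripts' actual headers:

```python
import pandas as pd

# Faithfulness: average each metric over all sentences.
faith = pd.read_csv("_ferret_matched.csv")
faith_cols = ["aopc_comprehensiveness", "aopc_sufficiency", "taucorr_loo"]  # assumed names
print(faith[faith_cols].mean())

# Plausibility: average each metric over all annotators and all sentences,
# i.e. the mean over every (sentence, annotator) row.
plaus = pd.read_csv("_ferret_plausibility.csv")
plaus_cols = ["auprc", "token_f1", "token_iou"]  # assumed names
print(plaus[plaus_cols].mean())
```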
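For completeness, here is a sketch of how a rationale loss weight such as `lambda` (Step 1) typically combines the classification loss with attention supervision. This is illustrative only; the repository's training scripts define the actual loss:

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, attn, rationale_vec, lam=0.5):
    """Cross-entropy classification loss plus a lambda-weighted attention term.

    Assumptions for illustration: `attn` is the model's token-attention
    distribution of shape (batch, seq_len), and `rationale_vec` is a binary
    human-rationale mask of the same shape (see the rationale-vector sketch
    under *Dataset Format*). The repository's formulation may differ.
    """
    cls_loss = F.cross_entropy(logits, labels)
    # Normalize the rationale mask into a target distribution over tokens.
    target = rationale_vec / rationale_vec.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    attn_loss = F.kl_div((attn + 1e-8).log(), target, reduction="batchmean")
    return cls_loss + lam * attn_loss
```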