---
license: cc-by-4.0
datasets:
- DSL-13-SRMAP/TeSent_Benchmark-Dataset
language:
- te
---
# Multilingual Sentiment Classification & Explanation Pipeline
This repository provides a full pipeline for training, tuning, and evaluating multilingual sentiment classification models (with a focus on Telugu text and Indian languages) using both standard and rationale-supervised approaches. The pipeline employs human-annotated rationales and the FERRET framework to assess model explanations for both **faithfulness** and **plausibility**.
---
## Table of Contents
- [Project Overview](#project-overview)
- [Dataset Format](#dataset-format)
- [Model Selection](#model-selection)
- [Pipeline Steps](#pipeline-steps)
- [1. Hyperparameter Tuning](#1-hyperparameter-tuning)
- [2. Model Training](#2-model-training)
- [3. FERRET Faithfulness Evaluation](#3-ferret-faithfulness-evaluation)
- [4. FERRET Plausibility Evaluation](#4-ferret-plausibility-evaluation)
- [Metric Aggregation](#metric-aggregation)
- [How to Run](#how-to-run)
- [Outputs](#outputs)
- [Citation](#citation)
- [Contact](#contact)
---
## Project Overview
This pipeline supports:
- **Hyperparameter tuning** for both attention-supervised (with rationale) and standard (without rationale) models.
- **Model training** for both approaches.
- **Faithfulness evaluation** using FERRET to measure how well explanations justify model predictions.
- **Plausibility evaluation** using FERRET to measure how closely model explanations align with human rationales.
- **Metric aggregation** for reporting in papers, using annotator-wise and sentence-wise averages.
---
## Dataset Format
The dataset must be in CSV format, with the following columns:
| Content | Annotations | Rationale | Label |
|---------|-------------|-----------|-------|
| Text (Telugu or another Indian language) | Annotators' sentiment labels (pipe-separated, one per annotator) | Rationale spans (pipe-separated per annotator; words within a span comma-separated) | Final (gold) sentiment label |
**Example:**
| Content | Annotations | Rationale | Label |
|---------|-------------|-----------|-------|
| గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క | Positive\|Positive\|Neutral | గేలుపు,దీశగా,అదరగొట్టిన\|గేలుపు\| | Positive |
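Reading a row under these conventions is straightforward; the sketch below assumes a training split named `train.csv` (the actual file names are whatever you pass to the scripts) and only illustrates the pipe/comma splitting shown in the table.

```python
# Hedged sketch: parse one row of the dataset format described above.
import pandas as pd

df = pd.read_csv("train.csv")  # assumed file name for the training split
row = df.iloc[0]

annotator_labels = row["Annotations"].split("|")      # e.g. ['Positive', 'Positive', 'Neutral']
annotator_rationales = [
    [w for w in field.split(",") if w]                # rationale words per annotator
    for field in row["Rationale"].split("|")
]
final_label = row["Label"]
```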
---
## Model Selection
Models considered for training and evaluation:
1. **bert-base-multilingual-cased** (used for tuning and baseline)
2. **ai4bharat/IndicBERTv2-MLM-only**
3. **google/muril-base-cased**
4. **FacebookAI/xlm-roberta-base**
5. **l3cube-pune/telugu-bert**
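All five checkpoints are standard Hugging Face encoder models and can be loaded the same way; a minimal sketch with the baseline model is shown below (`num_labels=3` is an assumption based on the Positive/Negative/Neutral labels in the dataset).

```python
# Hedged sketch: load any of the candidate checkpoints for sequence classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-multilingual-cased"  # swap in any model from the list above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
```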
---
## Pipeline Steps
### 1. Hyperparameter Tuning
**Scripts:**
- With rationale: `hyperparameter_tuning_for_rationale.py`
- Without rationale: `hyperparameter_tuning_without_rationale.py`
- Grid search over learning rate, batch size, and (for rationale models) rationale loss weight (`lambda`).
- Conducted separately for models trained **with** and **without** human rationale supervision.
- Results are saved as CSVs with detailed metrics for each configuration.
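The grid search amounts to iterating over all hyperparameter combinations, training and validating each one, and writing the metrics to a CSV. The sketch below is illustrative only: the grid values and the `train_and_evaluate` helper are placeholders, not the actual contents of the tuning scripts.

```python
# Hedged sketch of the grid search, assuming a train_and_evaluate() helper
# that trains one configuration and returns validation metrics.
import csv
import itertools

LEARNING_RATES = [1e-5, 2e-5, 3e-5]   # assumed grid values
BATCH_SIZES = [16, 32]                # assumed grid values
LAMBDAS = [0.1, 0.5, 1.0]             # rationale loss weight (rationale models only)

def train_and_evaluate(lr, batch_size, lam):
    # Placeholder: in the real scripts this trains the model with the given
    # configuration and returns validation metrics.
    return {"val_accuracy": 0.0, "val_f1": 0.0}

rows = []
for lr, bs, lam in itertools.product(LEARNING_RATES, BATCH_SIZES, LAMBDAS):
    metrics = train_and_evaluate(lr, bs, lam)
    rows.append({"lr": lr, "batch_size": bs, "lambda": lam, **metrics})

with open("grid_results_detailed.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```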
### 2. Model Training
**Scripts:**
- With rationale: `model_training_with_rationale.py`
- Without rationale: `model_training_without_rationale.py`
- Trains models using selected hyperparameters from tuning.
- Both approaches (with and without rationale supervision) are supported.
- Trained models and tokenizers are saved for downstream evaluation.
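For the rationale-supervised variant, one common formulation is to add an attention-supervision term, weighted by `lambda`, on top of the classification loss. The sketch below shows one way this joint loss might look (KL divergence between the model's `[CLS]` attention and a distribution built from the human rationale mask); the exact formulation in `model_training_with_rationale.py` may differ.

```python
# Hedged sketch of a rationale-supervised (attention-supervised) training loss.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3, output_attentions=True
)
lam = 0.5  # rationale loss weight chosen during tuning (assumed value)

def joint_loss(batch_texts, labels, rationale_masks):
    """labels: (B,) class ids; rationale_masks: (B, T) with 1.0 on rationale tokens,
    aligned to the tokenizer's output length T."""
    enc = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc, labels=labels)
    # Attention from [CLS] in the last layer, averaged over heads: shape (B, T)
    cls_attn = out.attentions[-1][:, :, 0, :].mean(dim=1)
    # Normalize the rationale mask into a target distribution over tokens
    target = rationale_masks / rationale_masks.sum(dim=-1, keepdim=True).clamp(min=1)
    attn_loss = F.kl_div(torch.log(cls_attn + 1e-12), target, reduction="batchmean")
    return out.loss + lam * attn_loss
```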
### 3. FERRET Faithfulness Evaluation
**Script:** `ferret_faithfullness.py`
**Input:** Predictions and explanations from trained models.
- Runs model prediction on the test set.
- Retains only "matched" samples (where prediction equals ground-truth label).
- Generates and evaluates FERRET explanations for faithfulness:
- Faithfulness metrics reflect how well the explanation supports the model's own prediction.
- **Metric aggregation:**
- The average of each faithfulness metric **over all sentences** gives the value reported in papers.
**Output:** `<model_name>_ferret_matched.csv` (faithfulness metrics per sentence).
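The core of this step, as documented for the `ferret` (ferret-xai) library, is the `Benchmark` wrapper around a trained model and tokenizer. The sketch below is a simplified, per-sentence illustration; paths, label ids, and the exact filtering and CSV-writing logic in `ferret_faithfullness.py` are assumptions.

```python
# Hedged sketch: faithfulness evaluation for one "matched" test sentence.
from ferret import Benchmark
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "path/to/trained_model"  # assumed path to a model saved in Step 2
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
bench = Benchmark(model, tokenizer)

text, gold_label = "...", 1  # one test sentence and its ground-truth class id
pred = model(**tokenizer(text, return_tensors="pt")).logits.argmax(-1).item()
if pred == gold_label:  # retain only "matched" samples
    explanations = bench.explain(text, target=pred)
    # Faithfulness metrics (e.g. comprehensiveness, sufficiency) per explainer
    evaluations = bench.evaluate_explanations(explanations, target=pred)
```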
### 4. FERRET Plausibility Evaluation
**Script:** `ferret_plausibility.py`
**Input:** Output file from Step 3 (`<model_name>_ferret_matched.csv`).
- For each matched sample:
- Generates attention vectors from human rationales (for each annotator).
- Evaluates FERRET explanations for plausibility against each annotator's rationale using metrics such as AUPRC, token-wise F1, and IoU.
- **Metric aggregation:**
- For each metric, average **over all annotators and all sentences** is computed.
- These averages are the plausibility scores presented in papers.
**Output:** `<model_name>_ferret_plausibility.csv` (plausibility metrics per sentence and annotator).
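A key ingredient here is turning each annotator's rationale field into a binary vector over the model's tokens, which can then be compared against the explanation scores for AUPRC, token-wise F1, and IoU. The sketch below uses a simple word-to-subword matching heuristic; the actual alignment logic in `ferret_plausibility.py` may differ.

```python
# Hedged sketch: build per-annotator binary rationale vectors for one row.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

content = "గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క"
rationale_col = "గేలుపు,దీశగా,అదరగొట్టిన|గేలుపు|"  # one field per annotator

tokens = tokenizer.tokenize(content)
for annotator_field in rationale_col.split("|"):
    rationale_words = [w for w in annotator_field.split(",") if w]
    # Mark a token as rationale if it appears in the tokenization of any rationale word
    rationale_tokens = set()
    for word in rationale_words:
        rationale_tokens.update(tokenizer.tokenize(word))
    human_rationale = [1 if tok in rationale_tokens else 0 for tok in tokens]
    # human_rationale is then compared with the FERRET explanation scores to
    # compute the plausibility metrics (AUPRC, token-wise F1, IoU).
```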
---
## Metric Aggregation
- **Faithfulness Metrics:**
- For each metric in `<model_name>_ferret_matched.csv`, compute the average **across all sentences**.
- These are reported as overall faithfulness scores.
- **Plausibility Metrics:**
- For each metric in `<model_name>_ferret_plausibility.csv`, compute the average **across all annotators and all sentences**.
- These are reported as overall plausibility scores (per metric).
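Since each row of the faithfulness CSV is one sentence and each row of the plausibility CSV is one sentence-annotator pair, both aggregations reduce to a column-wise mean. A minimal pandas sketch, assuming the metric columns are numeric and using the mBERT file names as an example:

```python
# Hedged sketch of the metric aggregation described above.
import pandas as pd

faithfulness = pd.read_csv("bert-base-multilingual-cased_ferret_matched.csv")
plausibility = pd.read_csv("bert-base-multilingual-cased_ferret_plausibility.csv")

# Faithfulness: mean of each metric over all sentences.
print(faithfulness.mean(numeric_only=True))

# Plausibility: mean of each metric over all annotators and all sentences
# (each row is one sentence-annotator pair, so a plain mean covers both).
print(plausibility.mean(numeric_only=True))
```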
---
## How to Run
1. **Prepare dataset:** Format train, validation, and test CSVs as described above.
2. **Add emoji vocabulary:** Place `emoji.csv` in the project root.
3. **Hyperparameter tuning:**
```bash
python hyperparameter_tuning_for_rationale.py
python hyperparameter_tuning_without_rationale.py
```
4. **Train final models:**
```bash
python model_training_with_rationale.py
python model_training_without_rationale.py
```
5. **FERRET Faithfulness evaluation:**
```bash
python ferret_faithfullness.py
```
6. **FERRET Plausibility evaluation:**
```bash
python ferret_plausibility.py
```
*Edit script configs (model names, paths, batch sizes) as needed.*
---
## Outputs
- **Hyperparameter tuning results:** `grid_results_detailed.csv`
- **Model training:** Model weights, tokenizer, and metric CSVs.
- **Faithfulness metrics:** `<model_name>_ferret_matched.csv`
- **Plausibility metrics:** `<model_name>_ferret_plausibility.csv`
- **Test metrics & predictions:** `overall_test_metrics.csv`, `labelwise_test_metrics.csv`, `test_predictions.csv`, `confusion_matrix.csv`, `confusion_matrix.png`
- **Metric averages:** Compute using provided scripts or pandas for reporting.
---