---
license: cc-by-4.0
datasets:
- DSL-13-SRMAP/TeSent_Benchmark-Dataset
language:
- te
---
# Multilingual Sentiment Classification & Explanation Pipeline

This repository provides a full pipeline for training, tuning, and evaluating multilingual sentiment classification models (with a focus on Telugu text and Indian languages) using both standard and rationale-supervised approaches. The pipeline employs human-annotated rationales and the FERRET framework to assess model explanations for both **faithfulness** and **plausibility**.

---

## Table of Contents

- [Project Overview](#project-overview)
- [Dataset Format](#dataset-format)
- [Model Selection](#model-selection)
- [Pipeline Steps](#pipeline-steps)
  - [1. Hyperparameter Tuning](#1-hyperparameter-tuning)
  - [2. Model Training](#2-model-training)
  - [3. FERRET Faithfulness Evaluation](#3-ferret-faithfulness-evaluation)
  - [4. FERRET Plausibility Evaluation](#4-ferret-plausibility-evaluation)
- [Metric Aggregation](#metric-aggregation)
- [How to Run](#how-to-run)
- [Outputs](#outputs)
- [Citation](#citation)
- [Contact](#contact)

---

## Project Overview

This pipeline supports:

- **Hyperparameter tuning** for both attention-supervised (with rationale) and standard (without rationale) models.
- **Model training** for both approaches.
- **Faithfulness evaluation** using FERRET to measure how well explanations justify model predictions.
- **Plausibility evaluation** using FERRET to measure how closely model explanations align with human rationales.
- **Metric aggregation** for reporting in papers, using annotator-wise and sentence-wise averages.

---

## Dataset Format

The dataset must be in CSV format, with the following columns:

| Content | Annotations | Rationale | Label |
|---------|-------------|-----------|-------|
| Text (Telugu or another Indian language) | Sentiment label from each annotator (pipe-separated) | Rationale tokens per annotator (pipe-separated across annotators, comma-separated within a rationale) | Final label |

**Example:**

| Content | Annotations | Rationale | Label |
|---------|-------------|-----------|-------|
| గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క | Positive\|Positive\|Neutral | గేలుపు,దీశగా,అదరగొట్టిన\|గేలుపు\| | Positive |
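
For reference, here is a minimal sketch of how one row might be parsed. This is illustrative only (assuming pandas and the column layout above; `parse_row` and the file name are not part of the pipeline scripts):

```python
import pandas as pd

def parse_row(row):
    """Split the pipe-separated annotator fields of one dataset row (illustrative helper)."""
    annotations = row["Annotations"].split("|")          # one sentiment label per annotator
    rationales = [r.split(",") if r else []              # one token list per annotator;
                  for r in row["Rationale"].split("|")]  # an empty field means no span marked
    return row["Content"], annotations, rationales, row["Label"]

df = pd.read_csv("train.csv")  # path is illustrative
text, annotations, rationales, label = parse_row(df.iloc[0])
```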

---

## Model Selection

Models considered for training and evaluation:

1. **bert-base-multilingual-cased** (used for tuning and baseline)
2. **ai4bharat/IndicBERTv2-MLM-only**
3. **google/muril-base-cased**
4. **FacebookAI/xlm-roberta-base**
5. **l3cube-pune/telugu-bert**
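
All five are standard Hugging Face checkpoints, so each loads the same way via the `transformers` Auto classes (sketch below; `num_labels=3` assumes a Positive/Negative/Neutral label set):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # any of the five checkpoints above

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=3 assumes a Positive/Negative/Neutral label set
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
```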

---

## Pipeline Steps

### 1. Hyperparameter Tuning

**Scripts:**  
- With rationale: `hyperparameter_tuning_for_rationale.py`  
- Without rationale: `hyperparameter_tuning_without_rationale.py`

- Grid search over learning rate, batch size, and (for rationale models) rationale loss weight (`lambda`); a minimal sketch of the loop follows this list.
- Conducted separately for models trained **with** and **without** human rationale supervision.
- Results are saved as CSVs with detailed metrics for each configuration.
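
The sketch below shows the shape of such a grid loop. The hyperparameter values are illustrative, and `train_and_evaluate` is a hypothetical stand-in for the actual training routine in the scripts:

```python
import itertools
import pandas as pd

def train_and_evaluate(lr, batch_size, rationale_weight):
    """Hypothetical stand-in for the training/validation routine in the scripts."""
    return {"val_macro_f1": 0.0}  # replace with real validation metrics

learning_rates = [1e-5, 2e-5, 3e-5]  # illustrative grid values
batch_sizes = [16, 32]
lambdas = [0.5, 1.0]  # rationale loss weight; only used by the rationale scripts

results = []
for lr, bs, lam in itertools.product(learning_rates, batch_sizes, lambdas):
    metrics = train_and_evaluate(lr, bs, lam)
    results.append({"lr": lr, "batch_size": bs, "lambda": lam, **metrics})

pd.DataFrame(results).to_csv("grid_results_detailed.csv", index=False)
```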

### 2. Model Training

**Scripts:**  
- With rationale: `model_training_with_rationale.py`  
- Without rationale: `model_training_without_rationale.py`

- Trains models using selected hyperparameters from tuning.
- Both approaches (with and without rationale supervision) are supported; the rationale variant adds a `lambda`-weighted attention term to the loss, sketched below.
- Trained models and tokenizers are saved for downstream evaluation.
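
The core idea of rationale (attention) supervision is a combined objective: the usual classification loss plus a `lambda`-weighted term pulling the model's attention toward the human rationale mask. A minimal sketch, assuming PyTorch and a KL-divergence attention term (the exact formulation in the scripts may differ):

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, labels, attn_weights, rationale_mask, lam=1.0):
    """Cross-entropy plus lambda-weighted attention supervision (sketch).

    attn_weights:   model attention over tokens, shape (batch, seq_len), sums to 1
    rationale_mask: human-rationale mask of the same shape (float, 0/1 entries)
    """
    cls_loss = F.cross_entropy(logits, labels)
    # Turn the binary mask into a target distribution over tokens
    target = rationale_mask / rationale_mask.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    # KL divergence pulls the model's attention toward the human rationale
    attn_loss = F.kl_div(attn_weights.clamp(min=1e-8).log(), target,
                         reduction="batchmean")
    return cls_loss + lam * attn_loss
```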

### 3. FERRET Faithfulness Evaluation

**Script:** `ferret_faithfullness.py`  
**Input:** Predictions and explanations from trained models.

- Runs model prediction on the test set.
- Retains only "matched" samples (where prediction equals ground-truth label).
- Generates and evaluates FERRET explanations for faithfulness:
  - Faithfulness metrics reflect how well the explanation supports the model's own prediction.
- **Metric aggregation:**  
  - The average of each faithfulness metric **over all sentences** gives the value reported in papers.

**Output:** `<model_name>_ferret_matched.csv` (faithfulness metrics per sentence).
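
Condensed, the step looks like the sketch below (assuming the `ferret-xai` package's `Benchmark` interface; the model path, sentence, and label id are illustrative):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from ferret import Benchmark  # pip install ferret-xai

MODEL_DIR = "path/to/trained_model"  # illustrative path
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
bench = Benchmark(model, tokenizer)

text, gold = "ఉదాహరణ వాక్యం", 1  # one test sentence and its gold label id (illustrative)

# Keep only "matched" samples: model prediction equals the ground-truth label
inputs = tokenizer(text, return_tensors="pt")
pred = model(**inputs).logits.argmax(dim=-1).item()
if pred == gold:
    explanations = bench.explain(text, target=pred)  # one explanation per explainer
    evaluations = bench.evaluate_explanations(explanations, target=pred)
```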

### 4. FERRET Plausibility Evaluation

**Script:** `ferret_plausibility.py`  
**Input:** Output file from Step 3 (`<model_name>_ferret_matched.csv`).

- For each matched sample:
  - Generates attention vectors from human rationales (for each annotator).
  - Evaluates FERRET explanations for plausibility against each annotator's rationale using metrics such as AUPRC, token-wise F1, and IoU.
- **Metric aggregation:**  
  - For each metric, the average **over all annotators and all sentences** is computed.  
  - These averages are the plausibility scores presented in papers.

**Output:** `<model_name>_ferret_plausibility.csv` (plausibility metrics per sentence and annotator).
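
The plausibility comparison reduces to scoring a continuous explanation against each annotator's binary rationale mask. A minimal sketch of the three metrics named above (assuming scikit-learn; thresholding at the mean attribution is one common choice, not necessarily what the script uses):

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

def plausibility_scores(attribution, rationale_mask):
    """Score one explanation against one annotator's binary token mask (sketch)."""
    attribution = np.abs(np.asarray(attribution, dtype=float))
    rationale_mask = np.asarray(rationale_mask, dtype=int)

    # AUPRC treats the attribution as a ranking of rationale tokens
    auprc = average_precision_score(rationale_mask, attribution)

    # Token-wise F1 and IoU need a discrete explanation
    predicted = (attribution > attribution.mean()).astype(int)
    token_f1 = f1_score(rationale_mask, predicted, zero_division=0)

    intersection = np.logical_and(predicted, rationale_mask).sum()
    union = np.logical_or(predicted, rationale_mask).sum()
    iou = float(intersection / union) if union else 0.0
    return {"auprc": auprc, "token_f1": token_f1, "iou": iou}
```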

---

## Metric Aggregation

- **Faithfulness Metrics:**  
  - For each metric in `<model_name>_ferret_matched.csv`, compute the average **across all sentences**.
  - These are reported as overall faithfulness scores.

- **Plausibility Metrics:**  
  - For each metric in `<model_name>_ferret_plausibility.csv`, compute the average **across all annotators and all sentences**.
  - These are reported as overall plausibility scores (per metric).
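
Both aggregations are one-liners with pandas (file names are illustrative; adjust to the actual CSV headers):

```python
import pandas as pd

# Faithfulness: mean of every metric column over all sentences
faith = pd.read_csv("mbert_ferret_matched.csv")       # file name is illustrative
print(faith.mean(numeric_only=True))

# Plausibility: mean over all (sentence, annotator) rows per metric
plaus = pd.read_csv("mbert_ferret_plausibility.csv")  # file name is illustrative
print(plaus.mean(numeric_only=True))
```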

---

## How to Run

1. **Prepare dataset:** Format train, validation, and test CSVs as described above.
2. **Add emoji vocabulary:** Place `emoji.csv` in the project root.
3. **Hyperparameter tuning:**
   ```bash
   python hyperparameter_tuning_for_rationale.py
   python hyperparameter_tuning_without_rationale.py
   ```
4. **Train final models:**
   ```bash
   python model_training_with_rationale.py
   python model_training_without_rationale.py
   ```
5. **FERRET Faithfulness evaluation:**
   ```bash
   python ferret_faithfullness.py
   ```
6. **FERRET Plausibility evaluation:**
   ```bash
   python ferret_plausibility.py
   ```

*Edit script configs (model names, paths, batch sizes) as needed.*

---

## Outputs

- **Hyperparameter tuning results:** `grid_results_detailed.csv`
- **Model training:** Model weights, tokenizer, and metric CSVs.
- **Faithfulness metrics:** `<model_name>_ferret_matched.csv`
- **Plausibility metrics:** `<model_name>_ferret_plausibility.csv`
- **Test metrics & predictions:** `overall_test_metrics.csv`, `labelwise_test_metrics.csv`, `test_predictions.csv`, `confusion_matrix.csv`, `confusion_matrix.png`
- **Metric averages:** Compute using provided scripts or pandas for reporting.

---