---
language:
- en
- ar
- fr
license: apache-2.0
tags:
- text-classification
- emotion-analysis
- multilingual
- transformers
- pytorch
- fine-tuned
base_model: AnasAlokla/multilingual_go_emotions_V1.2
datasets:
- go_emotions
metrics:
- accuracy
- f1
pipeline_tag: text-classification
---

# Multilingual Emotion Analysis — 5 Classes

A fine-tuned multilingual emotion classifier that detects **5 core emotions** across **English, Arabic, and French** text.

Fine-tuned from [`AnasAlokla/multilingual_go_emotions_V1.2`](https://huggingface.co/AnasAlokla/multilingual_go_emotions_V1.2) (27-class GoEmotions) by collapsing the label space to 5 psychologically grounded emotion categories.

---

## Labels

| ID | Label | Mapped from (GoEmotions) |
|----|-------|--------------------------|
| 0 | `joy` | joy, amusement, excitement, gratitude, love, optimism, pride, relief, caring |
| 1 | `sadness` | sadness, grief, disappointment, remorse |
| 2 | `anger` | anger, annoyance, disapproval |
| 3 | `fear` | fear, nervousness |
| 4 | `disgust` | disgust, embarrassment |

> Labels not listed above (admiration, confusion, curiosity, desire, realization, surprise, neutral, approval) were excluded from training as they do not map cleanly to any of the 5 target emotions.
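
The collapse in the table above can be expressed as a plain dictionary; a minimal sketch (label names as in the table, ids matching the order of the Labels table, helper name `map_label` is illustrative):

```python
# 27-class GoEmotions names -> 5-class labels (same mapping as the table above)
GO_TO_5 = {
    **{e: "joy" for e in ["joy", "amusement", "excitement", "gratitude", "love",
                          "optimism", "pride", "relief", "caring"]},
    **{e: "sadness" for e in ["sadness", "grief", "disappointment", "remorse"]},
    **{e: "anger" for e in ["anger", "annoyance", "disapproval"]},
    **{e: "fear" for e in ["fear", "nervousness"]},
    **{e: "disgust" for e in ["disgust", "embarrassment"]},
}
LABEL2ID = {"joy": 0, "sadness": 1, "anger": 2, "fear": 3, "disgust": 4}

def map_label(go_name):
    """Return the 5-class id for a GoEmotions label name, or None if unmapped."""
    mapped = GO_TO_5.get(go_name)
    return None if mapped is None else LABEL2ID[mapped]
```

Unmapped names (e.g. `neutral`, `surprise`) return `None` and are dropped, exactly as during training.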

---

## Quick Start

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="Samra02/multilingual_emotion_analysis",
    top_k=1,
)

# English
clf("I am so happy and grateful for everything!")
# [{'label': 'joy', 'score': 0.9731}]

# Arabic
clf("أنا خائف جداً مما قد يحدث")
# [{'label': 'fear', 'score': 0.9412}]

# French
clf("C'est absolument dégoûtant, je n'arrive pas à y croire.")
# [{'label': 'disgust', 'score': 0.9187}]
```

Get all 5 scores at once:

```python
clf = pipeline(
    "text-classification",
    model="Samra02/multilingual_emotion_analysis",
    top_k=5,
)
results = clf("He was furious and couldn't calm down.")
# Returns all 5 emotions ranked by confidence
```
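
Under the hood, the 5-class head produces one logit per emotion and the pipeline softmaxes them into the scores shown above. A dependency-free sketch of that post-processing step (the logit values here are made up for illustration):

```python
import math

ID2LABEL = {0: "joy", 1: "sadness", 2: "anger", 3: "fear", 4: "disgust"}

def postprocess(logits):
    """Softmax the 5 raw logits and return (top_label, top_score)."""
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# hypothetical logits where the anger unit dominates
label, score = postprocess([-1.2, 0.3, 4.1, -0.5, 0.8])
```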

---

## Model Details

| Property | Value |
|----------|-------|
| Base model | `AnasAlokla/multilingual_go_emotions_V1.2` |
| Architecture | XLM-RoBERTa (transformer encoder) |
| Task | Single-label text classification |
| Number of labels | 5 |
| Max input length | 128 tokens |
| Languages | English · Arabic · French |
| Training data | GoEmotions (simplified, mapped 27 → 5) |
| Framework | PyTorch + Hugging Face Transformers |

---

## Training Procedure

Training followed a **two-phase strategy** to avoid destroying the pretrained multilingual representations when replacing the 27-class head with a fresh 5-class head.

### Phase 1 — Head warmup (1 epoch)

- Backbone frozen, only the new 5-class classification head trained
- Learning rate: `1e-3`
- Batch size: 32

### Phase 2 — Full fine-tuning (up to 5 epochs)

- All layers unfrozen
- Learning rate: `2e-5` with cosine schedule
- Warmup ratio: `0.1`
- Weight decay: `0.01`
- Early stopping patience: 2 epochs
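
The freeze/unfreeze mechanics of the two phases can be sketched in plain PyTorch; a toy linear "backbone" and 5-class "head" stand in for the real XLM-RoBERTa model:

```python
import torch.nn as nn

# toy stand-in: "backbone" plays the role of the pretrained encoder,
# "head" the freshly initialized 5-class classifier
model = nn.ModuleDict({
    "backbone": nn.Linear(768, 768),
    "head": nn.Linear(768, 5),
})

# Phase 1: freeze the backbone so only the new head trains (high LR is safe)
for p in model["backbone"].parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
# only head.weight / head.bias remain trainable at this point

# Phase 2: unfreeze everything for full fine-tuning at a low LR
for p in model.parameters():
    p.requires_grad = True
```

The optimizer only updates parameters with `requires_grad=True`, so phase 1 cannot disturb the pretrained multilingual representations.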

### Key fix applied

The base model was trained as a **multi-label** classifier (`BCEWithLogitsLoss`). To switch to single-label classification (`CrossEntropyLoss`), the model was loaded with:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "AnasAlokla/multilingual_go_emotions_V1.2",
    num_labels=5,
    ignore_mismatched_sizes=True,
    problem_type="single_label_classification",  # critical
)
```

### Hyperparameters

```yaml
seed: 42
max_length: 128
batch_size: 32
phase1_lr: 1e-3
phase2_lr: 2e-5
phase2_epochs: 5
lr_scheduler: cosine
warmup_ratio: 0.1
weight_decay: 0.01
optimizer: AdamW
fp16: true
```

---

## Evaluation Results

Evaluated on the GoEmotions **test set** after label mapping (single-label samples only).

### Overall

| Metric | Score |
|--------|-------|
| Accuracy | 0.92 |
| F1 Macro | 0.89 |
| F1 Weighted | 0.92 |

### Per-class F1

| Emotion | F1 |
|---------|----|
| joy | 0.97 |
| sadness | 0.85 |
| anger | 0.89 |
| fear | 0.94 |
| disgust | 0.81 |
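
Macro F1 is the unweighted mean of the per-class scores, so it is easy to sanity-check against the table above:

```python
per_class_f1 = {"joy": 0.97, "sadness": 0.85, "anger": 0.89, "fear": 0.94, "disgust": 0.81}

# macro F1 = simple average over classes (each class counts equally,
# regardless of how many test samples it has)
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
# rounds to the reported 0.89
```

Weighted F1 (0.92) is higher because the frequent `joy` class also scores highest.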

---

## Dataset

Training data: **GoEmotions** (`simplified` split) — a large-scale English Reddit comment dataset annotated for 27 emotions, published by Google Research.

```python
from datasets import load_dataset
dataset = load_dataset("go_emotions", "simplified")
```

After filtering multi-label samples and dropping unmapped classes, the effective dataset sizes are approximately:

| Split | Approx. samples (after filtering) |
|-------|----------------------------------|
| Train | ~30,000 |
| Validation | ~3,700 |
| Test | ~3,700 |

**Language coverage:** GoEmotions is English-only. Arabic and French coverage comes from the multilingual pretraining of the XLM-RoBERTa backbone in the base model, which was trained on 100+ languages. Cross-lingual transfer is zero-shot for Arabic and French — no translated or native-language emotion data was added during fine-tuning.

---

## Intended Use

This model is intended for:

- Sentiment and emotion analysis in customer feedback, social media, and support tickets
- Multilingual NLP pipelines requiring a compact emotion label space
- Research on cross-lingual emotion transfer

### Out of scope

- Fine-grained emotion detection beyond 5 classes (use the 27-class base model instead)
- Languages other than English, Arabic, and French (performance not evaluated)
- Clinical, medical, or mental health applications — this model is not validated for such use cases

---

## Limitations & Bias

- **Training language:** GoEmotions is English Reddit data. The model may carry cultural and demographic biases present in that corpus.
- **Cross-lingual gap:** Arabic and French performance relies on zero-shot transfer and has not been benchmarked on native-language emotion datasets. Expect lower accuracy than English.
- **Label collapse:** Merging 27 classes into 5 loses nuance. For example, `gratitude` and `love` are both mapped to `joy`, which may not suit all applications.
- **Domain sensitivity:** Trained on social media text. Performance may degrade on formal, literary, or domain-specific text (e.g. legal, medical).

---

## How to Evaluate

```python
from transformers import pipeline
from datasets import load_dataset
from sklearn.metrics import classification_report, confusion_matrix

# Load model
clf = pipeline(
    "text-classification",
    model="Samra02/multilingual_emotion_analysis",
    top_k=1,
    device=0,  # -1 for CPU
)

# --- reproduce label mapping ---
GO_EMOTIONS_LABELS = [
    'admiration', 'amusement', 'anger', 'annoyance', 'approval',
    'caring', 'confusion', 'curiosity', 'desire', 'disappointment',
    'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear',
    'gratitude', 'grief', 'joy', 'love', 'nervousness',
    'optimism', 'pride', 'realization', 'relief', 'remorse',
    'sadness', 'surprise', 'neutral',
]
GO_TO_5 = {
    'joy': 'joy', 'amusement': 'joy', 'excitement': 'joy', 'gratitude': 'joy',
    'love': 'joy', 'optimism': 'joy', 'pride': 'joy', 'relief': 'joy', 'caring': 'joy',
    'sadness': 'sadness', 'grief': 'sadness', 'disappointment': 'sadness', 'remorse': 'sadness',
    'anger': 'anger', 'annoyance': 'anger', 'disapproval': 'anger',
    'fear': 'fear', 'nervousness': 'fear',
    'disgust': 'disgust', 'embarrassment': 'disgust',
}
TARGET_LABELS = ['joy', 'sadness', 'anger', 'fear', 'disgust']
LABEL2ID = {l: i for i, l in enumerate(TARGET_LABELS)}

raw = load_dataset("go_emotions", "simplified")["test"]

texts, true_labels = [], []
for ex in raw:
    if len(ex["labels"]) != 1:  # keep single-label samples only
        continue
    name = GO_EMOTIONS_LABELS[ex["labels"][0]]
    mapped = GO_TO_5.get(name)
    if mapped is None:  # drop classes that don't map to the 5 targets
        continue
    texts.append(ex["text"])
    true_labels.append(LABEL2ID[mapped])

# with top_k=1 the pipeline returns a one-element list of dicts per input
preds = [LABEL2ID[clf(t)[0]["label"]] for t in texts]

print(classification_report(true_labels, preds, target_names=TARGET_LABELS))
print(confusion_matrix(true_labels, preds))
```

---

## Citation

If you use this model in your work, please cite the base model and dataset:

```bibtex
@inproceedings{demszky-etal-2020-goemotions,
  title     = {{GoEmotions}: A Dataset of Fine-Grained Emotions},
  author    = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo
               and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association
               for Computational Linguistics},
  year      = {2020},
  url       = {https://aclanthology.org/2020.acl-main.372},
}
```

```bibtex
@misc{samra02-emotion5,
  author    = {Samra02},
  title     = {Multilingual Emotion Analysis (5-class)},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Samra02/multilingual_emotion_analysis},
}
```

---

## Model Files

```
Samra02/multilingual_emotion_analysis/
├── config.json # model config with id2label, problem_type
├── tokenizer_config.json
├── tokenizer.json
├── special_tokens_map.json
├── sentencepiece.bpe.model # XLM-RoBERTa sentencepiece vocab
└── model.safetensors # fine-tuned weights
```

---

*Model fine-tuned using Hugging Face Transformers and trained on Google Colab (T4 GPU).*
