---
license: mit
language:
- fa
metrics:
- accuracy
- f1
base_model:
- HooshvareLab/bert-base-parsbert-uncased
pipeline_tag: text-classification
library_name: transformers
---
|
|
# Model Card for aref-j/emotion-classifier-bert-fa-v1 |
|
|
|
|
This is a fine-tuned BERT model that classifies Persian text into six emotion categories: ANGRY, FEAR, HAPPY, HATE, SAD, and SURPRISE. It was developed on a merged dataset of Persian emotion corpora and is suited to applications such as emotion analysis of Persian tweets.
|
|
|
|
|
## Model Details |
|
|
### Model Description |
|
|
|
|
This model is a fine-tuned version of ParsBERT (HooshvareLab/bert-base-parsbert-uncased) for emotion classification in Persian text. It uses a BERT base architecture with a sequence classification head to predict one of six emotion labels from input text. The model addresses class imbalance through weighted cross-entropy loss and was trained on a combined dataset of Persian tweets and short texts. |
|
|
- **Developed by:** Aref Jafary |
|
|
- **Model type:** Text classification (fine-tuned BERT) |
|
|
- **Language(s) (NLP):** Persian (fa) |
|
|
- **License:** MIT |
|
|
- **Finetuned from model:** HooshvareLab/bert-base-parsbert-uncased |
|
|
### Model Sources |
|
|
|
|
- **Repository:** https://github.com/ArefJafary/Persian-Emotion-Classification-BERT |
|
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
Use the code below to get started with the model. |
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "aref-j/emotion-classifier-bert-fa-v1"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create the classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage ("What beautiful weather today")
result = classifier("چه هوای زیبایی امروز است")
print(result)  # e.g. [{'label': 'HAPPY', 'score': 0.99}]
```
|
|
## Training Details |
|
|
### Training Data |
|
|
|
|
The model was trained on a merged dataset from three Persian emotion corpora: |
|
|
- **ArmanEmo**: Over 7,000 Persian sentences labeled for 7 emotions. [GitHub](https://github.com/Arman-Rayan-Sharif/arman-text-emotion)

- **EmoPars**: 30,000 Persian tweets labeled with 6 basic emotions (Anger, Fear, Happiness, Sadness, Hatred, Wonder). [GitHub](https://github.com/nazaninsbr/Persian-Emotion-Detection)

- **ShortPersianEmo**: 5,472 short Persian texts labeled for 5 emotions (angry, sad, fear, happy, neutral). [GitHub](https://github.com/vkiani/ShortPersianEmo)
|
|
|
|
|
Datasets were standardized, cleaned (normalization with Parsivar, removal of URLs, mentions, emojis, etc.), deduplicated, and split into 90% train / 10% validation, with ArmanEmo held out for testing. |
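The deduplicate-and-split step described above can be sketched in pure Python. This is an illustrative sketch with toy records, not the exact pipeline code:

```python
import random

# Toy stand-in for the merged, cleaned corpus: (text, label) pairs
records = [
    ("text a", "ANGRY"), ("text b", "HAPPY"),
    ("text b", "HAPPY"),   # duplicate to be removed
    ("text c", "SAD"), ("text d", "FEAR"), ("text e", "HATE"),
    ("text f", "SURPRISE"), ("text g", "HAPPY"), ("text h", "SAD"),
    ("text i", "ANGRY"), ("text j", "FEAR"),
]

# Deduplicate on the text field, keeping the first occurrence
seen = set()
deduped = [r for r in records if r[0] not in seen and not seen.add(r[0])]

# Shuffle, then split 90% train / 10% validation
random.seed(0)
random.shuffle(deduped)
cut = int(0.9 * len(deduped))
train, val = deduped[:cut], deduped[cut:]
```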
|
|
### Training Procedure |
|
|
|
|
#### Preprocessing |
|
|
Text was normalized using Parsivar, with character mapping, diacritic removal, and stripping of URLs, mentions, hashtags, emojis, punctuation, digits, and extra spaces. Multi-label instances in EmoPars were converted to single-label via dominant label. |
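The stripping steps can be approximated with regular expressions. The sketch below is illustrative only: it omits the Parsivar normalization, character mapping, diacritic removal, and emoji handling, and is not the repository's actual cleaning code:

```python
import re

def clean_text(text: str) -> str:
    """Rough approximation of the cleaning steps: strip URLs, mentions,
    hashtags, punctuation, digits, and extra whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\S+", " ", text)                # mentions and hashtags
    text = re.sub(r"[^\w\s]|_", " ", text)              # punctuation
    text = re.sub(r"[0-9۰-۹]", " ", text)               # Latin and Persian digits
    return re.sub(r"\s+", " ", text).strip()            # collapse extra spaces

# "Hi @user, see this link: https://example.com !!"
print(clean_text("سلام @user این لینک را ببین: https://example.com !!"))
```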
|
|
#### Training Hyperparameters |
|
|
- **Training regime:** fp32 (assumed, not specified)

- **Batch size:** 32

- **Epochs:** 6

- **Learning rate:** 1e-5

- **Optimizer:** Not specified (Hugging Face Trainer default, AdamW)

- **Loss:** Weighted cross-entropy to handle class imbalance

- **Early stopping:** After 2 epochs without validation loss improvement
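A common way to derive the per-class weights for a weighted cross-entropy loss is inverse class frequency. The exact weighting scheme used for this model is not specified; the following is a minimal sketch of the inverse-frequency variant:

```python
from collections import Counter

labels = ["HAPPY", "HAPPY", "HAPPY", "SAD", "SAD", "FEAR"]  # toy label list

counts = Counter(labels)
n, k = len(labels), len(counts)

# Inverse-frequency weights: the rarest class (FEAR) gets the largest weight
weights = {label: n / (k * c) for label, c in counts.items()}
```

These weights would then be passed to the loss function (e.g. `torch.nn.CrossEntropyLoss(weight=...)`) so that errors on underrepresented classes are penalized more heavily.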
|
|
|
|
|
## Evaluation |
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
#### Testing Data |
|
|
|
|
Held-out ArmanEmo test set. |
|
|
#### Factors |
|
|
|
|
Evaluation disaggregated by emotion classes (ANGRY, FEAR, HAPPY, HATE, SAD, SURPRISE). |
|
|
#### Metrics |
|
|
|
|
Accuracy (overall correct predictions), Macro F1-score (average F1 across classes, treating all equally), Precision, Recall, and Confusion Matrix. |
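For reference, accuracy and macro F1 can be computed from predictions as follows. This self-contained sketch mirrors what scikit-learn's `accuracy_score` and `f1_score(average="macro")` compute:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Per-class F1 averaged with equal weight for every class."""
    f1s = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["HAPPY", "SAD", "HAPPY", "ANGRY"]
y_pred = ["HAPPY", "SAD", "SAD", "ANGRY"]
print(accuracy(y_true, y_pred))  # 0.75
```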
|
|
### Results |
|
|
- Test Accuracy: 70.88% |
|
|
- Macro F1-Score: 66.35% |
|
|
|
|
|
Detailed per-class metrics and confusion matrix available in the repository. |
|
|
|
|
|
|
|
|
## Citation

**BibTeX:**
|
|
```
@misc{jafary2023persianemotion,
  author       = {Aref Jafary},
  title        = {Persian Emotion Classification with BERT},
  year         = {2023},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ArefJafary/Persian-Emotion-Classification-BERT}}
}
```
|
|
|
|
|
**APA:** |
|
|
Jafary, A. (2023). Persian Emotion Classification with BERT [Repository]. GitHub. https://github.com/ArefJafary/Persian-Emotion-Classification-BERT |
|
|
## Glossary |
|
|
|
|
- **ParsBERT**: A BERT model pre-trained on Persian text. |
|
|
- **Weighted Cross-Entropy**: Loss function that assigns higher weights to underrepresented classes. |
|
|
|
|
|
## Model Card Contact |
|
|
Contact via GitHub: https://github.com/ArefJafary |