---
license: mit
language:
- fa
metrics:
- accuracy
- f1
base_model:
- HooshvareLab/bert-base-parsbert-uncased
pipeline_tag: text-classification
library_name: transformers
---
# Model Card for aref-j/emotion-classifier-bert-fa-v1
<!-- Provide a quick summary of what the model is/does. -->
This is a fine-tuned BERT model for classifying emotions in Persian text, specifically detecting 6 emotion categories: ANGRY, FEAR, HAPPY, HATE, SAD, SURPRISE. It was developed using a merged dataset of Persian emotion corpora and is designed for applications like sentiment analysis on Persian tweets.
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This model is a fine-tuned version of ParsBERT (HooshvareLab/bert-base-parsbert-uncased) for emotion classification in Persian text. It uses a BERT base architecture with a sequence classification head to predict one of six emotion labels from input text. The model addresses class imbalance through weighted cross-entropy loss and was trained on a combined dataset of Persian tweets and short texts.
- **Developed by:** Aref Jafary
- **Model type:** Text classification (fine-tuned BERT)
- **Language(s) (NLP):** Persian (fa)
- **License:** MIT
- **Finetuned from model:** HooshvareLab/bert-base-parsbert-uncased
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/ArefJafary/Persian-Emotion-Classification-BERT
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
model_name = "aref-j/emotion-classifier-bert-fa-v1"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create the classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
# Example usage (the input means "What beautiful weather it is today")
result = classifier("چه هوای زیبایی امروز است")
print(result)  # e.g. [{'label': 'HAPPY', 'score': 0.99}]
```
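To see a score for every emotion class instead of only the top label, the standard `top_k=None` option of the `transformers` text-classification pipeline can be used (this is generic pipeline behavior, not something specific to this model):
```python
# Return scores for all six labels, not just the highest-scoring one
classifier_all = pipeline(
    "text-classification", model=model, tokenizer=tokenizer, top_k=None
)
print(classifier_all("چه هوای زیبایی امروز است"))  # one score per emotion label
```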
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The model was trained on a merged dataset from three Persian emotion corpora:
- **ArmanEmo**: Over 7,000 Persian sentences labeled for 7 emotions. [GitHub](https://github.com/Arman-Rayan-Sharif/arman-text-emotion)
- **EmoPars**: 30,000 Persian tweets labeled with 6 basic emotions (Anger, Fear, Happiness, Sadness, Hatred, Wonder). [GitHub](https://github.com/nazaninsbr/Persian-Emotion-Detection)
- **ShortPersianEmo**: 5,472 short Persian texts labeled for 5 emotions (angry, sad, fear, happy, neutral). [GitHub](https://github.com/vkiani/ShortPersianEmo)
Datasets were standardized to a shared label scheme, cleaned (normalization with Parsivar; removal of URLs, mentions, emojis, etc.), deduplicated, and split into 90% train / 10% validation, with the ArmanEmo test set held out for final evaluation.
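As a rough illustration, the deduplication and split could look like the sketch below. The file names and the `text`/`label` column names are hypothetical, and the use of a stratified scikit-learn split is an assumption, not the published training code:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file names; each frame is assumed to hold "text" and "label" columns
frames = [pd.read_csv(f) for f in ("armanemo.csv", "emopars.csv", "shortpersianemo.csv")]
merged = pd.concat(frames, ignore_index=True)

# Remove exact duplicate texts across the merged corpora
merged = merged.drop_duplicates(subset="text").reset_index(drop=True)

# 90% train / 10% validation, stratified on the emotion label
train_df, val_df = train_test_split(
    merged, test_size=0.1, stratify=merged["label"], random_state=42
)
```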
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing
Text was normalized using Parsivar, with character mapping, diacritic removal, and stripping of URLs, mentions, hashtags, emojis, punctuation, digits, and extra spaces. Multi-label instances in EmoPars were reduced to single labels by keeping each tweet's dominant emotion.
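A hedged sketch of this cleaning step follows; the exact patterns used in training are not published, so the regexes and the `clean_text` helper below are illustrative only:
```python
import re
from parsivar import Normalizer

normalizer = Normalizer()  # Parsivar handles character mapping and normalization

def clean_text(text: str) -> str:
    text = normalizer.normalize(text)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\S+", " ", text)                # mentions and hashtags
    text = re.sub(r"[\u064B-\u065F\u0670]", "", text)   # Arabic/Persian diacritics
    text = re.sub(r"[^\w\s]|\d", " ", text)             # emojis, punctuation, digits
    return re.sub(r"\s+", " ", text).strip()            # collapse extra whitespace
```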
#### Training Hyperparameters
- **Training regime:** fp32 (assumed; not specified)
- **Batch size:** 32
- **Epochs:** 6
- **Learning rate:** 1e-5
- **Optimizer:** not specified (the Hugging Face Trainer default is AdamW)
- **Loss:** weighted cross-entropy to handle class imbalance (see the sketch after this list)
- **Early stopping:** after 2 epochs without validation-loss improvement
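A weighted loss is most often implemented by overriding `Trainer.compute_loss`; the following is a minimal sketch under that assumption. The `WeightedLossTrainer` name and the inverse-frequency weighting are illustrative, not the published training code:
```python
import torch
from torch import nn
from transformers import Trainer, EarlyStoppingCallback

# Hypothetical inverse-frequency weights for the six classes:
# class_counts = torch.tensor([...], dtype=torch.float)
# class_weights = class_counts.sum() / (len(class_counts) * class_counts)

class WeightedLossTrainer(Trainer):
    """Trainer that applies per-class weights in the cross-entropy loss."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# Early stopping as described above (patience of 2 evaluations):
# trainer = WeightedLossTrainer(class_weights=class_weights, ...,
#                               callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
```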
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
Held-out ArmanEmo test set.
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
Evaluation disaggregated by emotion classes (ANGRY, FEAR, HAPPY, HATE, SAD, SURPRISE).
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
Accuracy (overall correct predictions), Macro F1-score (average F1 across classes, treating all equally), Precision, Recall, and Confusion Matrix.
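For reference, these metrics can be computed with scikit-learn in a `compute_metrics` hook; this is a sketch, not necessarily the exact evaluation code from the repository:
```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Macro averaging treats all six emotion classes equally, regardless of frequency
    precision, recall, macro_f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": macro_f1,
        "macro_precision": precision,
        "macro_recall": recall,
    }
```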
### Results
- Test Accuracy: 70.88%
- Macro F1-Score: 66.35%
Detailed per-class metrics and the confusion matrix are available in the repository.
## Citation
**BibTeX:**
```bibtex
@misc{jafary2023persianemotion,
author = {Aref Jafary},
title = {Persian Emotion Classification with BERT},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ArefJafary/Persian-Emotion-Classification-BERT}}
}
```
**APA:**
Jafary, A. (2023). Persian Emotion Classification with BERT [Repository]. GitHub. https://github.com/ArefJafary/Persian-Emotion-Classification-BERT
## Glossary
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
- **ParsBERT**: A BERT model pre-trained on Persian text.
- **Weighted Cross-Entropy**: Loss function that assigns higher weights to underrepresented classes.
## Model Card Contact
Contact via GitHub: https://github.com/ArefJafary