---
license: mit
language:
- fa
metrics:
- accuracy
- f1
base_model:
- HooshvareLab/bert-base-parsbert-uncased
pipeline_tag: text-classification
library_name: transformers
---
|
|
# Model Card for aref-j/emotion-classifier-bert-fa-v1 |
|
|
|
|
This is a fine-tuned BERT model that classifies Persian text into six emotion categories: ANGRY, FEAR, HAPPY, HATE, SAD, and SURPRISE. It was developed on a merged dataset of Persian emotion corpora and is suited to applications such as emotion analysis of Persian tweets.
|
|
|
|
|
## Model Details |
|
|
### Model Description |
|
|
|
|
This model is a fine-tuned version of ParsBERT (HooshvareLab/bert-base-parsbert-uncased) for emotion classification in Persian text. It uses a BERT base architecture with a sequence classification head to predict one of six emotion labels from input text. The model addresses class imbalance through weighted cross-entropy loss and was trained on a combined dataset of Persian tweets and short texts. |
|
|
- **Developed by:** Aref Jafary |
|
|
- **Model type:** Text classification (fine-tuned BERT) |
|
|
- **Language(s) (NLP):** Persian (fa) |
|
|
- **License:** MIT |
|
|
- **Finetuned from model:** HooshvareLab/bert-base-parsbert-uncased |
|
|
### Model Sources |
|
|
|
|
- **Repository:** https://github.com/ArefJafary/Persian-Emotion-Classification-BERT |
|
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
Use the code below to get started with the model. |
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "aref-j/emotion-classifier-bert-fa-v1"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create the classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage ("What beautiful weather today")
result = classifier("چه هوای زیبایی امروز است")
print(result)  # e.g. [{'label': 'HAPPY', 'score': 0.99}]
```
|
|
## Training Details |
|
|
### Training Data |
|
|
|
|
The model was trained on a merged dataset from three Persian emotion corpora: |
|
|
- **ArmanEmo**: Over 7,000 Persian sentences labeled for 7 emotions. [GitHub](https://github.com/Arman-Rayan-Sharif/arman-text-emotion)

- **EmoPars**: 30,000 Persian tweets labeled with 6 basic emotions (Anger, Fear, Happiness, Sadness, Hatred, Wonder). [GitHub](https://github.com/nazaninsbr/Persian-Emotion-Detection)

- **ShortPersianEmo**: 5,472 short Persian texts labeled for 5 emotions (angry, sad, fear, happy, neutral). [GitHub](https://github.com/vkiani/ShortPersianEmo)
|
|
|
|
|
Datasets were standardized, cleaned (normalization with Parsivar, removal of URLs, mentions, emojis, etc.), deduplicated, and split into 90% train / 10% validation, with ArmanEmo held out for testing. |
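The deduplicate-and-split step described above can be sketched in pure Python. This is an illustrative sketch with toy records, not the exact pipeline code:

```python
import random

# Toy stand-in for the merged, cleaned corpus: (text, label) pairs
records = [
    ("text a", "ANGRY"), ("text b", "HAPPY"),
    ("text b", "HAPPY"),   # duplicate to be removed
    ("text c", "SAD"), ("text d", "FEAR"), ("text e", "HATE"),
    ("text f", "SURPRISE"), ("text g", "HAPPY"), ("text h", "SAD"),
    ("text i", "ANGRY"), ("text j", "FEAR"),
]

# Deduplicate on the text field, keeping the first occurrence
seen = set()
deduped = [r for r in records if r[0] not in seen and not seen.add(r[0])]

# Shuffle, then split 90% train / 10% validation
random.seed(0)
random.shuffle(deduped)
cut = int(0.9 * len(deduped))
train, val = deduped[:cut], deduped[cut:]
```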
|
|
### Training Procedure |
|
|
|
|
#### Preprocessing |
|
|
Text was normalized using Parsivar, with character mapping, diacritic removal, and stripping of URLs, mentions, hashtags, emojis, punctuation, digits, and extra spaces. Multi-label instances in EmoPars were converted to single-label via dominant label. |
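The stripping steps can be approximated with regular expressions. The sketch below is illustrative only: it omits the Parsivar normalization, character mapping, diacritic removal, and emoji handling, and is not the repository's actual cleaning code:

```python
import re

def clean_text(text: str) -> str:
    """Rough approximation of the cleaning steps: strip URLs, mentions,
    hashtags, punctuation, digits, and extra whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\S+", " ", text)                # mentions and hashtags
    text = re.sub(r"[^\w\s]|_", " ", text)              # punctuation
    text = re.sub(r"[0-9۰-۹]", " ", text)               # Latin and Persian digits
    return re.sub(r"\s+", " ", text).strip()            # collapse extra spaces

# "Hi @user, see this link: https://example.com !!"
print(clean_text("سلام @user این لینک را ببین: https://example.com !!"))
```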
|
|
#### Training Hyperparameters |
|
|
- **Training regime:** fp32 (assumed, not specified)

- **Batch size:** 32

- **Epochs:** 6

- **Learning rate:** 1e-5

- **Optimizer:** Not specified (Hugging Face Trainer default, AdamW)

- **Loss:** Weighted cross-entropy to handle class imbalance

- **Early stopping:** After 2 epochs without validation loss improvement
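A common way to derive the per-class weights for a weighted cross-entropy loss is inverse class frequency. The exact weighting scheme used for this model is not specified; the following is a minimal sketch of the inverse-frequency variant:

```python
from collections import Counter

labels = ["HAPPY", "HAPPY", "HAPPY", "SAD", "SAD", "FEAR"]  # toy label list

counts = Counter(labels)
n, k = len(labels), len(counts)

# Inverse-frequency weights: the rarest class (FEAR) gets the largest weight
weights = {label: n / (k * c) for label, c in counts.items()}
```

These weights would then be passed to the loss function (e.g. `torch.nn.CrossEntropyLoss(weight=...)`) so that errors on underrepresented classes are penalized more heavily.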
|
|
|
|
|
## Evaluation |
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
#### Testing Data |
|
|
|
|
Held-out ArmanEmo test set. |
|
|
#### Factors |
|
|
|
|
Evaluation disaggregated by emotion classes (ANGRY, FEAR, HAPPY, HATE, SAD, SURPRISE). |
|
|
#### Metrics |
|
|
|
|
Accuracy (overall correct predictions), Macro F1-score (average F1 across classes, treating all equally), Precision, Recall, and Confusion Matrix. |
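For reference, accuracy and macro F1 can be computed from predictions as follows. This self-contained sketch mirrors what scikit-learn's `accuracy_score` and `f1_score(average="macro")` compute:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Per-class F1 averaged with equal weight for every class."""
    f1s = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["HAPPY", "SAD", "HAPPY", "ANGRY"]
y_pred = ["HAPPY", "SAD", "SAD", "ANGRY"]
print(accuracy(y_true, y_pred))  # 0.75
```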
|
|
### Results |
|
|
- Test Accuracy: 70.88% |
|
|
- Macro F1-Score: 66.35% |
|
|
|
|
|
Detailed per-class metrics and confusion matrix available in the repository. |
|
|
|
|
|
|
|
|
## Citation

**BibTeX:**
|
|
```
@misc{jafary2023persianemotion,
  author       = {Aref Jafary},
  title        = {Persian Emotion Classification with BERT},
  year         = {2023},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ArefJafary/Persian-Emotion-Classification-BERT}}
}
```
|
|
|
|
|
**APA:** |
|
|
Jafary, A. (2023). Persian Emotion Classification with BERT [Repository]. GitHub. https://github.com/ArefJafary/Persian-Emotion-Classification-BERT |
|
|
## Glossary |
|
|
|
|
- **ParsBERT**: A BERT model pre-trained on Persian text. |
|
|
- **Weighted Cross-Entropy**: Loss function that assigns higher weights to underrepresented classes. |
|
|
|
|
|
## Model Card Contact |
|
|
Contact via GitHub: https://github.com/ArefJafary |