	---
	license: apache-2.0
	datasets:
	- SayedShaun/sentigold
	language:
	- bn
	metrics:
	- accuracy
	- f1
	base_model:
	- csebuetnlp/banglabert
	pipeline_tag: text-classification
	tags:
	- sentiment-analysis
	- bengali
	- bangla
	- multilabel-classification
	library_name: transformers
	---

	# BanglaBERT Fine-tuned for Bangla Sentiment Analysis

	## Model Description

	This model is a fine-tuned version of [`csebuetnlp/banglabert`](https://huggingface.co/csebuetnlp/banglabert) on the [SentiGOLD](https://arxiv.org/pdf/2306.06147) dataset for 5-class sentiment analysis in Bengali. It classifies text into:
	1. 😠 Very Negative (SN)
	2. 😞 Negative (WN)
	3. 😐 Neutral (N)
	4. 😊 Positive (WP)
	5. 😍 Very Positive (SP)

	Key Features:
	- Built on BanglaBERT, an ELECTRA-based encoder pretrained on Bangla text
	- Handles both formal and informal Bengali text
	- Intended for social media posts, reviews, and customer feedback
	- Requires input text to be normalized with [Bangla Normalizer](https://github.com/csebuetnlp/normalizer) before tokenization

	## Intended Uses & Limitations

	### Primary Use
	- Sentiment analysis of Bengali text
	- Social media monitoring
	- Customer feedback analysis
	- Product review classification

	### Limitations
	- Performance may degrade on code-mixed text (Bengali-English)
	- May struggle with sarcasm and highly contextual expressions
	- Inputs are limited to 512 tokens, so the model is best suited to short and medium-length texts; longer texts must be truncated or split
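
The 512-token limit above can be worked around by splitting a long input into overlapping windows and classifying each window separately. The helper below is a minimal sketch of that idea, not part of this model's code: `chunk_token_ids`, its default `max_len=510` (512 minus two special tokens, an assumption about the tokenizer), and the `overlap` value are all illustrative. How the per-window predictions are combined (for example, by averaging probabilities) is left open.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], max_len: int = 510, overlap: int = 64) -> List[List[int]]:
    """Split a token-id sequence into overlapping windows of at most max_len ids.

    max_len=510 assumes two positions are reserved for special tokens
    (hypothetical default; adjust to the tokenizer actually used).
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + max_len]))
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
    return chunks


# Toy example with small numbers so the windows are easy to inspect.
windows = chunk_token_ids(list(range(10)), max_len=4, overlap=1)
print(windows)
```

Each window shares `overlap` ids with the previous one, which reduces the chance that a sentiment-bearing phrase is cut in half at a boundary.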

	## Training Data

	The model was fine-tuned on SentiGOLD, which its authors describe as the largest gold-standard Bangla sentiment analysis dataset:

	| Feature | Value |
	| --- | --- |
	| Total Samples | 70,000 |
	| Domains Covered | 30+ |
	| Source Diversity | Social media, news, blogs, reviews |
	| Class Distribution | Balanced across 5 classes |
	| Annotation Quality | Fleiss' kappa = 0.88 |

	## Training Procedure

	### Hyperparameters

	| Parameter | Value |
	| --- | --- |
	| Learning Rate | 2e-5 → 1.05e-6 |
	| Batch Size | 48 |
	| Epochs | 5 |
	| Optimizer | AdamW |
	| Scheduler | ReduceLROnPlateau |
	| Weight Decay | 0.01 |
	| Gradient Accumulation | 4 steps |
	| Warmup Ratio | 5% |
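
With gradient accumulation, the number of samples behind each optimizer update is larger than the listed batch size. The arithmetic below is a small sketch that assumes the batch size of 48 is per step on a single device (the card does not say); the function names are illustrative.

```python
import math


def effective_batch_size(batch_size: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer update."""
    return batch_size * accumulation_steps * num_devices


def updates_per_epoch(num_train_samples: int, samples_per_update: int) -> int:
    """Optimizer updates needed to see every training sample once."""
    return math.ceil(num_train_samples / samples_per_update)


samples_per_update = effective_batch_size(batch_size=48, accumulation_steps=4)
print(samples_per_update)  # 192

# Upper bound only: 70,000 is the full dataset size; the size of the
# training split is not stated in this card.
print(updates_per_epoch(70_000, samples_per_update))  # 365
```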

	### Techniques

	* Class-weighted loss to handle class imbalance
	* Early stopping (patience=3)
	* Mixed precision (FP16) training
	* Gradient checkpointing
	* Text normalization using Bangla Normalizer
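
Two of the techniques listed above can be illustrated in plain Python. This is a sketch under stated assumptions, not the training code used for this model: the card does not say how the class weights were computed, so inverse-frequency weighting is shown as one common choice, and the `EarlyStopping` class is a hypothetical helper that assumes a higher validation score is better.

```python
from collections import Counter
from typing import List, Optional, Sequence


def inverse_frequency_weights(labels: Sequence[int], num_classes: int) -> List[float]:
    """Weight each class by N / (num_classes * count); balanced data gives 1.0 everywhere."""
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for cls in range(num_classes):
        if counts[cls] == 0:
            raise ValueError(f"class {cls} has no samples")
        weights.append(total / (num_classes * counts[cls]))
    return weights


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best: Optional[float] = None
        self.bad_epochs = 0

    def step(self, score: float) -> bool:
        """Record a validation score (higher is better); return True when training should stop."""
        if self.best is None or score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy labels: class 1 is rarer, so it receives the larger weight.
print(inverse_frequency_weights([0, 0, 0, 1], num_classes=2))

# Made-up validation scores for illustration only.
stopper = EarlyStopping(patience=3)
print([stopper.step(s) for s in [0.60, 0.65, 0.64, 0.63, 0.62]])
```

In a PyTorch training loop, weights like these would typically be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.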

	## Evaluation Results

	### Validation Performance

	| Epoch | F1 (Macro) | Accuracy | Very Neg F1 | Neg F1 | Neu F1 | Pos F1 | Very Pos F1 |
	| --- | --- | --- | --- | --- | --- | --- | --- |
	| 1 | 0.6334 | 0.6331 | 0.6789 | 0.5834 | 0.6407 | 0.5635 | 0.7004 |
	| 5 | 0.6537 | 0.6551 | 0.7081 | 0.6157 | 0.6421 | 0.5789 | 0.7236 |
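
Macro F1 is the unweighted mean of the per-class F1 scores, so each of the five classes counts equally regardless of its size. The short check below recomputes the macro column from the per-class values in the table above; the variable and function names are illustrative.

```python
# Per-class F1 from the validation table, in the order:
# Very Neg, Neg, Neu, Pos, Very Pos.
per_class_f1 = {
    1: [0.6789, 0.5834, 0.6407, 0.5635, 0.7004],
    5: [0.7081, 0.6157, 0.6421, 0.5789, 0.7236],
}


def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(scores) / len(scores)


for epoch, scores in per_class_f1.items():
    print(epoch, round(macro_f1(scores), 4))
# 1 0.6334
# 5 0.6537
```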

	### Final Test Performance

	| Metric | Score |
	| --- | --- |
	| Macro F1 | 0.6660 |
	| Accuracy | 0.6671 |

	## How to Use

	### Direct Inference

	```python
	from transformers import pipeline
	from normalizer import normalize

	# Load model
	classifier = pipeline(
	    "text-classification",
	    model="ahs95/banglabert-sentiment-analysis",
	    tokenizer="ahs95/banglabert-sentiment-analysis",
	)

	# Prepare text
	text = "আপনার পণ্যটি অসাধারণ! আমি খুবই সন্তুষ্ট।"
	normalized_text = normalize(text) # Important for BanglaBERT

	# Classify
	result = classifier(normalized_text)
	print(f"Sentiment: {result[0]['label']} (Confidence: {result[0]['score']:.2f})")
	```

	### Advanced Usage
	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch
	from normalizer import normalize

	# Load model and tokenizer
	model_name = "ahs95/banglabert-sentiment-analysis"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Prepare inputs
	texts = [
	    "সেবা খুব খারাপ ছিল। আমি কখনো ফিরে আসব না।",
	    "পণ্যটির গুণগত মান মোটামুটি ভাল",
	]
	normalized_texts = [normalize(t) for t in texts]

	# Tokenize and predict (no gradients needed for inference)
	inputs = tokenizer(normalized_texts, padding=True, truncation=True, return_tensors="pt")
	with torch.no_grad():
	    outputs = model(**inputs)
	    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)

	# Get predictions; the list order assumes class indices 0-4 run from Very Negative to Very Positive
	sentiment_labels = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
	predictions = [sentiment_labels[i] for i in probabilities.argmax(dim=1).tolist()]

	for text, pred in zip(texts, predictions):
	    print(f"Text: {text}\nPredicted Sentiment: {pred}\n")
	```

	## Ethical Considerations
	- Bias: While SentiGOLD reduces bias through synthetic data, real-world validation is recommended
	- Use Cases: Suitable for:
	  * Product feedback analysis
	  * Social media monitoring
	  * Market research
	- Avoid: Use in critical decision systems without human oversight
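
One simple way to keep a human in the loop is to flag predictions the model is unsure about. The function below is a hedged sketch of that idea: `needs_human_review` and both thresholds are hypothetical and would need tuning on real data. It takes the five class probabilities for a single text, such as one row of a softmax output.

```python
from typing import List


def needs_human_review(
    probabilities: List[float],
    min_confidence: float = 0.6,  # hypothetical threshold, tune on held-out data
    min_margin: float = 0.15,     # hypothetical threshold, tune on held-out data
) -> bool:
    """Flag a prediction when the top class is weak or too close to the runner-up."""
    if len(probabilities) < 2:
        raise ValueError("expected probabilities for at least two classes")
    ranked = sorted(probabilities, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    return top < min_confidence or (top - runner_up) < min_margin


# Made-up probability vectors for illustration only.
print(needs_human_review([0.01, 0.02, 0.02, 0.05, 0.90]))  # False: confident prediction
print(needs_human_review([0.05, 0.05, 0.10, 0.30, 0.50]))  # True: top class below 0.6
```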

	## Citation
	If you use this model, please cite:

	```bibtex
	@misc{banglabert-sentiment,
	  author = {Arshadul Hoque},
	  title = {Fine-tuned BanglaBERT for Bengali Sentiment Analysis},
	  year = {2025},
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/ahs95/banglabert-sentiment-analysis}}
	}
	```

	## Contact
	For questions and support: ahsbd95@gmail.com